Open access

Introductory Chapter: Current State and Achievements of Data Augmentation

Written By

Robertas Damaševičius

Published: 29 May 2024

DOI: 10.5772/intechopen.112284

From the Edited Volume

Deep Learning - Recent Findings and Research

Edited by Manuel Domínguez-Morales, Javier Civit-Masot, Luis Muñoz-Saavedra and Robertas Damaševičius

1. Introduction

Artificial intelligence (AI) models play a growing role in biomedical imaging and health services. However, developing AI systems for clinical decision support in real-life settings presents several challenges [1]. One of these is the scarcity of data, particularly in domains such as healthcare, where data are inherently limited and imbalanced. Datasets can also be inaccessible because of privacy concerns or a lack of data-sharing incentives [2]. Data augmentation, particularly through generative models, has emerged as a significant approach to addressing these challenges. It generates synthetic data, thereby expanding the dataset available for training AI models. This not only enhances the performance of these models but also enables their application in data-scarce scenarios. Data augmentation thus increases the diversity of the training data without the need to acquire additional data. Padding, cropping, and horizontal flipping are standard data augmentation approaches used to train large neural networks [3].

In the domain of AI and particularly in image processing, data augmentation plays a crucial role. It not only helps in preventing overfitting but also provides a means to enhance the performance of deep learning models [4]. In image processing, data augmentation can generate visually diverse images that can improve the robustness of models to new, unseen data [5].

This chapter aims to present an overview of the current state and achievements of data augmentation, discussing its impact, challenges, and limitations, and exploring future emerging trends in the field.

2. Current state of data augmentation

2.1 Formal definition

The objective of data augmentation is to create a diverse set of transformed data points that can help enhance the performance of machine learning models. Data augmentation can be formally defined as a process that generates a set of transformed data points from an original dataset. Let us denote the original dataset as $D = \{x_1, x_2, \ldots, x_n\}$, where $x_i$ is a data point. The data augmentation process can be represented as a function $f: X \to X'$, where $X$ is the space of original data points and $X'$ is the space of augmented data points. For each data point $x_i \in D$, the data augmentation function $f$ generates a set of transformed data points $D_i' = \{x_{i1}', x_{i2}', \ldots, x_{im}'\}$, where $x_{ij}' = f(x_i)$ and $m$ is the number of augmented data points generated from $x_i$. The augmented dataset $D'$ is the union of all $D_i'$, i.e., $D' = \bigcup_{i=1}^{n} D_i'$. This process can be represented as follows:

$$D' = \bigcup_{i=1}^{n} \{ f(x_i) \mid x_i \in D \} \tag{1}$$
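
The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `augment_dataset` and `jitter` are hypothetical names, and the transformation is assumed to be stochastic, so that the $m$ calls for one data point return different results.

```python
import random

def augment_dataset(dataset, transform, m, seed=0):
    """Return the augmented dataset D' as a flat list.

    dataset   : list of data points x_1 .. x_n
    transform : callable f(x, rng) -> x', one random transformation
    m         : number of augmented points generated per original point
    """
    rng = random.Random(seed)
    augmented = []
    for x in dataset:
        # D_i' = {x_i1', ..., x_im'}: m independent draws of f(x_i)
        augmented.extend(transform(x, rng) for _ in range(m))
    return augmented

def jitter(x, rng, scale=0.1):
    """Hypothetical transformation: add small uniform noise to a scalar."""
    return x + rng.uniform(-scale, scale)

D = [1.0, 2.0, 3.0]
D_aug = augment_dataset(D, jitter, m=4)  # n * m = 12 augmented points
```

In practice the original points are usually kept alongside the augmented ones, and the transformation is often applied on the fly during training rather than materialized in advance.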

2.2 Standard techniques in image processing

Rotation is a common data augmentation technique in image processing. It involves rotating the image by a certain angle. This can help to make the model invariant to the orientation of the object in the image. The rotation operation can be described as:

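$$R(x, \theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} x \tag{2}$$

where $x$ is the initial image, represented by its pixel coordinates, and $\theta$ is the rotation angle.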
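
As a minimal sketch of the rotation operation, the NumPy function below rotates a grayscale image about its centre. The function name `rotate_image`, the choice of the image centre as the pivot, zero filling, and nearest-neighbour sampling are assumptions made for illustration; they are not prescribed by the text.

```python
import numpy as np

def rotate_image(img, theta):
    """Rotate a 2D array by theta radians about its centre (nearest neighbour)."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    x0, y0 = xs - cx, ys - cy
    c, s = np.cos(theta), np.sin(theta)
    # Inverse mapping: each output pixel looks up its source pixel by
    # applying the transposed (inverse) rotation matrix to its coordinates.
    src_x = np.rint(c * x0 + s * y0 + cx).astype(int)
    src_y = np.rint(-s * x0 + c * y0 + cy).astype(int)
    inside = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros_like(img)  # pixels that map outside the image stay 0
    out[inside] = img[src_y[inside], src_x[inside]]
    return out

image = np.arange(9).reshape(3, 3)
rotated = rotate_image(image, np.pi / 2)
```

Because the row index grows downwards in array coordinates, a positive angle appears as a clockwise rotation when the array is displayed as an image.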

Scaling involves resizing the image, either by making it larger (zooming in) or smaller (zooming out). This can help to make the model invariant to the size of the object in the image. The scaling operation can be described as:

$$S(x, s) = s\,x \tag{3}$$

where $x$ is the initial image and $s$ is the scaling factor.
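
Scaling can be illustrated with a nearest-neighbour resize. This is a sketch only: `scale_image` is a hypothetical helper, and nearest-neighbour sampling is assumed for simplicity, whereas image libraries typically offer bilinear or bicubic interpolation as well.

```python
import numpy as np

def scale_image(img, s):
    """Resize a 2D image by factor s using nearest-neighbour sampling."""
    h, w = img.shape[:2]
    new_h = max(1, int(round(h * s)))
    new_w = max(1, int(round(w * s)))
    # Each output index maps back to the input index it was scaled from.
    rows = np.minimum((np.arange(new_h) / s).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / s).astype(int), w - 1)
    return img[rows][:, cols]

small = np.array([[1, 2], [3, 4]])
zoomed = scale_image(small, 2.0)  # 4 x 4 image, each pixel repeated twice
```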

Cropping involves cutting out a portion of the image. This can help to make the model focus on the important parts of the image. The cropping operation can be described as:

$$C(x, r) = x[r] \tag{4}$$

where $x$ is the initial image and $r$ is the region to be cropped.

Flipping involves reversing the image either horizontally or vertically. This can help to make the model invariant to the orientation of the object in the image. The flipping operation can be described as:

$$F(x) = x' \tag{5}$$

where $x$ is the initial image and $x'$ is the flipped image.
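
Cropping and flipping reduce to array slicing. The sketch below assumes images stored as NumPy arrays with shape (height, width) or (height, width, channels); the helper names `random_crop` and `flip` are illustrative.

```python
import numpy as np

def random_crop(img, crop_h, crop_w, rng):
    """Cut a randomly placed crop_h x crop_w region out of the image."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - crop_h + 1)    # upper bound is exclusive
    left = rng.integers(0, w - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w]

def flip(img, horizontal=True):
    """Mirror the image left-right (horizontal) or top-bottom (vertical)."""
    return img[:, ::-1] if horizontal else img[::-1, :]

rng = np.random.default_rng(0)
image = np.arange(36).reshape(6, 6)
patch = random_crop(image, 4, 4, rng)
mirrored = flip(image)
```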

Adjusting the brightness and contrast of the image can help to make the model invariant to different lighting conditions. The brightness and contrast adjustment operation can be represented as:

$$B(x, b) = x + b \tag{6}$$
$$C(x, c) = c\,x \tag{7}$$

where $x$ is the initial image, $b$ is the brightness adjustment, and $c$ is the contrast adjustment.
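
A direct NumPy transcription of the brightness and contrast operations is shown below. Clipping the result to the range [0, 255] is an added assumption for 8-bit images, and the function names are illustrative.

```python
import numpy as np

def adjust_brightness(img, b):
    """Add the offset b to every pixel, then clip to the 8-bit range."""
    return np.clip(img.astype(np.float32) + b, 0.0, 255.0)

def adjust_contrast(img, c):
    """Multiply every pixel by the gain c, then clip to the 8-bit range."""
    return np.clip(c * img.astype(np.float32), 0.0, 255.0)

image = np.array([[0, 100, 250]], dtype=np.uint8)
brighter = adjust_brightness(image, 10)
stronger = adjust_contrast(image, 2.0)
```

The cast to floating point before the arithmetic avoids the wrap-around that unsigned 8-bit integers would otherwise produce. Some implementations scale contrast around the mean intensity rather than around zero, which is a different operation from the plain gain used here.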

2.3 Advanced techniques in image processing

Elastic deformations are a type of data augmentation technique that involves applying random, smooth transformations to an image. This can help to make the model invariant to small local deformations in the object’s shape. The elastic deformation operation can be described as:

$$E(x, \alpha, \sigma) = x + \alpha\,(G_\sigma * r) \tag{8}$$

where $x$ is the initial image, $r$ is a random field for each pixel, $G_\sigma$ is a Gaussian filter with standard deviation $\sigma$, and $\alpha$ is a scaling factor.
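
The sketch below shows one possible reading of the elastic deformation operation using NumPy only. It assumes that the smoothed random field acts as a displacement of the pixel sampling positions (one field per axis), that borders are handled by edge replication, and that nearest-neighbour sampling is acceptable; all function names are hypothetical.

```python
import numpy as np

def gaussian_kernel1d(sigma):
    """Normalized 1D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma))
    t = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (t / sigma) ** 2)
    return k / k.sum()

def smooth(field, sigma):
    """Separable Gaussian blur of a 2D field with edge padding."""
    k = gaussian_kernel1d(sigma)
    r = len(k) // 2

    def blur1d(v):
        return np.convolve(np.pad(v, r, mode="edge"), k, mode="valid")

    field = np.apply_along_axis(blur1d, 0, field)
    return np.apply_along_axis(blur1d, 1, field)

def elastic_deform(img, alpha, sigma, rng):
    """Warp a 2D image with a smooth random displacement field."""
    h, w = img.shape
    # r: uniform random field in [-1, 1], smoothed by G_sigma, scaled by alpha
    dx = alpha * smooth(rng.uniform(-1, 1, (h, w)), sigma)
    dy = alpha * smooth(rng.uniform(-1, 1, (h, w)), sigma)
    ys, xs = np.mgrid[0:h, 0:w]
    # Displaced sampling positions, rounded to the nearest valid pixel
    src_x = np.clip(np.rint(xs + dx), 0, w - 1).astype(int)
    src_y = np.clip(np.rint(ys + dy), 0, h - 1).astype(int)
    return img[src_y, src_x]

rng = np.random.default_rng(0)
image = np.arange(32 * 32).reshape(32, 32)
warped = elastic_deform(image, alpha=8.0, sigma=2.0, rng=rng)
```

A production implementation would more likely rely on `scipy.ndimage.gaussian_filter` and `scipy.ndimage.map_coordinates`, which provide the smoothing and interpolated resampling directly.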

Random erasing is a data augmentation technique that involves randomly selecting a rectangle in the image and replacing its pixels with random values. This can help to increase the robustness of the model to occlusion. The random erasing operation can be described as:

$$RE(x, r) = x \odot (1 - m) + m \odot v \tag{9}$$

where $x$ is the initial image, $r$ is the region to be erased, $m$ is a mask that is 1 in the region $r$ and 0 elsewhere, $v$ is a random value, and $\odot$ denotes element-wise multiplication.
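
Random erasing can be sketched as follows for an 8-bit grayscale image. The size limit `max_frac`, the per-pixel random fill values, and the function name are assumptions chosen for illustration.

```python
import numpy as np

def random_erase(img, rng, max_frac=0.5):
    """Replace a random rectangle of a 2D uint8 image with random values.

    Returns the erased image and the boolean mask m of the erased region.
    """
    h, w = img.shape
    eh = rng.integers(1, max(2, int(h * max_frac) + 1))  # rectangle height
    ew = rng.integers(1, max(2, int(w * max_frac) + 1))  # rectangle width
    top = rng.integers(0, h - eh + 1)
    left = rng.integers(0, w - ew + 1)
    m = np.zeros((h, w), dtype=bool)
    m[top:top + eh, left:left + ew] = True
    v = rng.integers(0, 256, size=img.shape, dtype=np.uint8)
    # Equivalent to x * (1 - m) + m * v for a 0/1 mask
    return np.where(m, v, img), m

rng = np.random.default_rng(0)
image = np.full((8, 8), 128, dtype=np.uint8)
erased, mask = random_erase(image, rng)
```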

Mixup and CutMix are data augmentation techniques that involve creating new training examples by taking a convex combination of two training examples. For Mixup, this is done pixel-wise, while for CutMix, a region from one image is cut and pasted onto another image. The Mixup and CutMix operations can be described as:

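$$\mathrm{Mixup}(x_1, x_2, \lambda) = \lambda x_1 + (1 - \lambda)\,x_2 \tag{10}$$
$$\mathrm{CutMix}(x_1, x_2, r) = x_1 \odot (1 - m) + m \odot x_2 \tag{11}$$

where $x_1$ and $x_2$ are original images, $\lambda$ is a random value between 0 and 1, $r$ is the region to be cut and pasted, $m$ is a mask that is 1 in region $r$ and 0 elsewhere, and $\odot$ denotes element-wise multiplication.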
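
Both operations are short to write down in NumPy. The sketch below covers the image part only; the function names and the explicit rectangle arguments are assumptions, and in practice the labels of the two examples are usually combined as well.

```python
import numpy as np

def mixup(x1, x2, lam):
    """Pixel-wise convex combination of two images of equal shape."""
    return lam * x1 + (1.0 - lam) * x2

def cutmix(x1, x2, top, left, height, width):
    """Paste a rectangle of x2 onto x1; also return the mask m."""
    m = np.zeros(x1.shape[:2], dtype=bool)
    m[top:top + height, left:left + width] = True
    return np.where(m, x2, x1), m

a = np.zeros((4, 4))
b = np.ones((4, 4))
blended = mixup(a, b, lam=0.25)
pasted, mask = cutmix(a, b, top=1, left=1, height=2, width=2)
```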

2.4 Generative models for data augmentation

2.4.1 Generative adversarial networks (GANs)

Generative adversarial networks (GANs) are a class of generative models proposed in Ref. [6]. A GAN consists of a generator and a discriminator, two neural networks that are trained simultaneously. The generator creates new data instances, whereas the discriminator decides whether each sample comes from the real training dataset. The generator is trained to create data that the discriminator cannot distinguish from real data, while the discriminator is trained to get better at separating real data from generated data. This is formally represented by the following minimax game between generator $G$ and discriminator $D$:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{12}$$

where: $x$ is a real data instance, $z$ is a noise vector sampled from a prior noise distribution $p_z(z)$, $G(z)$ is the data instance generated by the generator, $D(x)$ is the probability that the real data instance $x$ is a real data sample (according to the discriminator), $D(G(z))$ is the probability that a fake data instance is a real data instance (according to the discriminator).
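
To make the value function concrete, the sketch below estimates it from batches of discriminator outputs. It assumes the outputs are already available as probabilities; the function name `gan_value` and the small constant `eps` that guards the logarithm are illustrative choices, and no networks are trained here.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of V(D, G).

    d_real : discriminator outputs D(x) on real samples
    d_fake : discriminator outputs D(G(z)) on generated samples
    """
    d_real = np.clip(d_real, eps, 1.0 - eps)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
undecided = gan_value(np.full(8, 0.5), np.full(8, 0.5))
# A confident, correct discriminator pushes the value towards 0.
confident = gan_value(np.full(8, 0.9), np.full(8, 0.1))
```

The discriminator tries to increase this quantity and the generator tries to decrease it. When the discriminator outputs 0.5 for every sample, the estimate equals $-\log 4$.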

In the context of data augmentation, GANs can be employed to synthesize additional training data that is similar to the original training data. This can be particularly useful when the original dataset is small or imbalanced [7, 8].

2.4.2 Variational autoencoders (VAEs)

Variational autoencoders (VAEs) are generative models that have been used for data augmentation. They consist of an encoder and a decoder, where the encoder maps input data to a latent space and the decoder maps from the latent space back to the original data space. The key difference between VAEs and traditional autoencoders is that the latent space of VAEs is continuous, which is achieved by having the encoder output two vectors, one of means and one of standard deviations, instead of a single encoding vector [9].

The formal mathematical definition of VAEs involves several components. Let $X$ be the training data where each $x_i$ represents a data point. The encoder learns a mapping $Q_\theta(z \mid X)$ from an input $x_i$ to the mean $\mu(x_i)$ and covariance $\sigma^2(x_i)$ vectors of the latent variables, where the latent variable $z$ follows a normal distribution $\mathcal{N}(0, 1)$. The decoder learns a mapping $P_\phi(X \mid z)$ from the latent representation $z$ to the distribution parameters of $X$. The objective function for training a VAE is given by:

$$\mathcal{L}(\theta, \phi; X) = \mathbb{E}_{z \sim Q_\theta(z \mid X)}[\log P_\phi(X \mid z)] - D_{\mathrm{KL}}(Q_\theta(z \mid X) \,\|\, P(z)) \tag{13}$$

where $\theta$, $\phi$ are the encoder and decoder parameters, and $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence between two probability distributions [10].
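
Two ingredients of this objective can be written in closed form when the encoder distribution is a diagonal Gaussian and the prior is a standard normal. The sketch below assumes the encoder outputs the mean and the log-variance of each latent dimension, which is a common parameterization but not one stated in the text; the function names are illustrative.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps drawn from N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, diag(sigma^2)) to N(0, I), per sample."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)

rng = np.random.default_rng(0)
mu = np.array([[1.0, 0.0]])       # batch of one, two latent dimensions
log_var = np.zeros((1, 2))        # unit variance
z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

New samples for augmentation are then obtained by drawing $z$ from the prior and passing it through the trained decoder, which is not shown here.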

Variational autoencoders (VAEs) have been used in various applications for data augmentation. For example, in the field of audio processing, VAEs have been used to augment data by learning to synthesize new audio data instances from the latent space [11]. In the field of medical imaging, VAEs have been used to generate synthetic medical images for training diagnostic models [12].

2.5 Data augmentation methods for natural language processing

Data augmentation (DA) is also applied in natural language processing (NLP), where it has improved results in several tasks [13]. The primary goal of DA approaches in NLP is to increase the variety of the training data, allowing the model to generalize to previously unseen test data. Based on the diversity of the augmented data, DA approaches in NLP may be divided into three types: paraphrasing, noising, and sampling.

  • Paraphrasing is generating new sentences that convey the same meaning as the original sentence but with different wording. This can be achieved through methods such as back-translation, where a sentence is translated to a different language and then translated back to the original language.

  • Noising involves introducing noise into the original sentences, such as replacing, deleting, or inserting words. This can help the model become more robust to noise in real-world data.

  • Sampling involves generating new sentences by sampling from a language model trained on the original data. This helps to enhance the diversity of the training data.
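
Of the three families above, noising is the simplest to illustrate. The sketch below implements random word deletion and random word swapping on a whitespace-tokenized sentence using only the standard library; the function names and the fallback that keeps one word when everything would be deleted are assumptions.

```python
import random

def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    # Never return an empty sentence: fall back to one random token.
    return kept if kept else [rng.choice(tokens)]

def random_swap(tokens, n_swaps, rng):
    """Swap the positions of two randomly chosen tokens, n_swaps times."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

rng = random.Random(0)
sentence = "data augmentation increases the variety of training data".split()
shorter = random_deletion(sentence, p=0.2, rng=rng)
shuffled = random_swap(sentence, n_swaps=2, rng=rng)
```

Paraphrasing by back-translation and sampling from a language model both require trained models and are therefore not sketched here.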

In addition to these methods, there are also task-specific DA methods for NLP tasks such as named entity recognition (NER) and sentence classification. For example, the unified medical language system-easy data augmentation (UMLS-EDA) method extends the easy data augmentation (EDA) approach for biomedical NER by including knowledge from the Unified Medical Language System (UMLS), which can boost model performance for both NER and sentence classification [14].

2.6 Data augmentation methods for audio data

Data augmentation for audio data is a crucial technique to enhance the results of machine learning models, especially when the available dataset is limited. Several methods have been proposed for this purpose:

  • Time stretching and pitch shifting: These methods change either the speed or the pitch of the audio without affecting the other. Time stretching makes the audio longer or shorter, while pitch shifting raises or lowers the pitch of the audio. These methods can help the model become more robust to variations in speed and pitch in the audio data.

  • Adding noise: This involves adding background noise to the audio data. The noise can be white noise, pink noise, or real-world noise. This can help the model become more robust to noise in real-world audio data.

  • Time shifting: This involves shifting the audio in time by a certain amount. This can help the model become more robust to slight variations in timing.

  • Frequency masking: This involves masking certain frequency bands in the audio data. This can help the model become more robust to variations in frequency content.

  • Mixup: This involves creating new audio data by taking a weighted sum of two audio clips. The weights are chosen randomly for each pair of audio clips. This can help enhance the variability of the training data.

These methods can be used individually or in combination to augment audio data. However, the effectiveness of these methods can vary depending on the specific characteristics of the audio data and the task at hand [15].
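
Three of the listed methods, adding noise, time shifting, and frequency masking, can be sketched with NumPy alone. The sketch assumes a mono waveform stored as a 1D float array and a spectrogram stored as a 2D array with frequency bins along the first axis; white Gaussian noise at a target signal-to-noise ratio, a circular shift, and zero filling of the masked band are illustrative choices.

```python
import numpy as np

def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise at the requested signal-to-noise ratio (dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

def time_shift(signal, shift):
    """Circularly shift the waveform by `shift` samples."""
    return np.roll(signal, shift)

def frequency_mask(spec, f0, width):
    """Zero out `width` consecutive frequency bins starting at bin f0."""
    out = spec.copy()
    out[f0:f0 + width, :] = 0.0
    return out

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
tone = np.sin(2 * np.pi * 440.0 * t)           # one second of a 440 Hz tone
noisy = add_noise(tone, snr_db=10.0, rng=rng)
shifted = time_shift(tone, 800)
masked = frequency_mask(np.ones((64, 100)), f0=10, width=8)
```

Time stretching and pitch shifting need a phase vocoder or a resampling step and are usually taken from an audio library rather than written by hand.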

2.7 Data augmentation methods for tabular data

Data augmentation for tabular data is a demanding task due to the structured nature of the data and the potential relationships between different columns. Several methods have been proposed to tackle this issue:

  • Resampling, most commonly used for imbalanced tabular data, creates a new training set by resampling the existing data. The two main types of resampling are oversampling, where additional instances of the minority class are duplicated or synthesized, and undersampling, where instances from the majority class are removed.

  • Synthetic minority over-sampling technique (SMOTE) is an oversampling approach that creates artificial samples of the minority class by interpolating between existing minority samples [16]. It can help improve the performance of models on imbalanced tabular data.

  • Feature perturbation involves adding noise to the existing data to create new samples. The noise can be added to all features or only to a subset of features. This method can help improve the model’s robustness to noise in the data.

These methods can be used individually or in combination to augment tabular data. The effectiveness of these methods can vary depending on the specific features of the data and the task at hand.
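
The interpolation idea behind SMOTE can be sketched as follows. This is a simplified illustration and not the reference algorithm of [16]: it assumes purely numerical features, uses a brute-force nearest-neighbour search, and the function name `smote_like` is hypothetical.

```python
import numpy as np

def smote_like(X_min, n_new, k, rng):
    """Create n_new synthetic minority samples by interpolation.

    X_min : array of shape (n, d) with minority-class samples, n > k
    k     : number of nearest neighbours considered for each sample
    """
    n = len(X_min)
    # Pairwise Euclidean distances; a sample is not its own neighbour.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)                  # samples to start from
    nb = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                           # position on the segment
    return X_min[base] + gap * (X_min[nb] - X_min[base])

rng = np.random.default_rng(0)
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(minority, n_new=10, k=2, rng=rng)
```

Each synthetic point lies on the line segment between a minority sample and one of its nearest minority neighbours. Categorical columns would need separate handling, which this sketch does not attempt.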

3. Discussion

3.1 Achievements and impact of data augmentation

Data augmentation has been instrumental in improving the effectiveness of machine learning models, particularly in image recognition tasks. The use of data augmentation in deep learning applications in medical image analysis, for instance, has led to better results in diagnostic accuracy [17, 18, 19].

One of the key benefits of data augmentation is its ability to mitigate overfitting. Overfitting occurs when a model fits the training data so closely that it performs poorly on unseen data. By creating a more diverse training dataset, data augmentation can help to prevent overfitting, thereby improving the model’s ability to generalize to new data.

Data augmentation is particularly useful in scenarios where data is scarce. Here, data augmentation can be used to artificially increase the size of the dataset. This has been demonstrated in various fields, including medical imaging and plant stress phenotyping, where data augmentation has enabled the development of robust machine learning models despite the limited availability of data [20].

3.2 Challenges and limitations of data augmentation

One challenge of data augmentation is preserving the original data distribution. When augmenting data, it is crucial to ensure that the transformed data points do not significantly alter the overall distribution of the data. If the data augmentation process introduces bias or changes the data distribution, it can lead to misleading results and poor model performance [21].

Data augmentation can be particularly challenging for complex data types. For instance, in the context of network data, standard data augmentation techniques may not be applicable due to the complex interdependencies between data points. New methods and techniques are needed to effectively augment such complex data types [22].

Data augmentation can be computationally expensive, especially for large datasets and complex augmentation operations. This can increase the time and computational resources required for training machine learning models. However, the benefits of data augmentation in terms of improved model performance often outweigh these additional costs [23].

3.3 Future directions and emerging trends

Automated data augmentation, where the augmentation process is guided by machine learning algorithms, is a promising future direction. This approach can potentially generate more effective and diverse augmented data by learning the optimal transformations for each data point. Recent advancements in GANs have shown potential in this area, for example in digital pathology [24].

As machine learning applications become more specialized, the need for domain-specific augmentation techniques is becoming increasingly apparent. For instance, in materials science, AI and machine learning are being used to discover new materials and understand their properties. In this context, domain-specific augmentation techniques can help generate more realistic and diverse data for training machine learning models [25].

The integration of data augmentation with active learning is another emerging trend. Active learning is a strategy where the model actively selects the most informative data points for training. By combining this with data augmentation, it may be possible to create a more efficient and effective learning process. This approach has shown promise in the field of cardiovascular imaging [26] and plant root segmentation [27].

4. Conclusion

In this chapter, we have explored the current state, achievements, challenges, and future directions of data augmentation in the context of artificial intelligence and image processing. We have seen how data augmentation techniques have evolved over time and how they have contributed to significant improvements in model performance, particularly in image recognition tasks. We have also discussed the challenges associated with data augmentation, including preserving the original data distribution, augmenting complex data types, and managing computational costs. Looking forward, we have highlighted emerging trends such as automated data augmentation, domain-specific augmentation techniques, and the integration of data augmentation with active learning.

As we move forward, it is clear that data augmentation will continue to play a crucial role in the field of artificial intelligence and image processing. The development of more sophisticated and automated data augmentation techniques, as well as the integration of data augmentation with other machine learning strategies, opens up exciting new possibilities for future research. However, it is also important to be mindful of the challenges and limitations of data augmentation, and to continue to develop strategies to address these issues. As we continue to push the boundaries of what is possible with artificial intelligence and image processing, data augmentation will undoubtedly remain a key tool in our arsenal.

References

  1. Castiglioni I, Rundo L, Codari M, Di Leo G, Salvatore C, Interlenghi M, et al. AI applications to medical images: From machine learning to deep learning. European Journal of Medical Physics. 2021;83:9-24
  2. Williams B, Borroni D, Liu R, Zhao Y, Zhang J, Lim JWC, et al. An artificial intelligence-based deep learning algorithm for the diagnosis of diabetic neuropathy using corneal confocal microscopy: A development and validation study. Diabetologia. 2019;63(2):419-430
  3. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. Curran Associates, Incorporated; 2012. pp. 1097-1105
  4. Shorten C, Khoshgoftaar TM. A survey on image data augmentation for deep learning. Journal of Big Data. 2019;6(1):60
  5. Perez L, Wang J. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621. 2017
  6. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative adversarial nets. In: Advances in Neural Information Processing Systems. Curran Associates, Incorporated; 2014. pp. 2672-2680
  7. Antoniou A, Storkey A, Edwards H. Augmenting image classifiers using data augmentation generative adversarial networks. In: Artificial Neural Networks and Machine Learning–ICANN, 2018. Springer; 2018. pp. 570-582
  8. Weng Y, Zhou H. Data augmentation computing model based on generative adversarial network. IEEE Access. 2019;7:75819-75828
  9. Kingma DP, Welling M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. 2013
  10. Doersch C. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908. 2016
  11. Pascual S, Bonafonte A, Serrà J. MelNet: A generative model for audio in the frequency domain. arXiv preprint arXiv:1906.01083. 2019
  12. Goodfellow I, Bengio Y, Courville A. Deep Learning (Adaptive Computation and Machine Learning Series). MIT Press; 2016
  13. Li B, Hou Y, Che W. Data augmentation approaches in natural language processing: A survey. Artificial Intelligence Open. 2022;3:71-90
  14. Kang T, Perotte AJ, Tang Y, Ta CN, Weng C. UMLS-based data augmentation for natural language processing of clinical research literature. Journal of the American Medical Informatics Association. 2020;28(4):812-823
  15. Abayomi-Alli OO, Damaševičius R, Qazi A, Adedoyin-Olowe M, Misra S. Data augmentation and deep learning methods in sound classification: A systematic review. Electronics. 2022;11(22)
  16. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research. 2002;16:321-357
  17. Ker J, Lin W, Rao JP, Lim TC. Deep learning applications in medical image analysis. IEEE Access. 2018;6:9375-9389
  18. Abayomi-Alli OO, Damaševičius R, Misra S, Maskeliūnas R, Abayomi-Alli A. Malignant skin melanoma detection using image augmentation by oversampling in nonlinear lower-dimensional embedding manifold. Turkish Journal of Electrical Engineering and Computer Sciences. 2021;29:2600-2614
  19. Oyewola DO, Dada EG, Misra S, Damaševičius R. A novel data augmentation convolutional neural network for detecting malaria parasite in blood smear images. Applied Artificial Intelligence. 2022;36(1)
  20. Singh AK, Ganapathysubramanian B, Sarkar S. Deep learning for plant stress phenotyping: Trends and future perspectives. Trends in Plant Science. 2018;23(10):883-898
  21. Talebi H, Milanfar P. NIMA: Neural image assessment. IEEE Transactions on Image Processing. 2018;27(8):3998-4011
  22. Cranmer SJ, Leifeld P, McClurg SD, Rolfe M. Navigating the range of statistical tools for inferential network analysis. American Journal of Political Science. 2017;61(1):237-251
  23. Lin Y, Li H, Xiao X, Zhang L, Wang K, Gregersen H, et al. Daism-dnnxmbd: Highly accurate cell type proportion estimation with in silico data augmentation and deep neural networks. Patterns. 2022;3(3):100440
  24. Tschuchnig ME, Oostingh GJ, Gadermayr M. Generative adversarial networks in digital pathology: A survey on trends and future potential. Patterns. 2020;1(5):100089
  25. Li J, Lim K, Yang H, Ren Z, Raghavan S, Chen P-Y, et al. AI applications through the whole life cycle of material discovery. Matter. 2020;3(2):371-407
  26. O’Regan D. Putting machine learning into motion: Applications in cardiovascular imaging. Clinical Radiology. 2020;75(1):5-13
  27. Smith A, Petersen JK, Selvan R, Rasmussen CR. Segmentation of roots in soil with U-Net. Plant Methods. 2020;16(1):1-14
