Open access peer-reviewed chapter - ONLINE FIRST

Automated Data-Driven and Stochastic Imputation Method

Written By

Michal Koren and Or Peretz

Submitted: 23 December 2023 Reviewed: 03 January 2024 Published: 06 February 2024

DOI: 10.5772/intechopen.1004160


From the Edited Volume

Recent Advances in Association Rule Mining and Data Mining [Working Title]

Dr. Jainath Yadav


Abstract

Machine learning algorithms may have difficulty processing datasets with missing values. Identifying and replacing missing values is therefore necessary before modeling predictions on such data. However, studies have shown that uniformly compensating for missing values in a dataset is impossible, and no single imputation technique fits all datasets. This study presents an Automated, data-driven Stochastic Imputer (ASI). The proposed ASI is based on automated distribution detection and estimation of the imputed value by stochastic sampling with controlled error probability. The significant advantage of this method is the use of a data-driven approximation ratio for the stochastic sampling, which bounds the samples to be, at most, one standard deviation from the mean of the original distribution. The ASI performance was compared to traditional deterministic and stochastic imputation algorithms over seven datasets. The results showed that ASI succeeded in 61.5% of cases compared to the other algorithms, and its performance can be improved further by controlling the sampling error probability.

Keywords

  • imputation techniques
  • machine learning
  • multidimensional data
  • stochastic processes
  • artificial intelligence

1. Introduction

Currently, most Artificial Intelligence (AI) research and development is conducted in industry and academia, where vast quantities of information are generated daily. Furthermore, developing and deploying machine learning systems requires large amounts of data. Developing AI procedures is challenging because complete datasets (without missing values) are typically required. Missing values can either be removed from a dataset or imputed (i.e., replaced with comparable values), depending on the feature type [1]. There are established rules for deciding which strategy to use for different types of missing values. Researchers have found that there is no single way to compensate for missing values in datasets; moreover, specific datasets and types of missing data may respond well to specific strategies but not to others [2, 3]. Missing data can distort analysis by inflating or deflating the weight of specific categories when a large amount of data is missing. In particular, missing data can affect machine learning (ML) algorithms and result in inaccurate and biased analyses.

Generally, numeric features are imputed using mean imputation (to avoid outliers and keep the data centralized), the median, or the mode. For categorical features, a “missing” category is often added, or the most frequently occurring value is assigned [4, 5, 6]. ML algorithms and advanced statistical methods for completing missing values have been added to these conventional methods. KNN imputers [7, 8, 9] are commonly used to find neighbors and compute values based on feature means. Another imputation technique is the Multivariate Imputation by Chained Equations (MICE) algorithm [10]. MICE is a robust, informative method that imputes missing data in a dataset through an iterative series of predictive models [11, 12]. When more complex models are considered, models with a higher learning capacity can also complete missing values. Many deep learning architectures have been created to solve imputation challenges using the latent representation of the data in the hidden layers [13], combinations of ML techniques [14], autoencoders [15], and multilayer perceptrons [16]. In addition to the techniques discussed above, stochastic imputation techniques, such as regression models [17, 18, 19], add random errors to the target value and examine whether they correlate with the independent variable. Extrapolation and interpolation can be used to estimate unknown values by extending a known sequence of values or facts [20, 21]. Finally, hot-deck techniques substitute observations from “similar” units for each missing value [22, 23].
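As a point of reference for the conventional techniques above, the following minimal Python sketch shows how such imputers are typically applied with scikit-learn [49]; the toy array and parameter choices are assumptions made for illustration and are not part of the original study.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

# Toy matrix with missing entries (assumed example data).
X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 4.0, 9.0],
              [np.nan, 3.0, 8.0]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(X)    # mean imputation
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)          # nearest-neighbor imputation
mice_filled = IterativeImputer(random_state=0).fit_transform(X)  # MICE-style chained equations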

A new field of research is emerging in addition to machine learning techniques: automated machine learning (AutoML). ML problems can be solved more efficiently and effectively with AutoML [24, 25], which applies search and optimization methods to find the best hyperparameters for a given problem [26]. Even with automatic methods, imputing the missing values in each feature is challenging [27], for example, due to high anomaly rates or data without clear patterns. Because not all datasets share the same distributions and dependencies, human intervention is often necessary to determine the appropriate value to impute.

The main challenge in the imputation process arises in datasets with noisy distributions and anomalous values [28, 29]. When the distributions are inconsistent and contain many abnormal values, imputation becomes difficult because all the statistical measures are biased by the behavior of the noise [30, 31, 32]. As part of stochastic processes and the use of randomization in algorithms, a probability of failure must be allowed. Since randomized decisions can change the entire process, analyzing an algorithm's worst-case probability is essential [33]. Moreover, it is useful to know the probability that the received answer is incorrect and to handle it accordingly [34].

Three inequalities bound how far a random variable can be from its mean: (1) the Markov inequality, which bounds the probability using the expected value of the variable [35]; (2) the Chebyshev inequality, which bounds the probability using the variance [36, 37]; and (3) the Chernoff inequality, which applies to a binomial distribution (i.e., a sum of Bernoulli variables) and bounds the probability by an exponential function [38, 39]. With the help of these inequalities, it is possible to develop randomized algorithms and control the result obtained by analyzing the correctness of the algorithm [40].

At times, complete independence can imply exponentially better bounds. Compared to Chebyshev, which uses only pairwise independence, Chernoff gives a tighter bound on the deviation probability since it exploits complete independence between the random variables. Although the Chernoff bound requires stronger assumptions, it is generally tighter than the Markov and Chebyshev inequalities, as summarized below.
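For reference, the three bounds can be stated as follows (written here in a standard textbook form; the Chernoff form matches the one used in Section 2.1). The Markov inequality holds for X ≥ 0 and any a > 0; the Chebyshev inequality holds for any random variable with finite variance and a > 0; and the Chernoff bound is stated for the mean of t independent [0, 1]-valued variables with expectation μ and 0 < ε < 1:

P(X \geq a) \leq \frac{E[X]}{a}

P\left(\left|X - E[X]\right| \geq a\right) \leq \frac{\mathrm{Var}(X)}{a^{2}}

P\left(\left|\bar{X} - \mu\right| \geq \epsilon\mu\right) \leq 2e^{-\mu t \epsilon^{2}/3}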

This study presents the Automated Stochastic Imputer (ASI), a new automated, data-driven, and stochastic method for imputing numeric values in a dataset. It is based on the automated detection of the distribution and estimation of the imputed value by sampling with controlled error probability. The innovation of this method is the use of a data-driven approximation ratio based on the distribution measures and the determination of the number of samples required for an accurate estimation. Section 2 presents the method, its implementation, correctness, and computational complexity analysis. Section 3 describes the empirical study and a detailed scenario in which the results of the proposed method are compared to existing imputation algorithms, with the results presented in Section 4. Last, Section 5 discusses the main conclusions and suggestions for future directions.


2. Automated Stochastic Imputer

This section presents and describes the ASI method. First, the definitions and distributions used in this study will be presented, followed by the method and description of its implementation. Last, the correctness of the method and the computational complexity will be detailed.

2.1 Definitions

The following are the definitions and methods employed in this study:

  1. The Chernoff bound [40] was used to bound the error probability. Let X_1, …, X_t be independent and identically distributed random variables taking values between zero and one, such that E[X_i] = μ for all 1 ≤ i ≤ t. According to the Chernoff bound, for any 0 < ε < 1, the probability that the sample mean (i.e., (1/t)∑_{i=1}^{t} X_i) is ε-far from the distribution mean is:

    P\left(\left|\frac{1}{t}\sum_{i=1}^{t} X_i - \mu\right| \geq \epsilon\mu\right) \leq 2e^{-\mu t \epsilon^{2}/3} \qquad (E1)

  2. Let D be a distribution, as listed in Appendix A. To determine whether each feature is indeed close enough to the distribution D, a Kolmogorov-Smirnov test is performed [41]. The Kolmogorov-Smirnov test is a general nonparametric method that compares the empirical cumulative distribution function of a sample with a postulated theoretical distribution. The test is performed for each feature, and its interpretation is as follows:

H0: The sample follows a D distribution.

H1: The sample does not follow a D distribution.
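A minimal sketch of this detection step using scipy.stats is given below; the candidate list, the fitting strategy, and the tie-breaking by largest p-value are assumptions made for illustration and are not the chapter's reference implementation.

from scipy import stats

CANDIDATES = [stats.norm, stats.expon, stats.gamma, stats.beta]  # assumed subset of Appendix A

def detect_distribution(values, alpha=0.05):
    """Return (distribution, fitted parameters) of the best non-rejected candidate, or None."""
    best, best_p = None, alpha
    for dist in CANDIDATES:
        try:
            params = dist.fit(values)
        except Exception:      # some candidates cannot be fitted to every sample
            continue
        p_value = stats.kstest(values, dist.name, args=params).pvalue
        if p_value >= best_p:  # keep the candidate the KS test rejects least
            best, best_p = (dist, params), p_value
    return best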

2.2 Algorithm and implementation

Let F = {F_1, …, F_m} be the set of features in a dataset D, and let δ be the desired imputer failure probability, where 0 < δ < 1. First, the method iterates over each feature f ∈ F and normalizes its values into the range between zero and one by min-max normalization; that is, each value v of f is transformed to:

\frac{v - \min_f}{\max_f - \min_f} \qquad (E2)

Next, the method estimates the distribution of f using the Kolmogorov-Smirnov test [41]. Let D_f be the estimated distribution with a confidence level of α, and let f_μ and f_σ be the feature's expected value and standard deviation, respectively. The method defines q, the number of samples required to estimate a missing value with probability 1 − δ, as:

q = \frac{3 f_{\mu}}{f_{\sigma}^{2}} \ln\frac{2}{\delta} \qquad (E3)

Let V_f be the set of missing values in f. The method iterates over each u_i ∈ V_f and samples q independent and identically distributed values from D_f, denoted x_1, …, x_q. Last, the method imputes each missing value with the average of these samples. The procedure is summarized below, followed by a short implementation sketch.

Automated Stochastic Imputer (D, α, δ)

  1. F ← {F_1, …, F_m}, the set of features in D

  2. For each f ∈ F:

    • D_f ← distribution of f with confidence level α

    • q ← (3 f_μ / f_σ²) · ln(2/δ)

    • V_f ← set of missing values in f

    • For each u_i ∈ V_f:

      1. Sample x_1, …, x_q from D_f

      2. u_i ← (1/q) ∑_{j=1}^{q} x_j

  3. Return D
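The following is a minimal Python sketch of this procedure under the definitions above. It is not the authors' reference implementation: the candidate distributions, the de-normalization step, and the guards for degenerate features are assumptions made for illustration.

import numpy as np
import pandas as pd
from scipy import stats

CANDIDATES = (stats.norm, stats.expon, stats.gamma, stats.beta)  # assumed subset of Appendix A

def detect_distribution(values, alpha=0.05):
    # Kolmogorov-Smirnov detection step, as sketched in Section 2.1.
    fits = []
    for dist in CANDIDATES:
        try:
            params = dist.fit(values)
            fits.append((stats.kstest(values, dist.name, args=params).pvalue, dist, params))
        except Exception:
            continue
    fits = [f for f in fits if f[0] >= alpha]
    return max(fits, key=lambda f: f[0])[1:] if fits else None

def asi_impute(df, alpha=0.05, delta=0.05):
    out = df.copy()
    for col in df.columns:
        observed = df[col].dropna().to_numpy(dtype=float)
        missing_idx = df.index[df[col].isna()]
        if len(missing_idx) == 0 or len(observed) < 2:
            continue
        lo, hi = observed.min(), observed.max()
        if hi == lo:                                  # constant feature: impute the constant
            out.loc[missing_idx, col] = lo
            continue
        norm = (observed - lo) / (hi - lo)            # min-max normalization (E2)
        detected = detect_distribution(norm, alpha)
        if detected is None:                          # no distribution passed the KS test
            continue
        dist, params = detected
        mu, sigma = norm.mean(), norm.std()
        if sigma == 0:
            out.loc[missing_idx, col] = mu * (hi - lo) + lo
            continue
        q = int(np.ceil(3 * mu / sigma**2 * np.log(2 / delta)))  # number of samples (E3, rounded up)
        for idx in missing_idx:
            samples = dist.rvs(*params, size=q)       # q i.i.d. samples from the detected D_f
            out.loc[idx, col] = samples.mean() * (hi - lo) + lo  # impute the average, de-normalized
    return out

For example, asi_impute(df, alpha=0.05, delta=0.01) would require more samples per missing value (a larger q) and, per the analysis below, a smaller chance that an imputed value deviates from the feature mean by more than one standard deviation.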

2.3 Correctness

Let D be a dataset consisting of m features, denoted F = {F_1, …, F_m}. Initially, the method iterates over the feature values and normalizes them by min-max normalization; each value v of a feature f ∈ F is transformed to:

\frac{v - \min_f}{\max_f - \min_f} \qquad (E4)

Therefore, the dataset is normalized into the range between zero and one while preserving the original data distribution. For v_a, v_b ∈ f such that v_a ≤ v_b, it holds that:

v_a - \min_f \leq v_b - \min_f \qquad (E5)
\frac{v_a - \min_f}{\max_f - \min_f} \leq \frac{v_b - \min_f}{\max_f - \min_f} \qquad (E6)

Let D_f be the distribution detected by the Kolmogorov-Smirnov test with a confidence level of α, and let x_1, …, x_q ∼ D_f be independent and identically distributed random variables such that E[x_i] = μ_x and Var[x_i] = σ_x² for all 1 ≤ i ≤ q. The average of all samples, denoted X̄, is defined as:

\bar{X} = \frac{1}{q}\sum_{i=1}^{q} x_i \qquad (E7)

Let μ and σ² be the expected value and the variance of X̄, respectively. The method uses X̄ as a single imputed value. By the linearity of expectation, the expected value of X̄ is:

\mu = E[\bar{X}] = E\left[\frac{1}{q}\sum_{i=1}^{q} x_i\right] = \frac{1}{q} E\left[\sum_{i=1}^{q} x_i\right] = \frac{q \cdot E[x_i]}{q} = \mu_x \qquad (E8)

Given that all x_i are independently and identically distributed, it holds that:

\forall i \neq j:\; \mathrm{Var}(x_i + x_j) = \mathrm{Var}(x_i) + \mathrm{Var}(x_j) \qquad (E9)
\sigma^{2} = \mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{q}\sum_{i=1}^{q} x_i\right) = \frac{1}{q^{2}}\sum_{i=1}^{q} \mathrm{Var}(x_i) = \frac{q \cdot \mathrm{Var}(x_i)}{q^{2}} = \frac{\sigma_x^{2}}{q} \qquad (E10)

The probability of the sample mean being ε-far from the expected value of the feature can be bounded by the Chernoff inequality:

P\left(\left|\frac{1}{q}\sum_{i=1}^{q} X_i - \mu\right| \geq \epsilon\mu\right) \leq 2e^{-\mu q \epsilon^{2}/3} \qquad (E11)

Let ε = σ_x/μ_x be the approximation factor. Bounding the failure probability by δ yields a lower bound on the number of samples required:

P\left(\left|\frac{1}{q}\sum_{i=1}^{q} X_i - \mu_x\right| \geq \frac{\sigma_x}{\mu_x}\mu_x\right) \leq 2e^{-q \mu_x (\sigma_x/\mu_x)^{2}/3} < \delta \qquad (E12)

Thus, the probability of X̄ being ε-far from the expected value is exactly the probability of it deviating from the mean by one standard deviation or more:

P\left(\left|\frac{1}{q}\sum_{i=1}^{q} X_i - \mu_x\right| \geq \frac{\sigma_x}{\mu_x}\mu_x\right) = P\left(\left|\frac{1}{q}\sum_{i=1}^{q} X_i - \mu_x\right| \geq \sigma_x\right) \leq 2e^{-q \mu_x (\sigma_x/\mu_x)^{2}/3} < \delta \qquad (E13)

Applying algebraic simplification:

q > \frac{-3\mu_x \ln(\delta/2)}{\sigma_x^{2}} = \frac{3\mu_x}{\sigma_x^{2}} \ln\frac{2}{\delta} \qquad (E14)

Since all the dataset features are normalized into the range [0, 1], their expected values and standard deviations are non-negative. Given that 0 < δ < 1, the number of samples (i.e., q) is a positive number:

0 < \delta < 1 \;\Rightarrow\; \ln(\delta/2) < 0 \qquad (E15)
q > \frac{3\mu_x}{\sigma_x^{2}} \ln\frac{2}{\delta} > 0 \qquad (E16)

Therefore, it can be concluded that the total number of samples required to estimate an imputed value depends on the ratio between the expectation and the distribution variance.
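As an illustrative calculation with assumed values (not taken from the study's datasets): for a normalized feature with μ_x = 0.5, σ_x = 0.25, and δ = 0.05,

q > \frac{3 \cdot 0.5}{0.25^{2}} \ln\frac{2}{0.05} = 24 \ln 40 \approx 88.5,

so at least 89 samples would be drawn for each missing value of that feature.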

Last, for a more accurate characterization of the imputed values, the following shows that they will be normally distributed around the feature mean. Let V_f = {m_1, …, m_M} be the set of M missing values in feature f. For all 1 ≤ i ≤ q, it holds that x_i ∼ D_f, and by the imputation technique:

\forall\, 1 \leq j \leq M:\; m_j = \frac{1}{q}\sum_{i=1}^{q} x_i \qquad (E17)

As M → ∞, by the central limit theorem, the imputed values will be normally distributed:

\bar{M} = \frac{1}{M}\sum_{j=1}^{M} m_j \;\rightarrow\; N\!\left(\mu_x, \frac{\sigma_x^{2}}{M}\right) \qquad (E18)

2.4 Computational complexity

Let m be the number of features in a dataset with n records. For each feature with at least one missing value, the method iterates over the missing values; hence, the maximum number of such features is m, and the number of missing values in each feature is at most n. Let q be the number of samples required to estimate a missing value in each iteration. Sampling and averaging these samples requires 2q operations, so the total time complexity is:

O(m \cdot n \cdot 2q) \approx O(mnq) \qquad (E19)

In a standard dataset, there are fewer features than records, i.e., m ≤ n, and the running time can therefore be bounded by:

O(mnq) \leq O(n^{2}q) \qquad (E20)

3. Empirical study

3.1 Data sources

The ASI method was examined over the following seven datasets:

  1. Fetal health [42]—a medical dataset that aims to prevent child and maternal mortality. The dataset consists of 2126 observations over 21 features and a target variable with three values: normal, suspect, or pathological.

  2. Students’ academic success [43]—this dataset contains information about students with different undergraduate degrees from higher education institutions. It includes information on 4424 students’ enrollment and academic performance over 36 features. The target variable has three options: graduate, dropout, or enrolled.

  3. Heart failure [44]—a dataset with a total of 299 patients who experienced heart failure. It contains 12 clinical features and a Boolean target variable representing whether the patient had heart failure.

  4. Diabetes [45]—a dataset of 768 diabetic and non-diabetic women. It consists of six medical features, two demographic variables and a Boolean target variable.

  5. Haberman’s survival [46]—a dataset collected from 1958 to 1970 that includes details on 306 survivors after breast cancer surgery. The dataset and study were conducted at the University of Chicago’s Billings Hospital.

  6. Breast cancer [47]—a dataset with a total of 30 features extracted from digitized images of diagnosed breast cancer of 568 patients. The target variable indicates whether the “mass” diagnosis was benign or malignant.

  7. Bank [48]—over 45,210 observations of direct marketing campaigns by Portuguese banks are included in the dataset, including seven numerical (continuous) and nine categorical features. The Boolean target variable indicates whether the client subscribed to a term deposit or not.

3.2 Experiment procedure

For each dataset described in Section 3.1, the following parameters were defined and used:

  1. At the beginning of each experiment, 25% of the values were randomly chosen, removed, and stored alongside the original dataset so that the imputation techniques could be compared against the true values.

  2. For the distribution estimation using the Kolmogorov-Smirnov test, a total of 100 samples with a significance level of α=0.05 were used.

  3. The results of the proposed ASI method were compared to existing algorithms implemented in scikit-learn [49], as follows (a minimal setup sketch follows this list):

    1. KNN imputer—imputation for completing missing values using nearest neighbors’ algorithm.

    2. Iterative imputer—multivariate imputation by chained equations (MICE) [10]. As this imputer has stochastic phases, a random state was set equal to zero in all runs presented in this study.

  4. For each comparison, the values δ = 0.1, 0.05, and 0.01 were examined as the probability of failure in estimation.
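A minimal sketch of this setup is given below; the masking helper, the per-cell win-counting rule, and the synthetic example frame are assumptions about how the comparison could be carried out, not the study's published code.

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer

rng = np.random.default_rng(0)

def mask_values(df, frac=0.25):
    """Randomly remove frac of the cells; return the masked frame and the Boolean mask."""
    mask = rng.random(df.shape) < frac
    return df.mask(mask), mask

def count_wins(original, imputations, mask):
    """For every held-out cell, credit the imputer with the smallest absolute error (assumed rule)."""
    wins = {name: 0 for name in imputations}
    truth = original.to_numpy()
    rows, cols = np.where(mask)
    for r, c in zip(rows, cols):
        errors = {name: abs(imp[r, c] - truth[r, c]) for name, imp in imputations.items()}
        wins[min(errors, key=errors.get)] += 1
    return wins

# Example usage with the scikit-learn baselines used in this study
# (the asi_impute sketch from Section 2.2 could be added as a third entry).
df = pd.DataFrame(rng.normal(size=(299, 12)), columns=[f"F{i+1}" for i in range(12)])
masked, mask = mask_values(df)
imputations = {
    "KNN": KNNImputer().fit_transform(masked),
    "MICE": IterativeImputer(random_state=0).fit_transform(masked),
}
print(count_wins(df, imputations, mask))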

3.3 Use case: the heart failure dataset

To simplify the demonstration, the heart failure dataset, consisting of 12 features over 299 observations, was chosen. The features are denoted F_i for 1 ≤ i ≤ 12. A broader comparison, including higher-dimensional datasets, can be found in Section 4. For the scenario demonstration, 56 values (25%) were randomly removed from five arbitrary features, and their original values were stored for comparison with the imputed ones. Table 1 presents the initial number of missing values in each feature and the distribution automatically detected by the Kolmogorov-Smirnov test.

Feature | Missing values | Expectation | Standard deviation | Distribution
F1 | 15 | 0.438 | 0.497 | Normal
F4 | 10 | 38.059 | 11.725 | Exponential
F5 | 10 | 0.416 | 0.494 | Exponential
F6 | 11 | 1.372 | 0.995 | Gamma
F8 | 13 | 129.272 | 78.065 | Beta

Table 1.

Number of missing values in each feature and its detected distribution.

Once the distributions were detected, the method calculated the number of samples required for each feature (i.e., q) from the ratio between its expected value and variance. Table 2 compares the values of q and the performance of the ASI method, the KNN imputer, and the MICE imputer. Since the proposed method takes as input an upper bound on the probability of failure (i.e., δ), the values δ = 0.1, 0.05, and 0.01 were compared. As the probability of error decreased, the number of samples required to estimate a missing value increased, and the performance of the ASI method improved. For example, feature F5 yielded 40% success for a probability of 0.1 and 50% success for a probability of 0.05; when the probability of error was bounded by 1% (i.e., 0.01), the success rate increased to 60% compared to the other algorithms.

Feature | q | ASI | KNN | MICE
δ = 0.1
F1 | 8 | 11 (73%) | 1 (7%) | 3 (20%)
F4 | 8 | 5 (50%) | 2 (20%) | 3 (30%)
F5 | 30 | 4 (40%) | 4 (40%) | 2 (20%)
F6 | 13 | 7 (64%) | 1 (9%) | 3 (27%)
F8 | 15 | 5 (38%) | 3 (24%) | 5 (38%)
δ = 0.05
F1 | 10 | 12 (80%) | 1 (7%) | 2 (13%)
F4 | 10 | 6 (60%) | 2 (20%) | 2 (20%)
F5 | 36 | 5 (50%) | 3 (30%) | 2 (20%)
F6 | 16 | 7 (64%) | 2 (18%) | 2 (18%)
F8 | 19 | 6 (46%) | 3 (23%) | 4 (31%)
δ = 0.01
F1 | 15 | 12 (80%) | 1 (7%) | 2 (13%)
F4 | 14 | 7 (70%) | 1 (10%) | 2 (20%)
F5 | 52 | 6 (60%) | 2 (20%) | 2 (20%)
F6 | 22 | 7 (64%) | 2 (18%) | 2 (18%)
F8 | 27 | 10 (77%) | 1 (8%) | 2 (15%)

Table 2.

A comparison between the performance of the three imputation algorithms.

Notes. The data are presented as A (B), where A is the count of successes and B is the corresponding percentage. q is the number of samples required by ASI.

For each compared value of δ (i.e., the probability of a wrong estimation), the percentage of total successes of each imputation algorithm was calculated. The results are presented in Figure 1. The smaller the probability of error, the more samples from the distribution were required to estimate the imputed value; accordingly, the performance of the ASI algorithm increased. For a 10% chance of wrong estimation, the proposed method succeeded in 54% of cases, compared to 27% for MICE and 19% for KNN. When the chance of error was only 1%, the proposed method succeeded in 71% of cases, a significant increase compared to KNN's 12%.

Figure 1.

Performance of three imputation algorithms compared by failure probability.


4. Results

To evaluate the ASI method, seven datasets (for details, see Section 3.1) were compared using the parameters defined in Section 3.2. For each dataset, the three algorithms were evaluated and their performances compared. First, the comparison results and analysis are described; then, the sensitivity analysis of the proposed method is presented. Table 3 presents the total number of missing values for each dataset and the success percentages for each imputer.

Dataset | Missing values | ASI | KNN | MICE
Fetal health | 425 | 210 (49%) | 98 (23%) | 117 (28%)
Students | 94 | 49 (52%) | 27 (29%) | 18 (19%)
Diabetes | 152 | 84 (55%) | 42 (28%) | 26 (17%)
Heart failure | 59 | 42 (71%) | 7 (12%) | 10 (17%)
Haberman survival | 26 | 20 (77%) | 1 (4%) | 5 (19%)
Cancer | 113 | 78 (69%) | 16 (14%) | 19 (17%)
Bank | 13,276 | 7687 (58%) | 2276 (17%) | 3313 (25%)

Table 3.

Performance of the three imputation algorithms compared by dataset.

Notes. The data are presented as A (B), where A is the count of successes and B is the corresponding percentage. The comparison used δ = 0.05 for the ASI.

The ASI method achieved the highest success rate on all seven tested datasets compared to the KNN and MICE imputers. An extreme case was the Haberman survival dataset, for which a 77% success rate was recorded for the proposed method (i.e., 20 correct answers out of 26). Upon further analysis, this dataset consists of discrete-valued features; therefore, the suitability of the proposed method for discrete versus continuous values should be examined further in future studies.

To examine the error distributions, the errors obtained for each algorithm were calculated by subtracting the original value from the imputed value. These errors were then averaged for each dataset. Since this comparison was between the average values (and not distributions), polynomial interpolation was performed [50]. The results are shown in Figure 2.

Figure 2.

Polynomial interpolation of success rates of the three compared algorithms.

Two conclusions can be drawn from Figure 2. First, there is a high level of agreement between the three algorithms; that is, the ASI algorithm behaved similarly (in distribution) to the KNN and MICE algorithms. Second, the ASI algorithm produced a low percentage of all the errors examined compared to the others. However, the ASI algorithm is probabilistic and depends on input parameters that control the probability of failure; thus, further analysis and comparison of different probabilities is required.

4.1 Sensitivity analysis

This section compares the performance of the ASI algorithm for different values of the failure probability. The proposed algorithm was evaluated on each dataset with probabilities ranging from 0.1 to 0.9 in steps of 0.1. Then, the percentage of successful imputations (compared to the actual values) was calculated. The results are shown in Table 4. Since the goal was to examine the algorithm's behavior and compare the success percentages, polynomial interpolation was performed [50], as presented in Figure 3.

Dataset | δ = 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9
Fetal health | 35.4 | 34.2 | 32.8 | 32.5 | 33.7 | 32.1 | 33.2 | 31.6 | 28.3
Students | 40.2 | 37.1 | 40.1 | 39.1 | 37.2 | 42.3 | 37.1 | 40.2 | 39.4
Diabetes | 38.2 | 36.9 | 35.9 | 35.3 | 33.6 | 34.4 | 38.2 | 39.6 | 30.3
Heart failure | 43.1 | 42.7 | 46.5 | 41.3 | 42.1 | 39.6 | 39.1 | 39.4 | 39.9
Haberman survival | 70.1 | 85.2 | 77.3 | 78.4 | 75.2 | 75.1 | 71.6 | 69.6 | 63.9
Cancer | 74.3 | 75.2 | 73.8 | 73.4 | 74.1 | 71.6 | 70.4 | 70.9 | 67.2
Bank | 49.8 | 47.4 | 46.1 | 44.6 | 42.6 | 41.6 | 40.3 | 40.8 | 40.2

Table 4.

Performance of the ASI algorithm, compared by probability of failure.

Figure 3.

A comparison between success rates of ASI over different probabilities.

The sensitivity analysis of the ASI algorithm shows that each dataset behaved similarly across the different probabilities, indicating the consistency of the proposed method. It can also be seen that noisier results were obtained for extreme probabilities (low and high, e.g., 0.9 and 0.2) due to stronger constraints on the method's inputs. One non-intuitive result was obtained for a failure probability of 0.1, whose values appeared to be “outliers” from this pattern. This result might be due to the randomness of the algorithm, since the sampled data change with each probability according to the method. However, this specific result remains unusual because the other probabilities agreed on the general behavior of the proposed algorithm.


5. Discussion and conclusions

This study proposed a new automated, data-driven, and stochastic imputation method, ASI, to complete missing values in a dataset. ASI is based on automated distribution detection and estimation of the imputed value by sampling with controlled error probability. The study's innovation is the use of a data-driven approximation ratio based on the distribution measures and the determination of the number of samples required for an accurate estimation. Thus, the imputer bounds the distance between the imputed value and the original expectation by, at most, one standard deviation. The main conclusions are as follows:

  1. The ASI method succeeded in imputing the missing values in 61.5% of the cases, compared to the deterministic KNN and stochastic MICE algorithms. According to the analysis, there was a slight difference in the performance of the proposed algorithm on features with discrete versus continuous values.

  2. The number of samples required to estimate missing values increased as the error probability decreased. Thus, more samples yielded better estimation performance, although it increased the runtime complexity. For example, in the heart failure dataset (Section 3.3), the first feature required eight samples to yield a success rate of 73%, while the use of 15 samples improved the success rate to 80%.

  3. A sensitivity analysis of the ASI algorithm over different probabilities found consistency in its performance across the seven datasets. It can be concluded that the correctness analysis of the proposed method is tight and that the method provides an accurate imputation of missing values for a given distribution.

Results are known to be affected by data quality, especially when considering data imputation and stochastic models. Failures may result from a lack of data or from an incorrect adjustment of the method's parameters. Future studies should address two main issues raised by this study. First, the performance of ASI on continuous versus discrete random variables should be explored, for example, by examining different parameter values or by extending the distribution exploration. Second, this study assumed a confidence level of α = 0.05 for distribution testing using the Kolmogorov-Smirnov test; different significance levels may yield different detected distributions and, thus, different ASI results.


Conflict of interest

The authors declare no conflict of interest.


Appendix

See Table A1.

Type | Distribution | Parameters | Expectation | Variance
Discrete | Uniform | U(a, b) | (a + b)/2 | ((b - a + 1)² - 1)/12
Discrete | Binomial | Bin(n, p) | np | np(1 - p)
Discrete | Geometric | G(p) | 1/p | (1 - p)/p²
Discrete | Hypergeometric | HG(N, D, n) | nD/N | n(D/N)(1 - D/N)(N - n)/(N - 1)
Discrete | Poisson | Pois(λ) | λ | λ
Discrete | Negative Binomial | NB(n, p) | n(1 - p)/p | n(1 - p)/p²
Continuous | Uniform | Uc(a, b) | (a + b)/2 | (b - a)²/12
Continuous | Exponential | exp(λ) | 1/λ | 1/λ²
Continuous | Normal | N(μ, σ²) | μ | σ²
Continuous | Gamma | Gamma(n, λ) | n/λ | n/λ²
Continuous | Beta | Beta(a, b) | a/(a + b) | ab/((a + b + 1)(a + b)²)
Continuous | Chi-squared | χ²(k) | k | 2k
Continuous | F | F(m, n) | n/(n - 2) | 2n²(m + n - 2)/(m(n - 2)²(n - 4))
Continuous | T | T(m) | 0 | m/(m - 2)

Table A1.

List of candidate distributions used to determine each feature's distribution.

References

  1. 1. Donders ART, Van Der Heijden GJ, Stijnen T, Moons KG. Review: A gentle introduction to imputation of missing values. Journal of Clinical Epidemiology. 2010;59(10):1087-1091. DOI: 10.1016/j.jclinepi.2006.01.014
  2. 2. Newman DA. Missing data: Five practical guidelines. Organizational Research Methods. 2014;17(4):372-411. DOI: 10.1177/1094428114548590
  3. 3. Salgado CM, Azevedo C, Proença H, Vieira SM. Missing data. In: Secondary Analysis of Electronic Health Records. MIT Critical Data. Cham: Springer; 2016. pp. 143-162
  4. 4. Akande O, Li F, Reiter J. An empirical comparison of multiple imputation methods for categorical data. The American Statistician. 2017;71(2):162-170. DOI: 10.1080/00031305.2016.1277158
  5. 5. Finch WH. Imputation methods for missing categorical questionnaire data: A comparison of approaches. Journal of Data Science. 2010;8(3):361-378
  6. 6. Schuckers M, Lopez M, Macdonald B. Estimation of player aging curves using regression and imputation. Annals of Operations Research. 2023;325:681-699. DOI: 10.1007/s10479-022-05127-y
  7. 7. Koren M, Koren O, Peretz O. Weighted distance classification method based on data intelligence. Expert Systems. 2023;41(2):e13486. DOI: 10.1111/exsy.13486
  8. 8. Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, et al. Missing value estimation methods for DNA microarrays. Bioinformatics. 2001;17(6):520-525. DOI: 10.1093/bioinformatics/17.6.520
  9. 9. Zhang S. Nearest neighbor selection for iteratively kNN imputation. Journal of Systems and Software. 2012;85(11):2541-2552. DOI: 10.1016/j.jss.2012.05.073
  10. 10. van Buuren S, Groothuis-Oudshoorn K. Mice: Multivariate imputation by chained equations in R. Journal of Statistical Software. 2011;45:1-67. DOI: 10.18637/jss.v045.i03
  11. 11. Azur MJ, Stuart EA, Frangakis C, Leaf PJ. Multiple imputation by chained equations: What is it and how does it work? International Journal of Methods in Psychiatric Research. 2011;20(1):40-49. DOI: 10.1002/mpr.329
  12. 12. White IR, Royston P, Wood AM. Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine. 2011;30(4):377-399. DOI: 10.1002/sim.4067
  13. 13. Biessmann F, Rukat T, Schmidt P, Naidu P, Schelter S, Taptunov A, et al. DataWig: Missing value imputation for tables. Journal of Machine Learning Research. 2019;20(175):1-6
  14. 14. Phung S, Kumar A, Kim J. A deep learning technique for imputing missing healthcare data. In: 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC); 23-27 July 2019; Berlin, Germany. IEEE; 2019. pp. 6513-6516. DOI: 10.1109/EMBC.2019.8856760
  15. 15. Duan Y, Lv Y, Liu YL, Wang FY. An efficient realization of deep learning for traffic data imputation. Transportation Research Part C: Emerging Technologies. 2016;72:168-181. DOI: 10.1016/j.trc.2016.09.015
  16. 16. Lin WC, Tsai CF, Zhong JR. Deep learning for missing value imputation of continuous data and the effect of data discretization. Knowledge-Based Systems. 2022;239:Article 108079. DOI: 10.1016/j.knosys.2021.108079
  17. 17. Gold MS, Bentler PM. Treatments of missing data: A Monte Carlo comparison of RBHDI, iterative stochastic regression imputation, and expectation-maximization. Structural Equation Modeling. 2000;7(3):319-355. DOI: 10.1207/S15328007SEM0703_1
  18. 18. Juan AA, Keenan P, Martí R, McGarraghy S, Panadero J, Carroll P, et al. A review of the role of heuristics in stochastic optimisation: From metaheuristics to learnheuristics. Annals of Operations Research. 2023;320(2):831-861. DOI: 10.1007/s10479-021-04142-9
  19. 19. Shehadeh KS, Padman R. Stochastic optimization approaches for elective surgery scheduling with downstream capacity constraints: Models, challenges, and opportunities. Computers & Operations Research. 2022;137:105523. DOI: 10.1016/j.cor.2021.105523
  20. 20. Raja K, Arasu GT, Nair CS. Imputation framework for missing values. International Journal of Computer Trends and Technology. 2012;3(2):215-219
  21. 21. Soeffker N, Ulmer MW, Mattfeld DC. Stochastic dynamic vehicle routing in the light of prescriptive analytics: A review. European Journal of Operational Research. 2022;298(3):801-820. DOI: 10.1016/j.ejor.2021.07.014
  22. 22. Andridge RR, Little RJ. A review of hot deck imputation for survey non-response. International Statistical Review. 2010;78(1):40-64. DOI: 10.1111/j.1751-5823.2010.00103.x
  23. 23. Kim JK, Fuller W. Fractional hot deck imputation. Biometrika. 2004;91(3):559-578. DOI: 10.1093/biomet/91.3.559
  24. 24. Wu Y, Xi X, He J. AFGSL: Automatic feature generation based on graph structure learning. Knowledge-Based Systems. 2022;238:Article 107835. DOI: 10.1016/j.knosys.2021.107835
  25. 25. Yao Q, Wang M, Chen Y, Dai W, Li Y-F, Wei-Wei T, et al. Taking human out of learning applications: A survey on automated machine learning. 2018;arXiv:1810.13306. DOI: 10.48550/arXiv.1810.13306
  26. 26. He X, Zhao K, Chu X. AutoML: A survey of the state-of-the-art. Knowledge-Based Systems. 2021;212:Article 106622. DOI: 10.1016/j.knosys.2020.106622
  27. 27. Krishnan S, Franklin MJ, Goldberg K, Wu E. Boostclean: Automated error detection and repair for machine learning. 2017;arXiv:1711.01299. DOI: 10.48550/arXiv.1711.01299
  28. 28. Kenward MG, Carpenter J. Multiple imputation: Current perspectives. Statistical Methods in Medical Research. 2007;16(3):199-218. DOI: 10.1177/0962280206075304
  29. 29. Schafer JL. Multiple imputation: A primer. Statistical Methods in Medical Research. 1999;8(1):3-15. DOI: 10.1177/096228029900800102
  30. 30. Carpenter JR, Bartlett JW, Morris TP, Wood AM, Quartagno M, Kenward MG. Multiple Imputation and its Application. 2nd ed. Hoboken: John Wiley & Sons; 2023. 444 p
  31. 31. Koren O, Koren M, Peretz O. A procedure for anomaly detection and analysis. Engineering Applications of Artificial Intelligence. 2023;117:105503. DOI: 10.1016/j.engappai.2022.105503
  32. 32. Ozkan H, Pelvan OS, Kozat SS. Data imputation through the identification of local anomalies. IEEE Transactions on Neural Networks and Learning Systems. 2015;26(10):2381-2395. DOI: 10.1109/TNNLS.2014.2382606
  33. 33. Motwani R, Raghavan P. Randomized algorithms. ACM Computing Surveys. 1996;28(1):33-37
  34. 34. Karp RM. An introduction to randomized algorithms. Discrete Applied Mathematics. 1991;34(1–3):165-201. DOI: 10.1016/0166-218X(91)90086-C
  35. 35. Cohen JE. Markov's inequality and Chebyshev's inequality for tail probabilities: A sharper image. The American Statistician. 2015;69(1):5-7. DOI: 10.1080/00031305.2014.975842
  36. 36. Navarro J. A very simple proof of the multivariate Chebyshev's inequality. Communications in Statistics - Theory and Methods. 2016;45(12):3458-3463. DOI: 10.1080/03610926.2013.873135
  37. 37. Ogasawara H. The multivariate Markov and multiple Chebyshev inequalities. Communications in Statistics - Theory and Methods. 2020;49(2):441-453. DOI: 10.1080/03610926.2018.1543772
  38. 38. Klaassen CA. On an inequality of Chernoff. Annals of Probability. 1985;13(3):966-974
  39. 39. Rao BP, Sreehari M. Chernoff-type inequality and variance bounds. Journal of Statistical Planning and Inference. 1997;63(2):325-335. DOI: 10.1016/S0378-3758(97)00031-1
  40. 40. Hwang CR, Sheu SJ. A generalization of Chernoff inequality via stochastic analysis. Probability Theory and Related Fields. 1987;75(1):149-157. DOI: 10.1007/BF00320088
  41. 41. Massey FJ Jr. The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association. 1951;46(253):68-78. DOI: 10.1080/01621459.1951.10500769
  42. 42. Dua D, Graff C. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science; 2019
  43. 43. Realinho V, Martins MV, Machado J, Baptista LMT. Predict students’ dropout and academic success data set. UCI Machine Learning Repository. 2021. DOI: 10.24432/C5MC89
  44. 44. Chicco D, Giuseppe J. Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making. 2020;20(1):1-16. DOI: 10.1186/s12911-020-1023-5. [Article ID: 16]
  45. 45. Kahn M. Diabetes data set. UCI Machine Learning Repository. 1994. DOI: 10.24432/C5T59G. Available from: https://archive.ics.uci.edu/ml/datasets.php
  46. 46. Haberman S. Haberman's survival data set. UCI Machine Learning Repository. 1999. DOI: 10.24432/C5XK51. Available from: https://archive.ics.uci.edu/ml/datasets.php
  47. 47. Wolberg WH, Street WN, Mangasarian OL. Breast cancer Wisconsin (diagnostic). UCI Machine Learning Repository. 1995. DOI: 10.24432/C5DW2B. Available from: https://archive.ics.uci.edu/ml/datasets.php
  48. 48. Moro S, Cortez P, Rita P. A data-driven approach to predict the success of bank telemarketing. Decision Support Systems. 2014;62:22-31. DOI: 10.1016/j.dss.2014.03.001
  49. 49. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in python. Journal of Machine Learning Research. 2011;12:2825-2830
  50. 50. Nakamura S. Numerical Analysis and Graphic Visualization with MATLAB. New York: Prentice-Hall, Inc.; 1995
