
Decorrelation and Imputation Methods for Multivariate Modeling

Written By

Gamze Erdogan Erten, Camilla Zacche da Silva and Jeff Boisvert

Submitted: 28 November 2022 Reviewed: 06 May 2024 Published: 14 June 2024

DOI: 10.5772/intechopen.115069


From the Edited Volume

Applied Spatiotemporal Data Analytics and Machine Learning [Working Title]

Dr. Marko Maucec, Prof. Jeffrey M. Yarus, Dr. Timothy Coburn and Associate Prof. Michael Pyrcz


Abstract

In most mining projects, multivariate modeling of regionalized variables has a critical impact on the final model due to the complex multivariate relationships between correlated variables. In geostatistical modeling, multivariate transformations are commonly employed to model complex data relationships. These transformations decorrelate or make the variables independent, which enables the generation of independent models for each variable while maintaining the ability to restore the multivariate relationships through a back-transformation. There are a myriad of transformation methods; this chapter discusses those most applied in geostatistical procedures: principal component analysis (PCA), minimum/maximum autocorrelation factors (MAF), stepwise conditional transform (SCT), and projection pursuit multivariate transform (PPMT). All of these transforms require equally sampled data. In the case of unequal sampling, it is common practice to either exclude the incomplete samples or impute the missing values. Data imputation is recommended in many scientific fields, as removing incomplete samples usually removes valuable information from modeling workflows. Three common imputation methods are discussed in this chapter: single imputation (SI), maximum likelihood estimation (MLE), and multiple imputation (MI). Bayesian updating (BU) is also discussed as an adaptation of MI to geostatistical analysis. MI methods are preferred in geostatistical analysis because they reproduce the variability of the variables and reflect the uncertainty of the missing values.

Keywords

  • geostatistics
  • multivariate modeling
  • multivariate transform
  • decorrelation
  • missing data imputation

1. Introduction

Geostatistical modeling problems often consider more than one variable of interest. In the geosciences, these variables are obtained through log, core, and remote sensing measurements made in a subsurface domain of interest. Geostatistical modeling requires the spatial characterization of multiple attributes to ensure the final model correctly reproduces the relationships between all variables. Traditional geostatistical approaches that account for multivariate heterotopic data are based on a cokriging framework [1], which makes use of the linear model of coregionalization (LMC). To utilize an LMC, all direct and cross-covariances between the experimental variables must be modeled simultaneously. This can be a tedious task when datasets contain many variables, although automatic fitting software can ease the burden.

To avoid the complexities of cokriging and the LMC, simplified methodologies, such as Markov models [2, 3] and decorrelation methods, are used in spatial modeling workflows. Markov models require the presence of an exhaustive secondary attribute that is used as auxiliary information to model a variable of interest [4]. Although Markov models are simpler methods for multivariate modeling, they are limited in their ability to reproduce the targeted spatial and multivariate relationships. Alternatively, decorrelation methods, such as principal components analysis (PCA) [5, 6], minimum/maximum autocorrelation factors (MAF) [7, 8], stepwise conditional transform (SCT) [9, 10], and projection pursuit multivariate transform (PPMT) [11, 12], do not require an LMC or cokriging. Decorrelation methods transform correlated variables into a set of uncorrelated or independent components or factors. Each factor can then be modeled independently, and the correlations between the attributes are restored through the back transform.

A major limitation of such methodologies is the need for homotopic sampling, in which every variable is sampled at every location [13]. In most practical applications it is challenging to adhere to homotopic sampling, and forcing a heterotopic dataset into a homotopic one (e.g., by discarding incomplete samples) can distort the known distributions of the variables, which could negatively impact the final model. Thus, heterotopic sampling configurations must be preprocessed before these multivariate techniques can be applied [14].

During data preprocessing when considering decorrelation workflows, samples with missing values in multivariate data are either excluded or imputed. Excluding samples with missing values is rarely considered because it reduces the amount of information available to build the model [15]; exclusion is only appropriate when there are few samples with missing values and the modeler decides that their removal will not have a negative impact. Utilizing data imputation maintains the initial dataset size; however, the approach must account for data variability, local conditional statistics, and spatial correlations in geostatistical analysis to ensure the final model is in alignment with the observed phenomena. It should be noted that conventional imputation methods are not specifically optimized for geostatistical analysis. Thus, imputation methods that consider spatial data, such as Bayesian updating (BU), are preferred in geological data analysis [16, 17].

This chapter introduces the fundamentals of multivariate transforms and missing data theory. The following section begins with an overview of multivariate transform theory and discusses common decorrelation methods such as PCA, MAF, SCT, and PPMT. After outlining these multivariate transforms, an overview of missing data theory is provided. The chapter concludes with a review of conventional missing data imputation methods—single imputation (SI), maximum likelihood estimation (MLE), and multiple imputation (MI)—and their adaptation to geostatistical workflows.


2. Multivariate transforms

This section introduces the required notation and multivariate stationary random functions. Next, the section describes the use of the LMC to represent the spatial correlation structure of coregionalized variables. Fitting an LMC with sparse multivariate data is difficult in practice, so decorrelation transforms have become a common alternative; these remove the correlation between variables so that a model can be generated independently for each factor, and the factors are then recombined to model the target variables. An overview of the most popular decorrelation transformations is provided. The section concludes with a discussion of practical considerations and implementation details of the multivariate data transformations.

2.1 Multivariate stationary random functions

Stationarity is an important concept in the modeling of continuous spatial variables. Consider that $\{\mathbf{Z}(\mathbf{u}) : \mathbf{u} \in D\}$ is a second-order stationary random function defined over a region $D$ and $\{z_i(\mathbf{u}_k), k = 1, \dots, n\}_{i=1,\dots,m}$ represents $n$ observations of $m$ correlated variables. A random function is second-order stationary when the mean vector of $\mathbf{Z}(\mathbf{u})$ is given by:

$$E[\mathbf{Z}(\mathbf{u})] = \mathbf{m}, \quad \forall \mathbf{u} \in D \tag{1}$$

and the variance vector $\boldsymbol{\sigma}^2$ is defined as:

$$E[(\mathbf{Z}(\mathbf{u}) - \mathbf{m})^2] = \boldsymbol{\sigma}^2, \quad \forall \mathbf{u} \in D \tag{2}$$

The variance-covariance matrix at lag $\mathbf{h} = 0$ is $\mathbf{C}(0) = \mathrm{cov}(\mathbf{Z}(\mathbf{u}))$ and the covariance is:

$$E[(\mathbf{Z}(\mathbf{u}) - \mathbf{m})^T(\mathbf{Z}(\mathbf{u} + \mathbf{h}) - \mathbf{m})] = \mathbf{C}(\mathbf{h}), \quad \forall \mathbf{u}, \mathbf{u} + \mathbf{h} \in D \tag{3}$$

The variogram matrix of direct and cross-variograms is given by:

$$\boldsymbol{\Gamma}(\mathbf{h}) = \frac{1}{2} E[(\mathbf{Z}(\mathbf{u}) - \mathbf{Z}(\mathbf{u} + \mathbf{h}))^T(\mathbf{Z}(\mathbf{u}) - \mathbf{Z}(\mathbf{u} + \mathbf{h}))], \quad \forall \mathbf{u}, \mathbf{u} + \mathbf{h} \in D \tag{4}$$

where $T$ denotes transposition, $\mathbf{Z}(\mathbf{u}) = [Z_1(\mathbf{u}), \dots, Z_m(\mathbf{u})]$, and $\mathbf{m} = [m_1, \dots, m_m]$. If a random function is second-order stationary, it is also intrinsic, and the variogram $\boldsymbol{\Gamma}(\mathbf{h})$ and covariance $\mathbf{C}(\mathbf{h})$ are equivalent functions; that is, $\boldsymbol{\Gamma}(\mathbf{h}) = \mathbf{C}(0) - \mathbf{C}(\mathbf{h})$.
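To make the variogram matrix concrete, the following minimal Python sketch (an illustration, not from the chapter; the synthetic two-variable dataset and the tolerance-based pair search are assumptions) estimates $\boldsymbol{\Gamma}(\mathbf{h})$ of Eq. (4) from sample pairs separated by approximately $|\mathbf{h}|$:

```python
import numpy as np

def variogram_matrix(coords, Z, h, tol=0.5):
    """Estimate the variogram matrix Gamma(h) of Eq. (4) from m colocated
    variables Z (n x m) sampled at coordinates coords (n x d); all pairs
    whose separation distance falls within h +/- tol are averaged."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    i, j = np.where((d > h - tol) & (d <= h + tol))
    dz = Z[i] - Z[j]                       # increments Z(u) - Z(u + h)
    return 0.5 * (dz.T @ dz) / len(i)      # m x m direct and cross-variograms

# usage on synthetic data: diagonal = direct variograms, off-diagonal = cross
rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(300, 2))
Z = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=300)
print(variogram_matrix(coords, Z, h=10.0, tol=2.0).round(2))
```

Because the synthetic values carry no spatial structure, the estimated matrix approaches $\mathbf{C}(0)$, consistent with the relation $\boldsymbol{\Gamma}(\mathbf{h}) = \mathbf{C}(0) - \mathbf{C}(\mathbf{h})$ above.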

2.2 The linear model of coregionalization (LMC)

The LMC assumes that $Z_i(\mathbf{u})$ is a linear combination of pairwise and spatially uncorrelated factors $Y^k(\mathbf{u})$, $k = 0, \dots, K$, with $E[Y^k(\mathbf{u})] = 0$ and $\mathrm{var}(Y^k(\mathbf{u})) = 1$ [18, 19, 20]; that is,

$$Z_i(\mathbf{u}) = m_i + \sum_{k=0}^{K} \alpha_i^k Y_i^k(\mathbf{u}), \quad \forall i, \mathbf{u} \tag{5}$$

An LMC fits direct and cross-variograms jointly with a positive definite model. In this procedure, $Y_i^k(\mathbf{u})$ is assumed to have the same variogram function $\gamma^k(\mathbf{h})$, with coefficients $b_{ij}^k = \alpha_i^k \alpha_j^k$, $\forall i, j$; the matrix form of the LMC is given by:

$$\boldsymbol{\Gamma}(\mathbf{h}) = \sum_{k=0}^{K} \mathbf{B}^k \gamma^k(\mathbf{h}) \tag{6}$$

where $\mathbf{B}^k$ is a positive semi-definite coregionalization matrix containing the contributions of the variogram functions $\gamma^k(\mathbf{h})$. First, the $K + 1$ basic variogram functions $\gamma^k(\mathbf{h})$ are defined with common range parameters for all direct and cross-variograms. An automatic fitting procedure can then be implemented to iteratively determine the coefficients $b_{ij}^k$ that fit the experimental data [21].
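As an illustration of Eq. (6), the sketch below builds a hypothetical two-variable, two-structure LMC (the coregionalization matrices and the nugget-plus-spherical choice are assumptions, not values from the chapter) and checks that each $\mathbf{B}^k$ is positive semi-definite:

```python
import numpy as np

def spherical(h, a):
    """Spherical basic variogram structure with range a."""
    r = np.minimum(h / a, 1.0)
    return 1.5 * r - 0.5 * r**3

# hypothetical two-structure LMC for two variables: nugget + spherical(a=100)
B = [np.array([[0.2, 0.05], [0.05, 0.3]]),    # B^0, nugget contributions
     np.array([[0.8, 0.45], [0.45, 0.7]])]    # B^1, spherical contributions

# each coregionalization matrix must be positive semi-definite (Eq. 6)
for k, Bk in enumerate(B):
    assert np.all(np.linalg.eigvalsh(Bk) >= -1e-12), f"B^{k} is not PSD"

def lmc_variogram(h):
    """Gamma(h) = B^0 * gamma^0(h) + B^1 * gamma^1(h), gamma^0 a nugget."""
    nugget = np.where(h > 0, 1.0, 0.0)
    return B[0] * nugget + B[1] * spherical(h, a=100.0)

print(lmc_variogram(50.0))  # 2 x 2 matrix of direct and cross variogram values
```

Checking that every $\mathbf{B}^k$ is positive semi-definite is what guarantees the fitted model is a valid (positive definite) coregionalization.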

2.3 Principal component analysis (PCA)

PCA is a process that transforms a set of correlated variables into a new coordinate system in which they are uncorrelated (Figure 1) [5, 18]. PCA serves many purposes in geostatistical modeling, such as dimensionality reduction, identification of underlying factors, decorrelation of variables for independent modeling of multivariate systems, and verification of intrinsic correlation [22].

Figure 1.

Original coordinates/variables (left) and rotated uncorrelated coordinates/variables along the principal axis of the data through PCA (right).

Consider that the correlated variables $\mathbf{Z}(\mathbf{u}) = [Z_1(\mathbf{u}), \dots, Z_m(\mathbf{u})]$ are transformed into pairwise ($\mathbf{h} = 0$) uncorrelated factors $\mathbf{Y}(\mathbf{u}) = [Y_1(\mathbf{u}), \dots, Y_m(\mathbf{u})]$. Figure 1 (right) shows two orthogonal uncorrelated PCA factors (principal components). $Y_1$ accounts for the majority of the variance within the data, while $Y_2$ accounts for the remaining variance not explained by $Y_1$. The implementation of the PCA transform and its back-transform is given as follows [23], with a small numeric sketch after the list:

  1. Transformation of the correlated variables $\mathbf{Z}(\mathbf{u}) = [Z_1(\mathbf{u}), \dots, Z_m(\mathbf{u})]$ to Gaussian variables $\mathbf{Y}(\mathbf{u}) \sim N(0, 1)$; that is, $\mathbf{Y}(\mathbf{u}) = G^{-1}(F(\mathbf{Z}(\mathbf{u})))$. This step is optional, but having a mean of zero and a variance of one improves the interpretability of the results.

  2. Calculation of the covariance matrix of $\mathbf{Y}(\mathbf{u})$; that is, $\boldsymbol{\Sigma}_Y = \frac{1}{n}\mathbf{Y}(\mathbf{u})\mathbf{Y}(\mathbf{u})^T$, where $\boldsymbol{\Sigma}_Y$ symbolizes the variance-covariance matrix.

  3. Spectral decomposition of $\boldsymbol{\Sigma}_Y$, which yields the eigenvector matrix $\mathbf{V}$ ($\mathbf{V}^T\mathbf{V} = \mathbf{I}$) and the diagonal eigenvalue matrix $\mathbf{D}$: $\boldsymbol{\Sigma}_Y = \mathbf{V}\mathbf{D}\mathbf{V}^T$. The columns of the orthogonal matrix $\mathbf{V}$ define the principal directions of variability, and the magnitude of the variability along each direction is described by the diagonal matrix $\mathbf{D}$.

  4. Calculation of the standardized components $\mathbf{Y}^*(\mathbf{u})$ by the linear matrix multiplication of $\mathbf{Y}(\mathbf{u})$ with $\mathbf{B}$: $\mathbf{Y}^*(\mathbf{u}) = \mathbf{B}\mathbf{Y}(\mathbf{u})$, where $\mathbf{B} = \mathbf{D}^{-1/2}\mathbf{V}^T$.

  5. Back transformation of $\mathbf{Y}^*(\mathbf{u})$ to the Gaussian variables; that is, $\mathbf{Y}(\mathbf{u}) = \mathbf{V}\mathbf{D}^{1/2}\mathbf{Y}^*(\mathbf{u})$, if used in the modeling workflow.

  6. Back transformation of $\mathbf{Y}(\mathbf{u})$ to the original variables $\mathbf{Z}(\mathbf{u})$; that is, $\mathbf{Z}(\mathbf{u}) = F^{-1}(G(\mathbf{Y}(\mathbf{u})))$.
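A minimal numeric sketch of steps 1–6 is given below. It is an illustration only: the synthetic data, the rank-based normal score transform, and the omission of the final normal score back-transform (step 6) are assumptions made for brevity.

```python
import numpy as np
from scipy import stats

def normal_score(z):
    """Step 1: rank-based normal score transform, z -> y ~ N(0, 1)."""
    return stats.norm.ppf(stats.rankdata(z) / (len(z) + 1))  # G^{-1}(F(z))

rng = np.random.default_rng(0)
Z = rng.multivariate_normal([5, 10], [[1.0, 0.7], [0.7, 1.5]], size=500) ** 2
Y = np.column_stack([normal_score(Z[:, i]) for i in range(Z.shape[1])])

# steps 2-3: covariance and spectral decomposition Sigma_Y = V D V^T
Sigma = np.cov(Y, rowvar=False)
D, V = np.linalg.eigh(Sigma)          # eigh returns eigenvalues ascending
D, V = D[::-1], V[:, ::-1]            # reorder so PC1 carries most variance

# step 4: standardized principal components Y* = D^{-1/2} V^T Y
B = np.diag(D ** -0.5) @ V.T
Y_star = Y @ B.T

# variance proportion of each PC (Eq. 7 below)
print((100 * D / D.sum()).round(1))

# step 5: back transform to the Gaussian variables, Y = V D^{1/2} Y*
Y_back = Y_star @ (V @ np.diag(D ** 0.5)).T
print(np.allclose(Y, Y_back))         # True: lossless when all PCs are kept
```

The final check illustrates the point made below: the back transform reproduces the input exactly only when all $m$ components are retained.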

The PCA transform and back transform are schematically illustrated in Figure 2 with the Jura dataset [24, 25]. Note that only two variables (nickel (Ni) and zinc (Zn)) are included for clarity.

Figure 2.

Schematic representation of the PCA transform and back-transform steps for Ni and Zn variables of the Jura dataset.

The correlated Ni and Zn variables of the Jura data in Figure 2 are first transformed into normal scores (NS_Ni and NS_Zn). Next, the components ($Y_1$ and $Y_2$) are generated for the data. Finally, the components are back-transformed to normal scores, and a normal score back-transform is then performed to regenerate the original values. The principal components (PCs) are obtained in ranked order, where the first component is given by the first eigenvector of the spectral decomposition of $\boldsymbol{\Sigma}_Y$ and is associated with the first eigenvalue. The eigenvalues are linked to the proportion of variance associated with each PC. Generally, the majority of the data variance is accounted for within the first few PCs, with the proportion calculated by the following equation:

$$\frac{d_{i,i}}{\mathrm{tr}(\mathbf{D})} \times 100 \tag{7}$$

where $d_{i,i}$ is the diagonal entry of $\mathbf{D}$ corresponding to the variance of the $i$th PC. The eigenvalues of the components of the Jura data in Figure 2, for example, are 1.66 (accounting for 85% of the total variance) and 0.34 (accounting for 15% of the total variance), respectively. By obtaining the proportion of data variance represented, the relative variance contribution of each component can be identified. Knowing each component's variance contribution provides insight into whether dimensionality reduction is warranted to reduce system complexity. Dimensionality reduction retains the components with high variance contributions, which simplifies the overall system while avoiding significant information loss. However, it is important to highlight that PCA is sensitive to outliers; in their presence, the ratio of variance attributed to each component may be distorted. Thus, it is recommended that the data be standardized before transformation. Moreover, if fewer than $m$ components are used in geostatistical analysis, the back transformation will not perfectly reproduce the input data, which is typically required in spatial modeling contexts.

PCA is widely used due to its straightforward application. Nonetheless, there are drawbacks when the goal is to remove spatial correlation from the data, because decorrelation can only be ensured for the null lag separation vector ($\mathbf{h} = 0$). PCA can only achieve full spatial decorrelation under intrinsic correlation models, where the correlation between variables is independent of the lag $\mathbf{h}$ [26]. PCA is, however, rather useful for detecting intrinsic correlation among variables. It can be directly applied in multivariate modeling procedures when the correlation between two variables does not depend on the spatial scale; in such cases, the direct and cross-covariance functions are proportional to the same basic spatial correlation function. When the variables do not conform to an intrinsic correlation model, alternative transformations tailored for spatial decorrelation should be considered.

PCA also assumes that the $\mathbf{h} = 0$ multivariate distribution is completely parameterized by the covariance matrix $\boldsymbol{\Sigma}_Y$ [13]. When the data have complex features, such as nonlinear relationships, mineralogical constraints, and heteroscedasticity, this assumption results in principal components that are uncorrelated but not independent at $\mathbf{h} = 0$. The back transformation of those components is unlikely to reproduce the original multivariate dependencies.

2.4 Minimum/maximum autocorrelation factors

The MAF transform was first proposed by Switzer [8] for use in image processing and remote sensing. It was later used in geostatistical modeling by Desbarats and Dimitrakopoulos [7]. MAF applies the PCA transform twice: to the variance-covariance matrix $\boldsymbol{\Sigma}_Y$ and to the experimental variogram matrix of the standardized components $\hat{\boldsymbol{\Gamma}}_{Y^*}(\mathbf{h})$ computed at one chosen lag $\mathbf{h} \neq 0$. The MAF transformation results in random variables that are spatially uncorrelated at $\mathbf{h} = 0$ and have no cross-correlations at the chosen lag $\mathbf{h}_j > 0$, $j = 1, \dots, J$ [27, 28]. The implementation steps of the MAF transform and its back-transform are given as follows, with a small sketch of the second rotation after the list:

  1. Calculate the standardized components $\mathbf{Y}^*(\mathbf{u})$ at $\mathbf{h} = 0$; refer to PCA steps 1–4 given in Section 2.3.

  2. Select the lag distance $\mathbf{h}_j$, $j = 1, \dots, J$ and compute the experimental variogram matrix $\hat{\boldsymbol{\Gamma}}(\mathbf{h}_j)$.

  3. Diagonalize $\hat{\boldsymbol{\Gamma}}(\mathbf{h}_j)$; that is, $\hat{\boldsymbol{\Gamma}}(\mathbf{h}_j) = \mathbf{V}_{Y^*}\mathbf{D}\mathbf{V}_{Y^*}^T$.

  4. Calculate the spatially uncorrelated MAF factors $\mathbf{Y}^{**}(\mathbf{u})$; that is, $\mathbf{Y}^{**}(\mathbf{u}) = \mathbf{V}_{Y^*}\mathbf{Y}^*(\mathbf{u})$. This step removes correlation at both $\mathbf{h} = 0$ and $\mathbf{h}_j > 0$.

  5. Back transform $\mathbf{Y}^{**}(\mathbf{u})$ to the PCs; that is, $\mathbf{Y}^*(\mathbf{u}) = \mathbf{V}_{Y^*}^{-1}\mathbf{Y}^{**}(\mathbf{u})$.

  6. Back transform $\mathbf{Y}^*(\mathbf{u})$ to the Gaussian variables $\mathbf{Y}(\mathbf{u})$ and the original values $\mathbf{Z}(\mathbf{u})$; refer to steps 5 and 6 in Section 2.3.
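The sketch below illustrates steps 2–5 of the second rotation. It is an illustration only, assuming the `variogram_matrix` helper from the sketch in Section 2.1 and standardized components `Y_star` as an n x m numpy array:

```python
import numpy as np

def maf_factors(coords, Y_star, h, tol=0.5):
    """Second rotation of the standardized PCA components Y* (n x m) so the
    factors are also cross-uncorrelated at lag h (steps 2-4); reuses the
    variogram_matrix helper from the sketch in Section 2.1."""
    G = variogram_matrix(coords, Y_star, h, tol)   # experimental Gamma(h_j)
    G = 0.5 * (G + G.T)                            # enforce exact symmetry
    _, V = np.linalg.eigh(G)                       # Gamma(h_j) = V D V^T
    return Y_star @ V, V                           # forward rotation

# back transform (step 5): V is orthogonal, so V^{-1} = V^T
# Y_star_back = Y_maf @ V.T
```

Because the rotation matrix is orthogonal, the back transform in step 5 is simply the transposed rotation, which keeps the chain of transforms lossless.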

This two-stage transformation generates factors $\mathbf{Y}^{**}(\mathbf{u})$ that are uncorrelated at both $\mathbf{h} = 0$ and the selected lag distance $\mathbf{h}_j > 0$ (steps 3 and 4). Figure 3 shows the scatter plots of the original chromium (Cr), Ni, and Zn variables of the Jura dataset and the scatter plots of the MAF factors.

Figure 3.

Scatter plots of the correlated data (left) and MAF data (right) for Cr, Ni, and Zn variables of the Jura dataset.

It is clear from Figure 3 that the MAF transform produces uncorrelated factors MAF1, MAF2, and MAF3. The important aspect of the MAF transformation is found in the cross-variograms; Figure 4 shows the cross-variograms of the original and transformed variables. Although MAF is efficient and facilitates a simplified modeling process because the factors are uncorrelated, there are limitations. For example, spatial decorrelation only occurs at the specific, user-defined lag distance $\mathbf{h}$. The choice of $\mathbf{h}$ is a modeling decision, but it is commonly chosen according to the lag at which spatial cross-correlation is observed for the components (i.e., PCs) obtained in a previous exploratory data analysis step.

Figure 4.

The original and transformed variables cross-variograms for Cr, Ni, and Zn variables of the Jura dataset.

The cross-correlations of the MAF factors in Figure 4 are zero at $\mathbf{h} = 0.6$ km. Dimitrakopoulos and Fonseca [29] presented a model-based derivation of MAF, in which the direct and cross-spatial continuity of the attributes is used to transform the original variables and obtain uncorrelated factors at any lag distance. Vargas-Guzmán [30] discusses that MAF factors are generally based on omnidirectional covariance matrices, but each MAF factor can be modeled according to its own anisotropy; however, the solution may not be unique if direction is considered. Bailey and Krzanowski [31] proposed a generalization of MAF that produces factors with nearly zero cross-correlations at all given lags using a rotation matrix. Nonetheless, like PCA, MAF is a linear transformation that does not account for complex multivariate features, which implies that such relations may not be reproduced in the final model. This has motivated the use of multivariate transforms such as SCT and PPMT.

2.5 Stepwise conditional transform

Many geostatistical workflows rely on multivariate Gaussian distributions; however, geological variables are rarely Gaussian. For this reason, variables are often transformed prior to geostatistical modeling [12]. Generally, the normal score transform [31, 32] is applied, but this approach only ensures that the transformed variables have univariate Gaussian distributions. In the presence of multiple correlated variables, the modeling process becomes increasingly difficult, and independence between variables cannot be guaranteed using linear decorrelation transforms (PCA and MAF) on datasets with complex multivariate features [12]. SCT [9] and PPMT [12] are alternative transformations that aim to generate components that are independent, rather than just uncorrelated.

SCT was first introduced by Rosenblatt [10] and applied to geostatistical workflows by Leuangthong and Deutsch [9]. This transform handles complex features while decorrelating the variables at $\mathbf{h} = 0$, creating independent multi-Gaussian distributions. SCT is performed in a chained manner, where the $M$th variable is conditionally transformed based on the first $M - 1$ variables [9]. The steps in the stepwise conditional transformation are given as follows, with a bivariate sketch after the list:

  1. Transformation of the first variable $Z_1(\mathbf{u})$ to a Gaussian variable $Y_1(\mathbf{u}) \sim N(0, 1)$; that is, $Y_1(\mathbf{u}) = G^{-1}(F_1(Z_1(\mathbf{u})))$.

  2. Conditional transformation of each subsequent variable $Z_M(\mathbf{u})$ given the previous $M - 1$ variables; that is, $Y_M(\mathbf{u}) = G^{-1}(F_{M|1,\dots,M-1}(Z_M(\mathbf{u}) \mid Y_{1,\dots,M-1}(\mathbf{u})))$.

  3. Back transformation of the $M$th Gaussian variable $Y_M(\mathbf{u})$ to the original variable $Z_M(\mathbf{u})$; that is, $Z_M(\mathbf{u}) = F_{M|1,\dots,M-1}^{-1}(G(Y_M(\mathbf{u})) \mid Y_{1,\dots,M-1}(\mathbf{u}))$.
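A bivariate sketch of the forward transform is given below. It is an illustration only, assuming a rank-based normal score transform and simple quantile binning on the first variable to approximate $F_{2|1}$; the binning artifacts and the GMM alternative are discussed after Figure 5.

```python
import numpy as np
from scipy import stats

def normal_score(z):
    """G^{-1}(F(z)) via ranks: the univariate normal score transform."""
    return stats.norm.ppf(stats.rankdata(z) / (len(z) + 1))

def sct_bivariate(z1, z2, n_bins=10):
    """Normal score z1, then normal score z2 within probability classes
    (bins) of the first variable, approximating F_{2|1}."""
    y1 = normal_score(z1)
    y2 = np.zeros_like(y1)
    edges = np.quantile(y1, np.linspace(0, 1, n_bins + 1))
    edges[-1] += 1e-9                     # include the maximum in last bin
    labels = np.digitize(y1, edges) - 1   # bin index 0 .. n_bins - 1
    for b in range(n_bins):
        mask = labels == b
        if mask.sum() > 1:
            y2[mask] = normal_score(z2[mask])   # conditional transform
    return y1, y2
```

The back transform would require storing the per-bin empirical distributions and reversing the chain in order, which is why the ordering of variables matters, as discussed below.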

SCT facilitates the simulation of multiple variables with complex relationships by enabling the independent Gaussian simulation of $Y_M(\mathbf{u})$. The SCT back transform ensures the reproduction of the original complex features. This procedure is illustrated in Figure 5 for the correlated variables (Ni and Zn) from the Jura dataset.

Figure 5.

SCT and its back-transform for Ni and Zn variables of the Jura dataset.

In Figure 5, a normal score transformation is applied to Ni, while Zn is conditionally transformed based on Ni. For SCT, the order of the variables in the transformation affects the final components obtained; typically, the most sampled or most important variables are transformed first. The first variable does not require any other variables to define the transform, while each subsequent variable is conditional to all previously transformed variables, and its variogram integrates the spatial structure of the previously transformed variables and their cross-correlations [9].

Another challenge with SCT is that many samples are required to obtain reliable conditional distributions. Although Leuangthong and Deutsch [9] proposed that the gaps in the multivariate distribution can be filled through kernel density estimation [33], SCT may not be feasible for more than two to four variables in geological datasets because of typical sample quantities [13]. SCT is also prone to artifacts at the binning boundaries. Thus, fitting a probabilistic Gaussian mixture model (GMM) can be useful to calculate the conditional distributions required for SCT and avoid binning [34].

SCT has been successfully applied in geostatistical workflows. For example, Hosseini and Asghari [35] applied SCT to the simulation of three variables with complex multivariate relationships with stoichiometric constraints. Recently, Figueiredo et al. [36] proposed an algorithm that performs direct simulation using the SCT. The algorithm introduces a uniformly distributed auxiliary random variable that establishes conditional distributions non-parametrically. These distributions are then used to form the spatial correlation model.

2.6 Projection pursuit multivariate transform (PPMT)

The projection pursuit algorithm was initially proposed by Friedman and Tukey [37] for multivariate data analysis. The core idea of this algorithm is to find directions onto which multidimensional data can be projected to reveal important data features. Projection pursuit has been applied to density estimation, regression, and classification [38]. The algorithm also tends to transform the original data toward a multivariate Gaussian distribution [37, 39]; PPMT was proposed to make use of this tendency within a geostatistical framework [12]. PPMT transforms complex high-dimensional geologic data to a pairwise ($\mathbf{h} = 0$) uncorrelated multi-Gaussian distribution. It is a non-linear extension of PCA that consists of two stages: (1) preprocessing to remove the linear dependency between the variables and (2) projection pursuit to iteratively remove complex multivariate features and create independent variables. The implementation steps of the PPMT forward transform and its back-transform are given as follows, with a simplified sketch of the projection pursuit loop given after Figure 6:

  1. Preprocess the data: follow steps 1–3 given in Section 2.3, including the Gaussian transform.

  2. Transform the data $\mathbf{Y}(\mathbf{u})$ to be uncorrelated with unit variance by sphering; that is, $\mathbf{Y}^*(\mathbf{u}) = \mathbf{S}^{-1/2}\mathbf{Y}(\mathbf{u})$, where $\mathbf{S}^{-1/2} = \mathbf{V}\mathbf{D}^{-1/2}\mathbf{V}^T$. The sphered data $\mathbf{Y}^*(\mathbf{u})$ have an identity covariance matrix.

    These preprocessing steps are visualized for the correlated variables (Cr and Zn) from the Jura dataset in Figure 6.

    It is clear from Figure 6 that sphering removed the linear bivariate correlation present in the original and normal score data. However, because sphering only removes linear correlation, complex dependencies still exist between the variables; the PPMT addresses these complexities. After preprocessing, the remaining PPMT steps are:

  3. Consider an $m$-dimensional unit-length vector $\boldsymbol{\theta}$ and the projection of $\mathbf{Y}^*(\mathbf{u})$ onto it, $p = \boldsymbol{\theta}^T\mathbf{Y}^*(\mathbf{u})$.

  4. Determine the optimum $\boldsymbol{\theta}$ that maximizes the projection index $I(\boldsymbol{\theta})$, which quantifies the univariate non-Gaussianity of the projection. If $p$ is perfectly Gaussian for any $\boldsymbol{\theta}$, then $I(\boldsymbol{\theta})$ equals zero. In this step, $\mathbf{Y}^*(\mathbf{u})$ is transformed into $\hat{\mathbf{Y}}(\mathbf{u})$ so that the projection along $\boldsymbol{\theta}$, $\hat{p} = \boldsymbol{\theta}^T\hat{\mathbf{Y}}(\mathbf{u})$, is Gaussianized [40].

  5. Iterate step 4 until $I(\boldsymbol{\theta})$ converges. This yields an uncorrelated and multivariate Gaussian $\hat{\mathbf{Y}}(\mathbf{u})$. Studies discussed in Refs. [12, 40] highlight that defining the target value for $I(\boldsymbol{\theta})$ is complex; for details on the stopping criteria, the interested reader is referred to Ref. [37].

  6. Reverse the forward transform order to back-transform $\hat{\mathbf{Y}}(\mathbf{u})$ to $\mathbf{Y}^*(\mathbf{u})$, undoing each projection pursuit iteration [40].

  7. Back-transform $\mathbf{Y}^*(\mathbf{u})$ to the Gaussian variables $\mathbf{Y}(\mathbf{u})$; that is, $\mathbf{Y}(\mathbf{u}) = \mathbf{S}^{1/2}\mathbf{Y}^*(\mathbf{u})$.

  8. Back-transform $\mathbf{Y}(\mathbf{u})$ to the original variables $\mathbf{Z}(\mathbf{u})$; that is, $\mathbf{Z}(\mathbf{u}) = F^{-1}(G(\mathbf{Y}(\mathbf{u})))$.

Figure 6.

The original, normal score, and sphered data scatter plots.
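The simplified sketch below illustrates the projection pursuit loop (steps 3–5). It is an illustration only: the moment-based projection index and the random search over directions are assumptions made for brevity, whereas the published PPMT uses a more robust index and optimization [12, 37]; the per-iteration rotations and empirical mappings would also need to be stored to support the back-transform (step 6).

```python
import numpy as np
from scipy import stats

def normal_score(p):
    return stats.norm.ppf(stats.rankdata(p) / (len(p) + 1))

def projection_index(p):
    # assumption: a simple moment-based non-Gaussianity score, not the
    # projection index of the published PPMT [12, 37]
    return abs(stats.skew(p)) + abs(stats.kurtosis(p))

def ppmt_sketch(Y_sph, n_iter=50, n_dirs=200, seed=0):
    """Iteratively Gaussianize sphered data (n x m) along the most
    non-Gaussian projection found at each iteration (steps 3-5)."""
    rng = np.random.default_rng(seed)
    X = Y_sph.copy()
    n, m = X.shape
    index_path = []
    for _ in range(n_iter):
        # steps 3-4: search random unit vectors for the worst projection
        T = rng.normal(size=(n_dirs, m))
        T /= np.linalg.norm(T, axis=1, keepdims=True)
        scores = np.array([projection_index(X @ t) for t in T])
        theta = T[scores.argmax()]
        index_path.append(scores.max())
        # Gaussianize the projection along theta, leaving the orthogonal
        # complement untouched: rotate, normal score axis 0, rotate back
        Q, _ = np.linalg.qr(np.column_stack([theta, rng.normal(size=(m, m - 1))]))
        coords = X @ Q
        coords[:, 0] = normal_score(coords[:, 0])
        X = coords @ Q.T
    return X, index_path
```

Plotting `index_path` gives a convergence trace analogous to Figure 7 (right): the index generally decreases, with occasional increases when a local optimum was found in the previous iteration.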

The PPMT transform and the convergence of the projection index are illustrated with the Jura data in Figure 7.

Figure 7.

Scatter plots of the PPMT transform (left) and progression of the projection index (right) over 65 iterations.

Figure 7 (right) shows the convergence of the projection index $I(\boldsymbol{\theta})$ through the iterative process. The non-Gaussianity of the projections decreases (i.e., the data become more Gaussian) over roughly 60 iterations. Increases in the projection index across iterations indicate that a local optimum was found in the previous iteration.

PPMT can be applied to any number of variables. However, similar to PCA and SCT, it achieves decorrelation of the variables only at the null lag distance ($\mathbf{h} = 0$); thus, the spatial dependency must be inspected after the transform is performed. In the case of residual spatial correlation between the transformed variables, MAF may be applied to remove remnant correlations at the lag distance of interest. Kloeckner et al. [41] proposed that the modeling effort in multivariate settings can be greatly reduced by using PPMT, where each transformed variable (step 5 above) is modeled independently. They showed that the transformation process did not cause significant distortions in the data distributions and that key statistics, such as histograms and spatial correlations, were reproduced in the final model.

2.7 Summary

In geostatistical modeling, the use of multivariate transformations is an effective alternative to intensive multivariate modeling workflows like the LMC. The multivariate transformations apply a decorrelation or independence transformation, allowing each resultant factor to be modeled independently; the first- and second-order statistics and the original bivariate relationships among the data are restored through the back transform. PCA (Section 2.3) decomposes the correlated multivariate variables into orthogonal factors that are pairwise uncorrelated at lag distance zero ($\mathbf{h} = 0$). PCA assumes that the factors are also uncorrelated at all possible lags $\mathbf{h}_j$, $j = 1, \dots, J$, though this is only valid in the case of intrinsic correlation models. MAF (Section 2.4) is a linear extension of PCA that transforms variables into orthogonal factors; these are uncorrelated at $\mathbf{h} = 0$ and have no cross-correlations at a particular lag distance $\mathbf{h}_j > 0$. MAF factors are uncorrelated at all lag distances under the assumption that the multivariate data can be modeled with a two-structure LMC. As PCA and MAF are linear transforms, they do not produce independent variables in the presence of multivariate complexities such as non-linearity, heteroscedasticity, and constraints. SCT (Section 2.5) and PPMT (Section 2.6) are the most common multivariate Gaussian transforms; they remove non-Gaussian features, resulting in transformed multi-Gaussian data that are independent at lag distance zero ($\mathbf{h} = 0$). While the implementation of SCT is often challenging for more than two to four variables, PPMT can manage datasets containing a high number of variables. Moreover, the combination of PPMT and MAF can be used as a chained transform when correlation is observed at lags $\mathbf{h} > 0$.


3. Data imputation

In a mineral deposit, it is common to have missing variables at sampling locations; sampling cost, changing field conditions, sampling quality, evolving drilling technology, and changing project goals mean that the variables assayed change over time and space. Unequally sampled (heterotopic) data presents a problem in geostatistical analysis when the selected multivariate transformation (Section 2) requires equally sampled (homotopic) data at all data locations. Figure 8 demonstrates this missing data concern for a simple example with heterotopic data.

Figure 8.

A dataset composed of three heterotopically sampled variables at nine locations. There are only 2 locations (4 and 7) where all three variables are collocated. Samples in different colors on the map are shown in the figure as slightly offset for clarity, but they are collocated.

The sample data in Figure 8 contains 18 data entries at 9 locations and three variables of interest. The blank data entries indicate missing variables in a particular sample. In this example, multivariate transformation methods cannot be applied directly to any record where one or more variables are missing. One option is to exclude incomplete observations; however, this creates a loss of information that may introduce additional uncertainty and bias into the resultant model [14]. To maintain the integrity of the original data, imputation of the missing variables through estimation or simulation can be considered as an alternative approach to satisfy a homotopic data requirement in the modeling workflow.

In the following section, missing data mechanisms that cause data to be missing are introduced. This is followed by an overview of the conventional missing data imputation methods and a discussion of their suitability in various missing data mechanisms. Imputation methods that consider the spatial and multivariate structure of geological data are reviewed. The section concludes with a discussion regarding the application of imputation methods within geostatistical workflows.

3.1 Missing data mechanisms

Some of the earliest considerations of missing data mechanisms were provided by Rubin [42], and the reader is referred there for more details. The more general ideas are discussed here.

Consider that $K$ spatial variables $Z_1, \dots, Z_K$ have been sampled in differing subsets at $n$ observation locations, resulting in the data matrix $\mathbf{Z}$: $\{z_{i,k}, i = 1, \dots, n, k = 1, \dots, K\}$. According to Rubin [42], a binary matrix $\mathbf{R}$ can be defined that flags the sampled (0) and missing (1) values of $\mathbf{Z}$, associated with a probability distribution $f(\mathbf{R}|\mathbf{Z}, \boldsymbol{\phi})$, where $f$ is the probability distribution function and $\boldsymbol{\phi}$ is a vector of parameters that explain the missingness.

Missingness is grouped into three categories, arising from different mechanisms: (1) missing completely at random (MCAR), (2) missing at random (MAR), and (3) missing not at random (MNAR) [43]. The MCAR mechanism occurs when the probability of being missing is independent of the values $\mathbf{Z}$; that is, $f(\mathbf{R}|\mathbf{Z}, \boldsymbol{\phi}) = f(\mathbf{R}|\boldsymbol{\phi}), \forall \mathbf{Z}, \boldsymbol{\phi}$. Consider that $\mathbf{Z}_O$ and $\mathbf{Z}_M$ represent the observed and missing values, respectively. MCAR implies that $\mathbf{Z}_M$ and $\mathbf{Z}_O$ share the same overall distributions. This mechanism does not create a bias in further statistical analysis and geostatistical modeling.

MAR occurs when the probability of missingness is related only to the observed values $\mathbf{Z}_O$; that is, $f(\mathbf{R}|\mathbf{Z}, \boldsymbol{\phi}) = f(\mathbf{R}|\mathbf{Z}_O, \boldsymbol{\phi}), \forall \mathbf{Z}_M, \boldsymbol{\phi}$. MAR implies that, conditional on $\mathbf{Z}_O$, missingness is random with respect to $\mathbf{Z}_M$; MAR is therefore also referred to as ignorable missingness. Available imputation methods may generate unbiased results without special handling in the case of MAR.

If the missing mechanism is neither MCAR nor MAR, the probability of being missing is not random and depends on both $\mathbf{Z}_O$ and $\mathbf{Z}_M$; that is, $f(\mathbf{R}|\mathbf{Z}, \boldsymbol{\phi}) = f(\mathbf{R}|\mathbf{Z}_O, \mathbf{Z}_M, \boldsymbol{\phi}), \forall \boldsymbol{\phi}$ [44]. The MNAR mechanism is the most complex; the missing data hold important information about the population that cannot be ignored, and their absence introduces bias in subsequent modeling [45]. Figure 9 illustrates the three missing data mechanisms (MCAR, MAR, and MNAR) with the Walker Lake dataset [46]. The Walker Lake dataset used herein consists of two correlated variables (anonymized as U and V, representing concentrations in parts per million).

Figure 9.

Walker Lake dataset example depicting the three missing data mechanisms (MCAR, MAR, and MNAR) for the variable U.

In Figure 9, the complete dataset for the variable U was produced by regularly sampling the dataset at 3068 locations. The MCAR mechanism was created by setting randomly drawn U values to missing. For the MAR and MNAR mechanisms, U values were set to missing where V is below 135.49 and where U itself is below 20.49, respectively. The proportion of missing values was set to 35% for each mechanism. Figure 9 also shows how missing values under the different mechanisms affect the distribution of U. The statistics of U do not change significantly under MCAR; in comparison, there are variations in the statistics of U for both MAR and MNAR. Figure 10 illustrates the effect of the different missing mechanisms on the variogram maps.

Figure 10.

Walker Lake dataset variogram maps depicting changes in anisotropy direction using various missing data mechanisms for the variable U.

It is clear from Figure 10 that while the anisotropy orientation under MCAR appears close to that of the variogram map of the complete data, the orientations under MAR and MNAR differ because of the missing data mechanisms. A minimal sketch of constructing the three mechanisms is given below.
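The sketch reproduces the logic of this experiment on synthetic stand-ins for U and V (the distributions and the use of quantiles as thresholds are assumptions, not the Walker Lake values):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 3068
# synthetic correlated stand-ins for the Walker Lake variables V and U
V = rng.lognormal(5.0, 0.8, n)
U = 0.6 * V + rng.lognormal(4.5, 0.9, n)

p = 0.35                                  # target proportion of missing U values
mcar = rng.random(n) < p                  # missingness independent of U and V
mar = V < np.quantile(V, p)               # missingness driven by the observed V
mnar = U < np.quantile(U, p)              # missingness driven by U itself

U_mcar, U_mar, U_mnar = (np.where(mask, np.nan, U) for mask in (mcar, mar, mnar))
for name, u in [("MCAR", U_mcar), ("MAR", U_mar), ("MNAR", U_mnar)]:
    print(name, np.nanmean(u).round(1))   # MCAR mean ~ complete; MNAR biased high
```

The printed means mirror Figure 9: MCAR leaves the statistics of U roughly unchanged, while MAR and especially MNAR shift them.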

3.2 Imputation methods

Imputation can be defined as the replacement or modeling of missing values with a suitable method. Many imputation methods exist in the literature [39]; the most common include single imputation (SI), maximum likelihood estimation (MLE), and multiple imputation (MI). An overview and discussion of these three methods are given below.

3.2.1 Single imputation (SI)

These methods generate a single value to replace each missing datum based on a certain mathematical rule. Single imputation methods include the following; a sketch of all three follows the list:

  1. Arithmetic mean imputation: missing values are filled in with the stationary mean of the variable [47]. This preserves the mean but alters the variance of the distribution, and it may not be effective in mineral deposit applications where cut-off grades are important.

  2. Regression or conditional mean imputation: missing values are filled in from a regression fit to correlated secondary variable(s) available at all missing locations [48]; cokriging could be used. Depending on the missing mechanism and the relationship to the secondary variable, this may produce biased parameter estimates.

  3. Stochastic regression imputation: an improvement over regression imputation that applies a stochastic method to add realistic variability to the regression model under the assumption of normally distributed errors [49]. This removes the biases in standard regression imputation methods.
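The sketch below applies the three SI methods to a small synthetic example with a MAR-style missingness pattern (all names, values, and the linear model are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(0, 1, n)                     # fully observed secondary variable
z = 2.0 + 1.5 * x + rng.normal(0, 0.5, n)   # primary variable to be imputed
miss = x < -0.3                             # MAR-style missingness pattern
z_obs = np.where(miss, np.nan, z)

# (1) mean imputation: preserves the mean, shrinks the variance
z_mean = np.where(miss, np.nanmean(z_obs), z_obs)

# (2) regression imputation: fit on complete cases, impute the fitted value
b1, b0 = np.polyfit(x[~miss], z_obs[~miss], 1)
z_reg = np.where(miss, b0 + b1 * x, z_obs)

# (3) stochastic regression: add a residual drawn from the fitted error model
resid_sd = np.std(z_obs[~miss] - (b0 + b1 * x[~miss]))
z_sreg = np.where(miss, b0 + b1 * x + rng.normal(0, resid_sd, n), z_obs)

for name, zz in [("mean", z_mean), ("regression", z_reg), ("stochastic", z_sreg)]:
    print(f"{name:10s} var = {zz.var():.2f}  (true var = {z.var():.2f})")
```

Comparing the printed variances shows the point made below: mean and regression imputation shrink the variance, while stochastic regression approximately restores it.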

Figure 11 shows the application of single imputation methods on a simplified example with a MAR mechanism. Figure 11(a) displays the complete dataset, with gray points representing the observed data and red points indicating the missing data that were initially removed from the dataset to demonstrate the SI methods.

Figure 11.

Single imputation methods: (a) complete dataset, red points are removed in subsequent plots for demonstration, (b) mean imputation of the red points, (c) regression imputation of the red points, (d) stochastic regression imputation of the red points.

Apart from stochastic regression imputation, SI methods may introduce bias under the MAR and MNAR mechanisms; stochastic regression imputation typically results in less bias. More fundamentally, SI does not consider uncertainty in the imputed values, which is the main motivation for the other methods.

3.2.2 Maximum likelihood estimation (MLE)

MLE uses a log-likelihood function to identify the distribution model parameters $\theta$, such as the mean and variance, with the greatest likelihood of producing the observed data. A general method to find the unknown population parameters $\theta$ is the expectation-maximization (EM) algorithm, which is useful when the joint distribution of $\mathbf{Z}_O$ and $\mathbf{Z}_M$ is explicit [50]. The basic idea of EM is to identify the estimate of $\theta$ that maximizes the observed-data log-likelihood $l(\theta|\mathbf{Z}_O)$, that is, the probability density function of the observations $f(\mathbf{Z}_O|\theta)$:

$$l(\theta|\mathbf{Z}_O) = \log f(\mathbf{Z}_O|\theta) = \log \int f(\mathbf{Z}_O, \mathbf{Z}_M|\theta)\, d\mathbf{Z}_M \tag{8}$$

In general cases, the quantity described in Eq. (8) cannot be calculated explicitly. Therefore, EM finds the MLE by iteratively maximizing the expected complete-data log-likelihood $l(\theta|\mathbf{Z})$ as follows:

$$l(\theta|\mathbf{Z}) = \log f(\mathbf{Z}_O, \mathbf{Z}_M|\theta) \tag{9}$$

Consider that $\theta^{(0)}$ represents an arbitrary initial value and $\theta^{(t)}$ is the estimate of $\theta$ at the $t$th iteration. The EM algorithm implements two steps in its next iteration:

  1. The expectation (E) step: compute the expectation of $l(\theta|\mathbf{Z})$ with respect to the conditional distribution of the missing values parameterized by $\theta^{(t)}$:

    $$Q(\theta|\theta^{(t)}) = E[\,l(\theta|\mathbf{Z})\,|\,\mathbf{Z}_O, \theta^{(t)}] = \int l(\theta|\mathbf{Z})\, f(\mathbf{Z}_M|\mathbf{Z}_O, \theta^{(t)})\, d\mathbf{Z}_M \tag{10}$$

  2. The maximization (M) step: $\theta^{(t+1)}$ is defined by maximizing the function $Q$:

$$\theta^{(t+1)} = \arg\max_{\theta}\, Q(\theta|\theta^{(t)}) \tag{11}$$

The iteration is terminated when successive estimates of $\theta$ are nearly identical. The $\theta^{(t+1)}$ that maximizes $Q(\theta|\theta^{(t)})$ is guaranteed to give an $l(\theta|\mathbf{Z}_O)$ that is no smaller than that provided by $\theta^{(t)}$ [50].
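A minimal EM-style sketch for a bivariate Gaussian with missing values in one column is given below. It is an illustration under simplifying assumptions; as flagged in the comments, the omission of the conditional-variance term makes it closer to iterated regression imputation than exact maximum likelihood.

```python
import numpy as np

def em_bivariate(X, n_iter=100):
    """EM-style sketch for a bivariate Gaussian where only column 1 has
    missing values (an assumption for brevity); column 0 is fully observed."""
    X = X.copy()
    miss = np.isnan(X[:, 1])
    X[miss, 1] = np.nanmean(X[:, 1])         # crude starting fill
    for _ in range(n_iter):
        # M step: update theta = (mu, Sigma) from the completed data
        mu = X.mean(axis=0)
        S = np.cov(X, rowvar=False, bias=True)
        # E step: replace missing z2 with E[z2 | z1, theta^(t)]
        beta = S[0, 1] / S[0, 0]
        X[miss, 1] = mu[1] + beta * (X[miss, 0] - mu[0])
        # note: exact ML also carries the conditional variance
        # S[1,1] - beta * S[0,1] into the second-moment update; it is
        # omitted here, which understates the variance of theta
    return mu, S, X
```

In practice, libraries implementing the full sufficient-statistics bookkeeping of Dempster et al. [50] should be preferred; the sketch only conveys the alternation between the E and M steps.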

3.2.3 Multiple imputation (MI)

In MI, multiple datasets are generated, with different values imputed for each missing value at each location. Each complete dataset is then analyzed in the same fashion, and the ensemble can be said to represent 'realizations of the data'. Uncertainty in the imputed values is captured by combining the results of the ensemble, which can be used in subsequent modeling or to form overall estimates and associated standard errors.

In general, the use of MLE or MI is strongly preferred for MAR mechanisms due to their ability to accurately reproduce data distributions and provide appropriate estimates of the uncertainty for imputed data [51]. Moreover, these methods are less likely to introduce bias. Imputation for the MNAR mechanism, however, requires a specialized approach; none of the reviewed imputation methods are directly appropriate. MNAR mechanism-specific methods include selection models [52] and pattern mixture models [53, 54]. However, these methods make strong assumptions regarding the data and can introduce severe bias. Therefore, Rubin [42] and Collins et al. [55] suggested the inclusion of auxiliary variables to help explain the missingness.

For geostatistical analysis, MLE and MI are asymptotically equivalent [43], but MI is preferred because it produces data realizations that can be easily adapted to geostatistical simulation workflows. A typical MI method follows five basic steps [56]: (1) a conditional distribution $f(\mathbf{Z}_M|\mathbf{Z}_O)$ is created to impute values using the observed data and a prior model, (2) a realization of the imputed values $\mathbf{Z}_M^{(l)}$ is generated by stochastically sampling from the modeled conditional distribution, (3) the procedure is repeated to produce $L$ complete datasets $\mathbf{Z}^{(l)} = \{\mathbf{Z}_O, \mathbf{Z}_M^{(l)}\}$, $l = 1, \dots, L$, (4) statistical analysis is applied to each $\mathbf{Z}^{(l)}$, and (5) the results are combined. The MI process for geostatistical modeling is illustrated in Figure 12.

Figure 12.

Schematic illustration of MI in geostatistical modeling [13].

Each imputed data realization in Figure 12 requires its own geostatistical realization. The theory of MI suggests that a standard number of realizations is around three to five [57]. More recent work, however, has recommended that MI should be performed for more realizations, say 20–100 [58]. In the geostatistical framework, 100–200 realizations are common to sample from the space of subsurface uncertainty.
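A minimal sketch of MI steps 1–3 is given below, using the stochastic regression model of Section 3.2.1 as a stand-in for the conditional distribution $f(\mathbf{Z}_M|\mathbf{Z}_O)$ (an illustration; a geostatistical MI would instead sample spatial conditional distributions, as in the BU method of Section 3.3):

```python
import numpy as np

def multiple_imputation(z_obs, x, miss, L=100, seed=0):
    """MI sketch: L stochastic-regression imputations of z given x,
    standing in for sampling from f(Zm | Zo) in step 1."""
    rng = np.random.default_rng(seed)
    b1, b0 = np.polyfit(x[~miss], z_obs[~miss], 1)
    sd = np.std(z_obs[~miss] - (b0 + b1 * x[~miss]))
    realizations = []
    for _ in range(L):                       # steps 2-3: L complete datasets
        z_l = z_obs.copy()
        z_l[miss] = b0 + b1 * x[miss] + rng.normal(0, sd, miss.sum())
        realizations.append(z_l)
    return np.array(realizations)            # steps 4-5: analyze and pool

# pooling example, reusing z_obs, x, miss from the SI sketch above:
# Z = multiple_imputation(z_obs, x, miss)
# print(Z.mean(), Z.mean(axis=1).std())      # estimate and its MI spread
```

The across-realization spread of any statistic quantifies the uncertainty contributed by the missing values, which is exactly what SI cannot provide.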

3.3 Imputation for geological data

Commonly available MI methods are not directly suitable for geological data. In constructing the distribution of potential outcomes for a missing value, the following should be considered: (1) spatially correlated values, (2) collocated and correlated values of other variables, and (3) known complex multivariate features [13]. In recent years, imputation methods for geological data have commonly been based on BU.

BU was proposed by Ren [17] to adapt the MI concept to a geostatistical setting. Consider a location $\mathbf{u}$ where the primary variable value is to be imputed. BU uses the primary data $\{y_P(\mathbf{u}_i), i = 1, \dots, n\}$, which are not available at location $\mathbf{u}$, and the secondary data $\{x_j(\mathbf{u}_i), j = 1, \dots, m, i = 1, \dots, n\}$, which are available in the entire domain [59]. Considering $y$ and $x_j$ to be multi-Gaussian, the BU procedure is described as follows, with a numeric sketch of the update step given after Figure 13:

  1. Prior distribution: estimate the prior mean $\bar{y}_P(\mathbf{u})$ and variance $\sigma_P^2(\mathbf{u})$ according to:

    $$\bar{y}_P(\mathbf{u}) = \sum_{i=1}^{n} \lambda_i y_P(\mathbf{u}_i) \tag{12}$$
    $$\sum_{i=1}^{n} \lambda_i C(\mathbf{u}_i, \mathbf{u}_k) = C(\mathbf{u}, \mathbf{u}_k), \quad k = 1, \dots, n \tag{13}$$
    $$\sigma_P^2(\mathbf{u}) = 1 - \sum_{i=1}^{n} \lambda_i C(\mathbf{u}, \mathbf{u}_i) \tag{14}$$

    where the weights $\lambda_i$ are calculated through the data-to-data covariances $C(\mathbf{u}_i, \mathbf{u}_k)$ and the data-to-unknown covariances $C(\mathbf{u}, \mathbf{u}_i)$ (Eq. 13). This is equivalent to solving the simple kriging normal equations. Step 1 provides the distribution of potential outcomes according to the primary variable, that is, the variable being imputed.

  2. Likelihood estimation: define the likelihood distribution by its mean $\bar{y}_L(\mathbf{u})$ and variance $\sigma_L^2(\mathbf{u})$ according to the following equations:

    $$\bar{y}_L(\mathbf{u}) = \sum_{j=1}^{m} \lambda_j x_j(\mathbf{u}) \tag{15}$$
    $$\sum_{j=1}^{m} \lambda_j \rho_{jk} = \rho_{kY}, \quad k = 1, \dots, m \tag{16}$$
    $$\sigma_L^2(\mathbf{u}) = 1 - \sum_{j=1}^{m} \lambda_j \rho_{jY} \tag{17}$$

    where $x_j(\mathbf{u})$ is the secondary information at the unsampled location $\mathbf{u}$, and the weights $\lambda_j$ are calculated using the cross-correlations between the primary and secondary data, $\rho_{kY}$, and the cross-correlations between the different types of secondary data, $\rho_{jk}$ (Eq. 16). This is equivalent to solving the normal equations of linear least squares regression. Step 2 provides the distribution of potential outcomes according to the secondary variables.

  3. Updated distribution: merge the prior and likelihood distributions to define the updated distribution according to:

    $$\bar{y}_U(\mathbf{u}) = \frac{\bar{y}_L(\mathbf{u})\,\sigma_P^2(\mathbf{u}) + \bar{y}_P(\mathbf{u})\,\sigma_L^2(\mathbf{u})}{\sigma_P^2(\mathbf{u}) - \sigma_P^2(\mathbf{u})\,\sigma_L^2(\mathbf{u}) + \sigma_L^2(\mathbf{u})} \tag{18}$$
    $$\sigma_U^2(\mathbf{u}) = \frac{\sigma_L^2(\mathbf{u})\,\sigma_P^2(\mathbf{u})}{\sigma_P^2(\mathbf{u}) - \sigma_P^2(\mathbf{u})\,\sigma_L^2(\mathbf{u}) + \sigma_L^2(\mathbf{u})} \tag{19}$$

    where $\bar{y}_U(\mathbf{u})$ is the updated mean and $\sigma_U^2(\mathbf{u})$ is the updated variance. Since the posterior distribution is Gaussian and completely determined by $\bar{y}_U(\mathbf{u})$ and $\sigma_U^2(\mathbf{u})$, the full distribution of each missing value is obtained from the spatially correlated primary and secondary variables. The first three steps are summarized in Figure 13.

    The likelihood variance map shown in Figure 13 is constant because the correlations $\rho_{kY}$ are derived from the data at the collocated locations and assumed constant in the region of interest, which leads to the variance also being constant [59].

  4. Stochastically simulate: the first three steps are implemented for every missing value in the dataset. Data realizations are then stochastically simulated by randomly sampling the resulting Gaussian cumulative distribution functions. It should be noted that the imputed values vary across realizations according to the degree of uncertainty in their updated distributions $\sigma_U^2(\mathbf{u})$, while the sampled values remain constant (i.e., their uncertainty is equal to 0) in each realization [60]. BU yields promising results when the missing mechanism is MAR.

Figure 13.

The Bayesian updating (BU) process [17].
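A numeric sketch of the update and sampling steps is given below (the input values are illustrative; in practice $\bar{y}_P$, $\sigma_P^2$, $\bar{y}_L$, and $\sigma_L^2$ come from Eqs. (12)–(17) at each missing location):

```python
import numpy as np

def bayesian_update(y_prior, var_prior, y_lik, var_lik):
    """Merge the prior (Eqs. 12-14) and likelihood (Eqs. 15-17) Gaussian
    distributions into the updated distribution (Eqs. 18-19)."""
    denom = var_prior - var_prior * var_lik + var_lik
    y_u = (y_lik * var_prior + y_prior * var_lik) / denom
    var_u = (var_lik * var_prior) / denom
    return y_u, var_u

# step 4: draw one imputed value per missing location (illustrative inputs);
# repeating the draw L times yields the MI-style data realizations
rng = np.random.default_rng(7)
y_u, var_u = bayesian_update(y_prior=0.8, var_prior=0.4, y_lik=1.2, var_lik=0.6)
imputed = rng.normal(y_u, np.sqrt(var_u))
print(round(y_u, 3), round(var_u, 3))
```

Because all quantities are in normal score units, the back transform to original units follows the imputation, which is part of what makes BU convenient for heterotopic geological data.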

BU is a practical imputation approach for geological data because it only requires a normal score transform that can be applied to heterotopic data. Moreover, in BU, spatial and multivariate conditioning data are incorporated and sets of variables at each location are naturally integrated. In addition, BU accounts for both spatial and collocated data sources and allows for the construction of uncertainty distributions for the missing values.

3.4 Summary

Unequally sampled data is a common occurrence in geospatial datasets. The homotopic data requirement of many multivariate modeling workflows highlights the importance of missing data imputation methodologies. Missing data imputation is the preferred approach to managing heterotopic data. The three common imputation methods are (1) SI, (2) MLE, and (3) MI. SI methods are suitable for the MCAR mechanism but can introduce bias when used for the MAR mechanism. For MAR, MI and MLE are preferred as they reproduce data variability and reflect the uncertainty of imputed values without introducing additional bias. The imputation methods reviewed cannot be applied to an MNAR mechanism. Thus, it is suggested that auxiliary variables be introduced to convert an MNAR mechanism to a MAR mechanism. For geostatistical analysis, MI is preferred to MLE as it generates several realizations of the missing values, which is consistent with many geostatistical simulation workflows that utilize realizations to account for subsurface uncertainty.


4. Conclusion

This chapter presents a detailed exploration of the complexities involved in handling multivariate data within typical geostatistical modeling workflows and emphasizes the necessity for advanced multivariate transformations and imputation methods. The effectiveness of several transformation techniques, including PCA, MAF, SCT, and PPMT, is examined. The chapter further addresses the critical issue of missing data by reviewing imputation methods, ranging from SI techniques to more advanced approaches such as MLE and MI, in light of the various missing data mechanisms.

Geospatial modeling problems continue to grow in complexity as computational resources increase, new modeling methodologies are developed, and more comprehensive multivariate sampling becomes common (drill holes, well logs, core scans, geophysical surveys, assays, etc.). Using all available data to the greatest extent possible to build the best model of a variable of interest necessitates that a multivariate spatial workflow be adopted; in fact, any geostatistical problem with a single variable is now considered a 'simple' model. State-of-the-art geomodeling requires the incorporation of all available data, and this chapter provides some of the necessary tools to enable multivariate spatial modeling of earth science variables.

References

  1. 1. Marechal A. Recovery estimation: A review of models and methods. Geostatistics for Natural Resources Characterization. 1984:385-420
  2. 2. Almeida AS, Journel AG. Joint simulation of multiple variables with a Markov-type coregionalization model. Mathematical Geology. 1994;26(5):565-588
  3. 3. Journel AG. Markov models for cross-covariances. Mathematical Geology. 1999;31(8):955-964
  4. 4. Babak O, Deutsch CV. Collocated cokriging based on merged secondary attributes. Mathematical Geoscience. 2009;41(8):921-926
  5. 5. Davis BM, Greenes KA. Estimation using spatially distributed multivariate data: An example with coal quality. Journal of the International Association for Mathematical Geology. 1983;15(2):287-300
  6. 6. Hotelling H. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology. 1933;24(6):417
  7. 7. Desbarats AJ, Dimitrakopoulos R. Geostatistical simulation of regionalized pore-size distributions using min/max autocorrelation factors. Mathematical Geology. 2000;32(8):919-942
  8. 8. Switzer P. Min/Max Autocorrelation Factors for Multivariate Spatial Imagery. Technical Report. Stanford, California: Department of Statistics, Stanford University; 1984
  9. 9. Leuangthong O, Deutsch CV. Stepwise conditional transformation for simulation of multiple variables. Mathematical Geology. 2003;35(2):155-173
  10. 10. Rosenblatt M. Remarks on a multivariate transformation. Annals of Mathematical Statistics. 1952;23(3):470-472
  11. 11. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, USA: Springer Science & Business Media; 2009
  12. 12. Barnett RM, Manchuk JG, Deutsch CV. Projection pursuit multivariate transform. Mathematical Geoscience. 2014;46(3):337-359
  13. 13. Barnett RM. Managing Complex Multivariate Relations in the Presence of Incomplete Spatial Data. Edmonton, Canada: University of Alberta; 2015
  14. 14. Barnett RM, Deutsch CV. Multivariate imputation of unequally sampled geological variables. Mathematical Geoscience. 2015;47(7):791-817
  15. 15. Carranza EJM, Laborte AG. Random forest predictive modeling of mineral prospectivity with small number of prospects and data with missing values in Abra (Philippines). Computational Geosciences. 2015;74:60-70
  16. 16. da Silva CZ, Costa JFCL. The treatment of missing ‘not at random’ geological data for ore grade modelling. Applied Earth Science. 2019;128(1):15-26
  17. 17. Ren W. Bayesian Updating for Geostatistical Analysis. Edmonton, Canada: University of Alberta; 2007
  18. 18. Chiles JP, Delfiner P. Geostatistics: Modeling Spatial Uncertainty. Vol. 497. New Jersey, USA: John Wiley & Sons; 2009
  19. 19. Goovaerts P. On a controversial method for modeling a coregionalization. Mathematical Geology. 1994;26(2):197-204
  20. 20. Journel AG, Huijbregts CJ. Mining geostatistics. Mineralogical Magazine. 1979;43(328):563-564
  21. 21. Goulard M, Voltz M. Linear coregionalization model: Tools for estimation and choice of cross-variogram matrix. Mathematical Geology. 1992;24(3):269-286
  22. 22. Wackernagel H. Multivariate Geostatistics: An Introduction with Applications. Fontainebleau, France: Springer Science & Business Media; 2013
  23. 23. Erten O, Deutsch CV. Assessment of variogram reproduction in the simulation of decorrelated factors. Stochastic Environmental Research and Risk Assessment. 2021;35(12):2583-2604
  24. 24. Webster R, Atteia O, Dubois J. Coregionalization of trace metals in the soil in the Swiss Jura. European Journal of Soil Science. 1994;45(2):205-218
  25. 25. Goovaerts P. Geostatistics for Natural Resources Evaluation. New York: Oxford University Press on Demand; 1997
  26. 26. Goovaerts P. Spatial orthogonality of the principal components computed from coregionalized variables. Mathematical Geology. 1993;25(3):281-302
  27. 27. Mueller UA, Ferreira J. The U-WEDGE transformation method for multivariate geostatistical simulation. Mathematical Geoscience [Internet]. 2012;44(4):427-448. DOI: 10.1007/s11004-012-9384-7
  28. 28. Rondon O. Teaching aid: Minimum/maximum autocorrelation factors for joint simulation of attributes. Mathematical Geoscience. 2012;44(4):469-504
  29. 29. Dimitrakopoulos R, Fonseca MB. Assessing risk in grade-tonnage curves in a complex copper deposit, Northern Brazil, based on an efficient joint simulation of multiple correlated variables. In: Proceedings of the Computers and Operations Research in the Mineral Industry Cape Town, South Africa. Johannesburg, South Africa: The South African Institute of Mining and Metallurgy; 2003. pp. 373-382
  30. 30. Vargas-Guzmán JA. Fast modeling of cross-covariances in the LMC: A tool for data integration. Stochastic Environmental Research and Risk Assessment. 2004;18(2):91-99
  31. 31. Bailey TC, Krzanowski WJ. Extensions to spatial factor methods with an illustration in geochemistry. Mathematical Geology. 2000;32(6):657-682
  32. 32. Verly G. The multigaussian approach and its applications to the estimation of local reserves. Journal of the International Association for Mathematical Geology. 1983;15(2):259-286
  33. 33. Terrell GR, Scott DW. Variable kernel density estimation. The Annals of Statistics. 1992;20(3):1236-1265
  34. 34. Gomes CC, Boisvert J, Deutsch CV. Gaussian Mixture Models. Geostatistics Lessons [Internet]. Edmonton, Canada; 2022. Available from: http://www.geostatisticslessons.com/lessons/gmm
  35. 35. Hosseini SA, Asghari O. Multivariate geostatistical simulation of the Gole Gohar iron ore deposit, Iran. Journal of the Southern African Institute of Mining and Metallurgy. 2016;116(5):423-430
  36. 36. de Figueiredo LP, Schmitz T, Lunelli R, Roisenberg M, de Freitas DS, Grana D. Direct multivariate simulation-a stepwise conditional transformation for multivariate geostatistical simulation. Computational Geosciences. 2021;147:104659
  37. 37. Friedman JH, Tukey JW. A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers. 1974;100(9):881-890
  38. 38. Huber PJ. Projection pursuit. The Annals of Statistics. 1985;13(2):435-475
  39. 39. Lee EK, Cook D, Klinke S, Lumley T. Projection pursuit for exploratory supervised classification. Journal of Computational and Graphical Statistics. 2005;14(4):831-846
  40. 40. Barnett RM, Manchuk JG, Deutsch CV. The projection-pursuit multivariate transform for improved continuous variable modeling. SPE Journal. 2016;21(6):2010-2026
  41. 41. Kloeckner J, da Silva CZ, Costa J. Covariance table and PPMT: Spatial continuity mapping of multiple variables. In: Mining Goes Digital. London: CRC Press; 2019. pp. 106-114
  42. 42. Rubin DB. Inference and missing data. Biometrika. 1976;63(3):581-592
  43. 43. Enders CK. Applied Missing Data Analysis. New York: Guilford Publications; 2022
  44. 44. Mack C, Su Z, Westreich D. Managing Missing Data in Patient Registries: Addendum to Registries for Evaluating Patient Outcomes: A User’s Guide. Rockville (MD): Agency for Healthcare Research and Quality; 2018
  45. 45. Jones BG, Facer RA. CORRMAT/PROB, a program to create and test a correlation coefficient matrix from data with missing values. Computational Geosciences. 1982;8(2):191-198
  46. 46. Isaaks EH, Srivastava MR. An Introduction to Applied Geostatistics. New York: Oxford University Press; 1989
  47. 47. Wilks SS. Moments and distributions of estimates of population parameters from fragmentary samples. Annals of Mathematical Statistics. 1932;3(3):163-195
  48. 48. Buck SF. A method of estimation of missing values in multivariate data suitable for use with an electronic computer. Journal of the Royal Statistical Society, Series B. 1960;22(2):302-306
  49. 49. Sulis I, Porcu M. Assessing the Effectiveness of a Stochastic Regression Imputation Method for Ordered Categorical Data. Italy, Sardinia: Centre for North South Economic Research, University of Cagliari and Sassari; 2008. p. 4. Working Paper CRENoS 200804
  50. 50. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B. 1977;39(1):1-22
  51. 51. Schafer JL, Graham JW. Missing data: Our view of the state of the art. Psychological Methods. 2002;7(2):147
  52. 52. Little RJA, Rubin DB. Statistical Analysis with Missing Data. Vol. 793. New Jersey: John Wiley & Sons; 2019
  53. 53. Little RJA. Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association. 1993;88(421):125-134
  54. 54. Little RJA. A class of pattern-mixture models for normal incomplete data. Biometrika. 1994;81(3):471-483
  55. 55. Collins LM, Schafer JL, Kam CM. A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods. 2001;6(4):330
  56. 56. Huang R, Carriere KC. Comparison of methods for incomplete repeated measures data analysis in small samples. Journal of Statistical Planning and Inference. 2006;136(1):235-247
  57. 57. Rubin DB. Multiple Imputation for Nonresponse in Surveys. Hoboken, NJ: John Wiley and Sons; 1987
  58. 58. Graham JW, Olchowski AE, Gilreath TD. How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science. 2007;8(3):206-213
  59. 59. Zhang H, Erten O, Deutsch CV. Bayesian Updating for Combining Conditional Distributions. Edmonton, Canada: Geostatistics Lessons; 2020. Available from: http://www.geostatisticslessons.com/lessons/bayesianupdating
  60. 60. Barnett R, Deutsch C. Missing data replacement in a multi Gaussian context. CCG Annual Report. 2012;14:14
