Open access peer-reviewed chapter - ONLINE FIRST

AutoML Insights: Gaining Confidence to Operationalize Predictive Models

Written By

Florin Stoica and Laura Florentina Stoica

Submitted: 31 January 2024 Reviewed: 11 February 2024 Published: 05 June 2024

DOI: 10.5772/intechopen.1004861


From the Edited Volume

The New Era of Business Intelligence [Working Title]

Dr. Robert M.X. Wu and Prof. Niusha Shafiabady


Abstract

Automated machine learning (AutoML) tools empower organizations to unlock valuable new business insights, integrate advanced AI capabilities into applications, and enable both data scientists and non-technical experts to swiftly build predictive models. However, complex models generated by AutoML can appear to operate as “black boxes.” This lack of transparency can be a major disadvantage with regard to legislative compliance (e.g., European Union regulations implementing a “right to explanation” of algorithmic decisions provided by artificial intelligence systems). Furthermore, in many applications a black-box system must not be trusted by default. We approach the problem of confidence in models generated by AutoML frameworks in two areas: model explanation and model validation. To build confidence in the results of the machine learning models produced by AutoML pipelines, we propose a model-agnostic approach based on SHapley Additive exPlanations (SHAP) values for interpreting these models from both a global and a local perspective. We also conducted a comparative analysis of three AutoML frameworks, examining their architectures and practical aspects, with the aim of transparency in the generation of machine learning models. Transparent model generation helps stakeholders understand how models are created, leading to greater confidence in their reliability and suitability for deployment in real-world scenarios.

Keywords

  • AutoML
  • OML4Py
  • H2O
  • Auto-sklearn
  • SHAP

1. Introduction

Many statistical models have long been developed for predictive purposes, including linear and nonlinear regression models, decision trees, random forests, k-nearest neighbors, Naive Bayes, and others.

In classical statistics, models are often built as a result of a thorough understanding of the application domain. As more and more companies and organizations show a growing demand for predictive models, model properties such as flexibility, capability of internal variable selection or feature engineering, high precision of predictions, and automatic methods of optimization of hyperparameters are of great interest.

To obtain robust and flexible models, an approach based on machine learning algorithms and model ensembles is currently used [1]. In the machine learning approach, the focus shifts from gaining a deep understanding of the application domain to the construction, optimization, and validation of models. Techniques such as boosting, bootstrap aggregating, averaging, model cascading, or stacking involve combining numerous simpler models into a more complex one to achieve enhanced predictive performance.

The benefits of complex models are evident, but they also come with inherent drawbacks that need attention. Complex models can often be perceived as “black boxes.” The term “black-box” is typically applied to models with intricate structures that are challenging for humans to understand. Explaining the impact of thousands of variables on a model’s prediction can be a daunting task, if not impossible.

The architecture of a complex model, such as a deep ensemble model [2], might be opaque, making it challenging to determine if the model aligns with the characteristics of the application domain. A detailed examination of practical issues associated with complex black-box models is provided in the referenced study [3].

There is a growing number of examples of predictive models that have lost users’ trust, due to performance that diminishes over time or because the models have become biased in some way.

Some relevant cases are listed below:

  • IBM’s Watson for Oncology has been criticized by oncologists for providing unsafe and inaccurate recommendations [4].

  • The Google Flu model made worse predictions after 2 years than at baseline [5, 6].

  • Amazon’s curriculum vitae screening system has been found to be biased against women [7].

These examples relate to two issues, discussed below.

First, in Ref. [8], it is noted that the widespread adoption of “black-box” algorithms across various industries posed significant challenges for companies and governments aiming to adhere to the general data protection regulation (GDPR), particularly concerning the “Right to Explanation.” This right emphasizes the entitlement to receive an explanation for the output of an automated algorithm.

Second, basing business decisions on “black-box” prediction models requires decision-makers to have confidence in such models.

The development of traditional machine learning models is resource intensive, demanding substantial domain knowledge and time to create and assess numerous models, and it relies on human machine learning experts to perform manual tasks. Automated machine learning, also known as automated ML or AutoML, automates the laborious, iterative tasks associated with developing machine learning models. AutoML provides methods and processes to build ML models at high scale, efficiency, and productivity, all while sustaining model quality.

Automated machine learning is now widespread, and every analytics and software as a service (SaaS) product seems to incorporate AutoML features. This trend is supported and encouraged by the fact that most market analysts, such as Forrester and Gartner, consider AutoML facilities as a criterion for rating machine learning (ML) applications and services.

Significant growth in the automated machine learning market is expected in the coming years, from $346.2 million in 2020 to $5 billion by 2027. The market is driven by the increasing demand for effective fraud detection solutions, the soaring requirement for personalized product recommendations, and the growing need for predictive analytics in the economic environment [9].

The increasing popularity of machine learning applications has led to a demand for standardized machine learning methods that can be easily utilized without the need for expertise in the field. AutoML will bring major changes to the current methodology of building ML applications through automation. AutoML can automate repetitive, tedious, and time-consuming tasks for data scientists. Also, AutoML contributes to the democratization of ML, because the construction of ML pipelines is no longer restricted to data scientists, empowering domain experts to build ML pipelines on their own [10].

In the Forrester Analytics Global Business Technographics® Data and Analytics Survey, 2019, 61% of decision makers whose companies were adopting artificial intelligence technologies stated that they were implementing, had already implemented, or were expanding or upgrading their automation-focused machine learning solutions. Another 25% planned to implement such solutions within the next year [11].

With AutoML, the time needed to prepare ML models for production can be greatly reduced. Popularized by Google in 2018 as a means of accelerating the search for the optimal neural network architecture for a given deep learning (DL) task, AutoML has evolved into a more widely applicable approach for automating various machine learning tasks, encompassing data preparation, model selection, feature selection and engineering, as well as hyperparameter tuning.

As machine learning models are increasingly integrated into business-critical applications, the need to make these models more easily accessible and reproducible has greatly increased. Given that only a model running in production can bring value, time to market is a very important metric that needs to be optimized for any ML commercial project.

One may ask: what does an AutoML pipeline do internally? In order to gain the trust of a business user for deploying in production a prediction model provided by an AutoML framework, we consider that the following questions should be answered in the affirmative:

  • (Q1) Is the process by which the best model was built and optimized known?

  • (Q2) Is that model validated by evaluating and comparing it to models provided by other AutoML pipelines or obtained by other techniques?

  • (Q3) Is there a facility to analyze the structure of the respective model and to explain the results obtained, in accordance with current legislative regulations?

The aim of this work is to increase confidence in using AutoML and quickly integrate it into practical applications. Our approach is based on model explanation and model validation, respectively. The model explanation involves describing the AutoML pipeline through which the optimal model was built and using specific tools to inspect the model structure and highlight the importance of features (addressing Q1 and Q3). Validation of the model is carried out by evaluating its performance, taking into account the accuracy of the predictions made with a limited time budget (to answer the second question, Q2).

Within this chapter, we describe and use three state-of-the-art AutoML tools: Oracle AutoML (OML4Py), H2O AutoML, and Auto-sklearn.

OML4Py is a commercially released, fast, iteration-free AutoML pipeline. H2O AutoML and Auto-sklearn are open-source AutoML frameworks implemented in Java and Python, respectively. For each framework, we present quick, practical ways of use, addressing all the necessary steps from data preprocessing to the deployment of the optimal model in production. All tests were performed under rigid time constraints and with minimal tool-specific configuration.

The remainder of this chapter is organized as follows. Section 2 presents AutoML usage scenarios in critical applications; the term “critical” emphasizes the importance of these applications to the overall performance, security, and reliability of a system or business. Section 3 describes the OML4Py framework, with details of the implementation of each pipeline stage. Sections 4 and 5 present Auto-sklearn and H2O AutoML, respectively. Section 6 describes the dataset used in the tests and the evaluation methods for the machine learning models provided by the AutoML tools. Practical approaches to building prediction models using the three analyzed frameworks are presented in Section 7. Section 8 addresses global and local explainability using the SHAP framework. Finally, Section 9 provides the results, conclusions, and directions for future research.

The diagrams in this chapter are built using the Business Process Model and Notation (BPMN) 2.0 semi-formal notation. BPMN is a standard that allows building a visual model that documents business procedures and expresses a workflow’s business requirements with a flowchart-like notation [12]. We used BPMN diagrams because of their expressiveness and their widespread adoption by business users to quickly and easily model business processes. In this research, by business user or decision maker we mean a business manager, marketing manager, or business analyst who uses information provided by an AutoML framework to make critical decisions.


2. AutoML in critical applications

The fields of application of AutoML are diverse and of great social and economic importance. We will further exemplify the applicability in areas such as healthcare, medical diagnosis, disease prediction, business analytics, predictive quality in production, real-world scenarios (e.g., crash severity prediction), and natural language processing.

The study presented in Ref. [13] aims to make diagnostic and prognostic modeling more accessible and widespread by employing automated machine learning techniques. The findings present insights into the potential of AutoML approach in enhancing healthcare modeling, emphasizing the democratization of advanced analytics for diagnostic and prognostic purposes.

The study conducted by Paladino et al., published in Ref. [14], assesses the performance of AutoML tools in heart disease diagnosis and prediction. The study focuses on the evaluation of various AutoML tools in the context of heart disease, aiming to provide insights into the efficacy of automated approaches for diagnostic and predictive tasks. The results contribute to the understanding of the capabilities of AutoML tools in the cardiovascular domain, offering valuable information for the development of accurate and efficient diagnostic models for heart diseases.

The research published in Ref. [15] evaluates the suitability of AutoML tools in the field of neuroradiology diagnostics. The conclusions contribute to understanding the performance and feasibility of automated approaches in analyzing neuroimaging data, providing insights into the potential applications and challenges of AutoML in diagnostic neuroradiology.

Musigmann et al. present a study in Ref. [16] that evaluates the application of automated machine learning tools in the domain of cancer diagnostics. The findings contribute insights into the usability and potential advantages of AutoML in the complex field of cancer diagnosis, offering valuable information for the advancement of diagnostic methodologies in oncology.

In Ref. [17] the authors explore the application of AutoML techniques in diabetes diagnosis. The research provides an overview of current approaches, evaluates performance metrics, and outlines future directions for the use of AutoML in diabetes diagnosis. The results contribute to understanding the role of automated machine learning in enhancing diagnostic capabilities for diabetes, offering insights into the current landscape and potential advancements in the field.

Krauß et al. present a study in Ref. [18] focusing on automated machine learning for predictive quality in production. The research explores the application of automated machine learning techniques to predict and enhance quality in manufacturing processes. The findings contribute to the field of predictive quality in production, offering insights into the effectiveness of AutoML in optimizing and maintaining quality standards in manufacturing settings.

A work closer to the one presented here is Ref. [19], where the author delves into automated machine learning and its role in AI-driven decision-making within the realm of business analytics. The study focuses on the application of automated machine-learning techniques to enhance decision-making processes in business analytics. The findings contribute to the understanding of how artificial intelligence, specifically automated machine learning, can be leveraged for more efficient and effective decision-making in the business domain.

The authors of Ref. [20] present an AutoML strategy based on grammatical evolution, focusing on a case study about knowledge discovery from text. The study explores the use of automated machine learning, particularly leveraging grammatical evolution, for knowledge discovery in text data. The results contribute to the intersection of AutoML and natural language processing, offering insights into the application of evolutionary algorithms for automating the process of knowledge extraction from textual information.

In their study published in Ref. [21], Angarita-Zapata et al. investigate the application of AutoML techniques in predicting crash severity in a supervised learning context. The findings contribute to the understanding of how AutoML can be employed in real-world scenarios, specifically in the context of crash severity prediction, providing insights into the performance and applicability of AutoML in this domain.

With the increasing prevalence of black-box ML models used for making critical predictions in important contexts, there is a rising demand for transparency from various stakeholders in the field of AI [22]. Nevertheless, various users have distinct preferences regarding the extent of transparency required to establish trust.

In the qualitative study conducted by Xin et al., published in Ref. [23], which is based on semi-structured interviews with skilled and experienced practitioners with formal training in applied ML, the opinion of one of the participants was: “One of the problems with Auto-ML is that they don’t give you a lot of information about what is actually going on behind the scenes, and that makes it really hard for me to trust.” The research indicates that relying solely on transparency mechanisms, such as visualization, is insufficient for achieving the necessary level of trust in high-stakes industrial settings or mission-critical projects. In these contexts, participants need to be able to provide reasoning and justification for the decisions made regarding model design and selection, emphasizing the need for more comprehensive trust-building measures.

In this perspective, in the following sections, we will present the details necessary to answer the question “How were the machine learning models created?” for the AutoML frameworks referred to in this chapter.


3. OML4Py

OML4Py is included with Oracle Autonomous Database and can also be installed in on-premises Oracle Databases. The OML4Py AutoML feature focuses on three main tasks: algorithm selection, feature selection, and hyperparameter tuning. OML4Py uses in-database machine learning algorithms to minimize data movement, allowing ML algorithms to transform and process data directly in the database. This confers several important benefits compared with traditional, external Python data interfaces:

  • Streamlines data access, working directly with the database without having to import data as a flat file or request exporting data from the database.

  • Cuts down on data latency and bypasses memory limitation issues where the Python processing engine cannot load all requested data into memory before processing that data.

  • Avoids the performance issues associated with single-threaded Python processes, allowing ML algorithms to run in parallel.

  • Limits security vulnerability by removing external database authorization requests.

3.1 Formal definition of the AutoML pipeline optimization problem

Let $\mathcal{A} = \{A^{(1)}, \dots, A^{(R)}\}$ be a set of algorithms, and let the hyperparameters of each algorithm $A^{(j)}$ have domain $\Lambda^{(j)}$. Given a dataset $D_{\text{train}}$ with $N$ samples and $K$ features, let $D_{\text{train}}^{(n,k)} \subseteq D_{\text{train}}$ denote a subset with $n \le N$ samples and $k \le K$ features. Finally, let $\mathcal{L}(A_{\lambda}^{(j)}, D_{\text{train}}^{(n,k)})$ denote the loss that algorithm $A^{(j)}$ achieves on $D_{\text{train}}^{(n,k)}$ when trained with hyperparameters $\lambda$, where $\mathcal{L}$ is any user-defined misclassification rate. The aim of OML4Py is to find the combination of best algorithm $A^{\ast}$, data sample $D_{\text{train}}^{\ast}$, and hyperparameter setting $\lambda^{\ast}$ that minimizes the loss function $\mathcal{L}$ [24]:

$$D_{\text{train}}^{\ast}, A^{\ast}, \lambda^{\ast} \in \operatorname*{argmin}_{n \le N,\; k \le K,\; A^{(j)} \in \mathcal{A},\; \lambda \in \Lambda^{(j)}} \mathcal{L}\!\left(A_{\lambda}^{(j)}, D_{\text{train}}^{(n,k)}\right) \tag{1}$$

The loss function L from Eq. (1) extends the one used in the formalization of the combined algorithm selection and hyperparameter optimization (CASH) problem [25] by including an additional dimension (adaptive data reduction).

An OML4Py pipeline consists of four main stages, depicted in Figure 1.

Figure 1.

OML4Py pipeline (adapted from [24]).

The OML4Py framework implements an iteration-free AutoML pipeline designed to provide tuned models for accurate results in the shortest running time possible [24]. These objectives are achieved by eliminating the need to repeatedly iterate over various pipeline configurations, using a forward-only approach. Many AutoML frameworks [25, 26, 27, 28, 29] are iterative, and the cyclic resumption of optimization steps increases execution time, making iterative pipelines impractical for large data sets with small time budgets [24].

OML4Py speeds up the optimization process by using a single-pass, feed-forward approach that addresses all four stages of the pipeline: data preprocessing, algorithm selection, adaptive data reduction specific to the selected algorithm, and hyperparameter tuning, as shown in Figure 1.

A potential disadvantage of a non-iterative approach may be loss of accuracy, which may have a more pronounced effect on larger time budgets. OML4Py minimizes these risks by meta-learning proxy models, one per ML algorithm, that accurately predict the relative ranking between different pipeline configurations.

Proxy models are preconfigured predictors that OML4Py uses in all stages to make its pipeline iteration-free. Ref. [24] describes the 10 selected proxy models, one per algorithm, together with the hyperparameter values established through the meta-learning process. Proxy models need to satisfy the following requirements [24]:

  • Be an instance of the ML algorithm, whose best performance must be predicted;

  • Have relative performance that is relevant and competitive with respect to the tuned model, without the need for hyperparameter tuning;

  • Be able to accurately determine the relative ranking of different data subsets in order to implement adaptive data reduction.

Each step of the OML4Py pipeline uses these preconfigured proxy models to achieve its goal and provide the result to the next stage: algorithm selection uses them to rank algorithms, adaptive data reduction uses them to identify the relevant subset of the dataset (rows and features), and hyperparameter optimization uses them to bootstrap optimization (model tuning).

3.2 Algorithm selection

AutoML frameworks such as Auto-sklearn [25, 26] and TPOT [29] treat an ML algorithm as another hyperparameter in a large search space, making selection of the optimal algorithm expensive by increasing the time it takes to find a reasonable solution.

OML4Py models automatic algorithm selection as a score prediction problem: the proxy models $P^{(j)}, j \in 1..R$, for all algorithms are executed simultaneously, with a configurable degree of parallelism. The average cross-validation (CV) score is then used to rank the available algorithms $A^{(j)} \in \mathcal{A}, j \in 1..R$.

The best algorithm $A^{\ast} \in \mathcal{A}$, corresponding to the proxy model $P^{\ast}$ that produces the highest score, is passed on to the next stage.

In Figure 2, $D_{\text{train}}'$ represents a sample of $D_{\text{train}}$ limited to 50 K samples, an empirically determined value that provides a reasonable trade-off between running time and the accuracy of the algorithm ranking [24].

Figure 2.

OML4Py algorithm selection (adapted from [24]).

3.3 Adaptive data reduction

The purpose of this step is to determine a subset that is statistically representative of the original training dataset. Replacing the original dataset with the determined subset will have a minimal negative impact on the model score. Adaptive data reduction is done in two steps on a given dataset: row sampling followed by feature selection (column sampling), to reduce the running time of the next hyperparameter optimization stage.

3.3.1 Row sampling

The dataset $D_{\text{train}}'$ is iteratively sampled, from a small subset up to the size of the complete dataset, and each sample is scored by the proxy model $P^{\ast}$ representing the algorithm $A^{\ast} \in \mathcal{A}$ selected in the previous stage of the pipeline. The aim is to find the smallest sample of $D_{\text{train}}'$ that can be used in the following stages of the pipeline without sacrificing model quality. Incrementally larger subsets are considered until a score tolerance threshold is met or the original dataset is reached. Figure 3 shows the row sampling procedure; a minimal sketch follows the figure.

Figure 3.

OML4Py row sampling procedure (adapted from [24]).
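
The sketch below is our own illustration of the idea rather than the OML4Py code: `proxy_model` stands for the preconfigured proxy model $P^{\ast}$, and stratified sampling plus the exact score tolerance used by OML4Py are omitted for brevity.

import numpy as np
from sklearn.model_selection import cross_val_score

def adaptive_row_sampling(proxy_model, X, y, start=1000, tol=1e-3):
    # Double the sample size until the proxy model's cross-validation score
    # stops improving by more than tol, or the full dataset is reached.
    n, prev_score = min(start, len(X)), -np.inf
    while True:
        score = cross_val_score(proxy_model, X[:n], y[:n]).mean()
        if score - prev_score < tol or n == len(X):
            return n, score                    # smallest sample meeting the tolerance
        prev_score, n = score, min(2 * n, len(X))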

3.3.2 Feature selection

The OML4Py feature selection procedure consists of four steps which are presented in Figure 4.

  1. Feature ranking selects only the most important k features for evaluation, using five ranking algorithms.

  2. The size generator procedure grows the previous subset of features by a factor of 20%. This approach reduces the number of subset evaluations per ranking algorithm to a value proportional to $\ln k$, so the total number of evaluated subsets is $F(k) \approx 5 \ln k$ (see the sketch after Figure 4).

  3. The proxy model evaluations on these subsets of features are carried out in parallel.

  4. Data sample $D_{\text{train}}^{\ast}$ is the subset that obtains the highest score when evaluated on the proxy model $P^{\ast}$ provided by the algorithm selection stage.

Figure 4.

OML4Py feature selection procedure (adapted from [24]).
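
To illustrate the size generator’s logarithmic behavior, the candidate subset sizes produced by a 20% growth factor can be enumerated as in the sketch below; this is our illustration, not the OML4Py code, and the starting size of 4 is an arbitrary assumption.

import math

def subset_sizes(k, start=4, growth=1.2):
    # Candidate feature-subset sizes grown by 20% each step; the number of
    # sizes is O(ln k), so five rankers give roughly 5 * ln(k) evaluations.
    sizes, s = [], float(start)
    while s < k:
        sizes.append(round(s))
        s *= growth
    sizes.append(k)                   # always include the full ranked set
    return sorted(set(sizes))

print(len(subset_sizes(175)), round(5 * math.log(175)))  # ~22 sizes vs. 5*ln(k) ~ 26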

The feature selection procedure is based on five algorithms from reference [30], for the best possible generalization for a wide diversity of data sets: (a) mutual information between each feature and the target, (b) one-way ANOVA F-value, (c) feature importance from random forests model, (d) feature importance from AdaBoost model, and (e) the average of the normalized values from (a–d).

3.4 Hyperparameter optimization

In OML4Py, hyperparameter optimization is done by the hyperparameter optimizer module (HyperGD), a parallel, gradient-based hyperparameter optimizer developed by the authors of Ref. [24].

A classical synchronous parallel hyperparameter optimizer [31] selects and evaluates a batch of hyperparameters in parallel tasks called trials. A Bayesian algorithm is typically used to rank these trials [32]. Because trials are synchronized before the next batch of hyperparameter values is selected, while trial-to-trial evaluation time can differ by orders of magnitude, such an approach causes inefficient use of resources.

In contrast, HyperGD achieves a high degree of parallelism by asynchronous collection and use of the best hyperparameters from all completed trials whenever a new trial is launched, as shown in Figure 5. HyperGD implements optimizations to perform quick adjustments of hyperparameters without compromising model performance. These optimizations are possible due to the use in HyperGD of a new algorithm, called gradient-based search space reduction (GrSSR), described in Ref. [24].

Figure 5.

OML4Py asynchronous parallel HyperGD optimizer (adapted from [24]).

Two approximations are made in HyperGD that are vital for a parallel asynchronous implementation of the hyperparameter optimization process.

  1. It is assumed that hyperparameters can be optimized independently of each other. This allows full parallelization of the search in all dimensions of the hyperparameters without any time restriction or penalty due to synchronizations.

  2. As can be seen in Figure 5, in order to narrow the initial search range toward the minimum of the error curve, $P$ pairs of points are picked for each hyperparameter $hp_i$ to estimate the gradients at those points: $v_i^1, v_i^2, \dots, v_i^P$, where each point $v_i^q$, $q \in 1..P$, is matched with another point in its $\varepsilon_i$ neighborhood ($\varepsilon_i$ is selected to suit the initial range of the search space of hyperparameter $hp_i$). GrSSR estimates the direction of the minimum by finding the intersection point of the gradients of the top two pairs, which must contain at least the best three trials (those with the smallest errors) in the batch. To eagerly reduce the search space, HyperGD waits for only $P_{\min} < P$ point pairs out of all trials in the current batch; specifically, it expects only three point pairs from the current batch to identify the best two pairs (so far) for the GrSSR algorithm. When the search space cannot be further narrowed, HyperGD uses gradient descent to fine-tune the hyperparameter values (a five-epoch descent with a learning rate of 0.1). A toy illustration of the range-narrowing idea follows this list.
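
The sketch below is our own toy simplification of the GrSSR range-narrowing idea for a single hyperparameter, not the HyperGD implementation: it estimates gradients at sampled points via finite differences, intersects the tangent lines of the two best points, and shrinks the search interval around the intersection.

import numpy as np

def grssr_narrow(loss, lo, hi, n_pairs=5, seed=0):
    # Toy one-dimensional search-space reduction (illustrative only)
    rng = np.random.default_rng(seed)
    eps = (hi - lo) * 0.01                    # neighborhood used to pair points
    xs = rng.uniform(lo, hi, n_pairs)
    ys = np.array([loss(x) for x in xs])
    gs = np.array([(loss(x + eps) - y) / eps for x, y in zip(xs, ys)])
    i, j = np.argsort(ys)[:2]                 # the two trials with smallest errors
    if np.isclose(gs[i], gs[j]):
        return lo, hi                         # parallel tangents: cannot narrow
    # Intersection of the two tangent lines approximates the minimum's location
    x_star = (ys[j] - ys[i] + gs[i] * xs[i] - gs[j] * xs[j]) / (gs[i] - gs[j])
    x_star = float(np.clip(x_star, lo, hi))
    half = (hi - lo) / 4                      # shrink the range around the estimate
    return max(lo, x_star - half), min(hi, x_star + half)

# Example: the interval quickly narrows toward the minimum at x = 3
print(grssr_narrow(lambda x: (x - 3.0) ** 2, 0.0, 10.0))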

3.5 Time budget feature

OML4Py implements a time budget feature to enable the optimizer to provide a model tuned as well as possible within the budgeted time frame. To accomplish this, a time budget argument is taken into account at every stage of the pipeline. Figure 6 illustrates the fallback strategy implemented to guarantee that the AutoML pipeline produces a tuned model regardless of the pipeline stage at which the time budget is exhausted.

Figure 6.

The fallback strategy of OML4Py (source: authors).

If the time budget is exhausted before algorithm selection is completed, the AutoML pipeline provides the Naive Bayes proxy model as the tuned model ($P_{NB}$). If the budget is exhausted during the adaptive data reduction stage, the dataset sample with the highest cross-validation score is selected ($D_{\text{train}}^{CV}$). If the budget is depleted before reaching the hyperparameter optimization stage (HyperGD), OML4Py outputs the proxy model associated with the algorithm selected in the initial stage; in this case, the proxy model is not tuned, and $\lambda_P$ represents the original hyperparameter configuration from meta-learning.

Finally, if the time budget is exhausted during the hyperparameter optimization stage, the best hyperparameter configuration $\lambda^{CV}$ is selected, based on the maximum cross-validation score.


4. Auto-sklearn

The two major issues in AutoML are the ranking of algorithms (given that no machine learning method works best on all datasets) and hyperparameter optimization, a very important step for tuning some machine learning algorithms (e.g., k-nearest neighbors, random forest, nonlinear support vector machines). The two problems can be combined into one optimization problem; the formalization of the combined algorithm selection and hyperparameter optimization (CASH) problem is presented below.

Let $\mathcal{A} = \{A^{(1)}, \dots, A^{(R)}\}$ be a set of algorithms, and let the hyperparameters of each algorithm $A^{(j)}$ have domain $\Lambda^{(j)}$. A given dataset $D_{\text{train}}$ is split into $K$ cross-validation folds $\{D_{\text{valid}}^{(1)}, \dots, D_{\text{valid}}^{(K)}\}$ and $\{D_{\text{train}}^{(1)}, \dots, D_{\text{train}}^{(K)}\}$ such that $D_{\text{train}}^{(i)} = D_{\text{train}} \setminus D_{\text{valid}}^{(i)}$ for $i = 1, \dots, K$. Finally, let $\mathcal{L}(A_{\lambda}^{(j)}, D_{\text{train}}^{(i)}, D_{\text{valid}}^{(i)})$ denote the loss that algorithm $A^{(j)}$ achieves on $D_{\text{valid}}^{(i)}$ when trained on $D_{\text{train}}^{(i)}$ with hyperparameters $\lambda$. The CASH problem is to find the joint best algorithm $A^{\ast}$ and hyperparameter setting $\lambda^{\ast}$ that minimize the average loss function [25]:

$$A^{\ast}, \lambda^{\ast} \in \operatorname*{argmin}_{A^{(j)} \in \mathcal{A},\; \lambda \in \Lambda^{(j)}} \frac{1}{K} \sum_{i=1}^{K} \mathcal{L}\!\left(A_{\lambda}^{(j)}, D_{\text{train}}^{(i)}, D_{\text{valid}}^{(i)}\right) \tag{2}$$

The optimization process modeled by Eq. (2) is approached in Auto-sklearn by Bayesian optimization. More precisely, Auto-sklearn uses the random-forest-based Bayesian optimization method SMAC (sequential model-based algorithm configuration) [31] to solve the CASH problem.
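
For intuition, Eq. (2) can be made concrete with a naive search over a tiny algorithm/hyperparameter space. The sketch below is our illustration using scikit-learn with a placeholder dataset; Auto-sklearn replaces the random sampling with SMAC’s model-based search.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)  # placeholder dataset
rng = np.random.default_rng(0)

# Candidate algorithms with simple hyperparameter samplers (illustrative domains)
search_space = {
    RandomForestClassifier: lambda: {"n_estimators": int(rng.integers(10, 200))},
    LogisticRegression: lambda: {"C": float(10 ** rng.uniform(-3, 3)), "max_iter": 1000},
}

best = (None, None, np.inf)
for algorithm, sample_params in search_space.items():
    for _ in range(10):                        # naive random search per algorithm
        params = sample_params()
        # Average loss over K = 5 cross-validation folds, as in Eq. (2)
        loss = 1.0 - cross_val_score(algorithm(**params), X, y, cv=5).mean()
        if loss < best[2]:
            best = (algorithm.__name__, params, loss)

print(best)   # the (A*, lambda*) pair with the smallest average CV loss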

As can be seen in Figure 7, the Auto-sklearn pipeline begins with a meta-learning step to warmstart the Bayesian optimization procedure [25]. In the meta-learning stage, for a large number of datasets, performance data and a set of meta-features are collected to allow Auto-sklearn to determine the appropriate algorithm for a new dataset. Meta-learning is performed offline, and for each ML dataset in the dataset repository (140 datasets from the OpenML [33] repository), an instance of the ML framework with the best results on that dataset is determined and stored. Meta-learning is used to warmstart Bayesian optimization of hyperparameters, by selecting k stored configurations and using their result to seed Bayesian optimization.

Figure 7.

The Auto-sklearn pipeline with meta-learning and automated ensemble construction (adapted from [25]).

Auto-sklearn’s approach is to store all models trained during Bayesian hyperparameter optimization and to use an efficient post-processing method to build a weighted ensemble out of them. The weights are adjusted using the predictions of all individual models on a hold-out set. Following the experiments performed, ensemble selection [34] was chosen as the weight-optimization method.

The underlying ML framework of Auto-sklearn is scikit-learn [30]. Auto-sklearn 1.0 comprises 15 classification algorithms, 14 feature preprocessing methods, and 4 data preprocessing methods, and it handles 110 conditional hyperparameters (active only if their respective component is selected) [25].

Auto-sklearn 2.0 [26] uses successive halving bandits (SH) [35] as an alternative to allocate more resources to promising ML pipelines. This approach is used to increase efficiency for cases with tight resource limitations by aggressively pruning poor-performing pipelines.


5. H2O AutoML

The H2O AutoML interface streamlines the machine learning workflow by incorporating pipeline stages for the automatic training and tuning of a diverse set of candidate models within a user-defined time limit. The outcome of the AutoML run is a “leaderboard,” which is a ranked list of optimized models that can be stored for future use in a production environment.

A set $M_P = \{M_P^{(i)}\}, i = 1, \dots, k$ of prespecified models is included to provide reliable defaults for the implemented algorithms $A^{(i)}, i = 1, \dots, k$. For each algorithm $A^{(i)}$, the most important hyperparameters with their defined ranges are denoted by $\lambda_P^{(i),\Delta}$. Random search is used to generate models $M_P^{(i)}$ with hyperparameters $\lambda_P^{(i)} \in \lambda_P^{(i),\Delta}$. The order of the algorithms, $P_i, i = 1, \dots, k$, is set to start with models that consistently provide good results (e.g., XGBoost, generalized linear models (GLM)). After this set of prescribed models is trained and added to the leaderboard, a random search and fine-tuning are performed across those same algorithms. The proportion of time $T_i$ spent on each algorithm $A^{(i)}$ is explicitly defined, so that more time is allocated to certain algorithms (e.g., XGBoost gradient boosting machines (GBM), H2O GBM) than to others (e.g., H2O deep learning), based on the perceived or estimated “value” of each specific task. If a better model is found, it is added to the leaderboard. The H2O AutoML pipeline is depicted in Figure 8.

Figure 8.

H2O AutoML pipeline (source: Authors).

After training the base models, H2O’s stacked ensemble algorithm is used to train two stacked ensemble models [36]. Ensemble methods are machine learning techniques that can provide better accuracy by combining several base models instead of using a single model.

The All Models ensemble encompasses all the models generated, while the Best of Family ensemble comprises the best-performing model from each algorithm class or family. More precisely, the latter includes one H2O GBM, one XGBoost GBM, one H2O extremely randomized trees model, one H2O random forest, one H2O GLM, and one H2O deep learning model [16].

By default, H2O utilizes a super learner algorithm [37] to train the meta-learner in the stacked ensemble, using the k-fold cross-validated predictions from the base learners. When the training data are very large or there is a time dependency across rows in the dataset, a holdout blending frame is used instead to train the meta-learner [27].


6. Benchmarking experiment

The experiment compares the OML4Py, Auto-sklearn, and H2O AutoML frameworks under standard usage conditions but with an imposed time limit for providing the optimal ML model.

The problem we are solving is to identify customers with a higher likelihood of churning from MovieStream streaming services to a different movie streaming company. MovieStream is a fictitious movie streaming service described in Ref. [38].

The dataset for this task has 175 features and 130,000 observations.

To solve real-world problems, it is necessary to evaluate the performance of classification models so that the models with the best results are used in production. The classification models were evaluated using gain charts and various performance metrics based on the confusion matrix.

6.1 Model evaluation using a gain chart

Gain charts can be used to evaluate predictive machine learning models by charting modeling statistics in a visualization tool. Gain is a metric that measures the effectiveness of a classification model. It is calculated as the ratio between the results obtained by using the model and the results obtained without it (random results).

In our example, the machine learning model is used to identify customers with a higher likelihood of churning from MovieStream streaming services to a different movie streaming company. We will call such a client with an actual value of churning equal to “yes” as being a “positive target.” Each prediction is made with a probability (prediction confidence) and observations are arranged in the decreasing order of prediction probabilities of positive targets.

The cumulative gains chart shows the percentage of the overall number of positive targets “gained” by targeting a percentage of the total number of cases. The dataset used to draw a gain chart has the following columns [39]:

  • Cumulative Gain—the ratio of the cumulative number of positive targets up to that percentile, to the total number of positive targets.

  • Gain Chart Baseline—the overall response rate: the line represents the percentage of positive records we expect to get if the prediction is made randomly.

  • Ideal Model Line—the gain curve of an ideal model, which ranks all positive targets ahead of all other cases.

  • Optimal Gain—this indicates the optimum number of customers to contact through a marketing campaign to avoid churning. The cumulative gain curve will flatten beyond this point.

The gain chart can be used to analyze statistics generated by machine learning classification models to determine the best model to use. The closer the cumulative gains line is to the top-left corner of the chart, the greater the gain. This indicates a higher proportion of positive targets reached for a lower proportion of customers considered.

The Optimal Gain is the maximum length segment between the Cumulative Gain and Gain Chart Baseline curves and can be used to evaluate the performance of the classification model. The farther above the baseline the Cumulative Gain curve lies, the greater the gain.
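
The quantities above can be computed directly from a model’s predicted probabilities, as in the minimal sketch below (our illustration; `y_true` and `y_proba` are placeholder arrays of actual labels and prediction confidences).

import numpy as np

def cumulative_gain(y_true, y_proba):
    # Rank cases by decreasing prediction confidence for the positive class
    order = np.argsort(-np.asarray(y_proba))
    hits = np.asarray(y_true)[order]
    gain = np.cumsum(hits) / hits.sum()              # Cumulative Gain column
    pct = np.arange(1, len(hits) + 1) / len(hits)    # percentage of cases targeted
    baseline = pct                                    # random-prediction baseline
    optimal = np.argmax(gain - baseline)              # Optimal Gain point
    return pct, gain, baseline, optimal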

6.2 Performance metrics based on confusion matrix

Using the confusion matrix, various metrics can be calculated to evaluate how well machine learning algorithms classify data into the corresponding labels.

Confusion matrix $C$ is a square matrix where $C_{i,j}$ represents the number of data instances that are known to be in actual group $i$ (true label) and predicted to be in group $j$ (predicted label) [1]. In the case of binary classification, $i, j \in \{0, 1\}$, and the interpretation of the elements of the confusion matrix is presented in Table 1.

$C_{0,0} = TN$ (True Negative): the count of data instances where both the actual and predicted labels are negative.
$C_{1,1} = TP$ (True Positive): the count of data instances where both the actual and predicted labels are positive.
$C_{0,1} = FP$ (False Positive): the count of data instances where the actual label is negative, but the model predicts it as positive.
$C_{1,0} = FN$ (False Negative): the count of data instances where the actual label is positive, but the model predicts it as negative.

Table 1.

Definition of the confusion matrix elements.

The values of elements of the confusion matrix were used to calculate the classification metrics presented in Table 2.

$$\text{Accuracy} = \frac{TN + TP}{TN + FP + TP + FN} \qquad \text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad \text{F1 Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Table 2.

Performance measures in machine learning classification models.
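
These formulas can be cross-checked against scikit-learn, as in the short sketch below; the label vectors are placeholders.

from sklearn.metrics import confusion_matrix, f1_score

y_test = [0, 0, 1, 1, 1, 0, 1, 0]   # placeholder actual labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # placeholder predicted labels

# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy = (tn + tp) / (tn + fp + tp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
assert abs(f1 - f1_score(y_test, y_pred)) < 1e-12   # agrees with sklearn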


7. Evaluation of AutoML models: performance and explainability

We will present practical aspects of building, tuning, and training models that will then be used in predictions on test data.

An important step is to save these trained models for later use. We will also present some aspects related to data preprocessing, depending on the framework used.

7.1 OML4Py practical approach

OML4Py is available in ML notebooks in cloud environments but can also be installed in on-premises databases [40].

ML Notebooks is a collaborative web-based development environment, built on top of Apache Zeppelin, for creating machine learning notebooks. Notebooks work with backend interpreters that run SQL statements, PL/SQL scripts, and Python scripts.

The whole AutoML process, from the selection of the algorithm to the production deployment of the tuned model can be found at https://github.com/rc-iit/AutoML.
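
As an orientation, a minimal OML4Py run has the shape sketched below. The connection details, the CUSTOMERS table, and the CHURN column are placeholders, and the class and argument names follow the OML4Py AutoML API as we understand it, so they may vary across versions.

import oml
from oml import automl

# Connect to an Oracle Database where the data already resides (placeholders)
oml.connect(user="oml_user", password="...", dsn="mydb_high")

customers = oml.sync(table="CUSTOMERS")   # proxy object; data stays in-database
X, y = customers.drop("CHURN"), customers["CHURN"]

# ModelSelection runs the full pipeline: algorithm ranking, adaptive data
# reduction, and hyperparameter tuning, within the given time budget (seconds)
ms = automl.ModelSelection(mining_function="classification",
                           score_metric="accuracy", parallel=4)
best_model = ms.select(X, y, time_budget=300)
print(best_model)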

Model evaluation can be done using gain charts (represented in Figure 9) and performance metrics (whose values are presented in Table 3).

Figure 9.

Gain charts are used to evaluate the performance of the best four classification models provided by OML4Py.

| Model | Dataset | Accuracy | Precision | Recall | F1 Score |
|-------|---------|----------|-----------|--------|----------|
| GLMR  | Train   | 0.857    | 0.069     | 0.921  | 0.129    |
| GLMR  | Test    | 0.858    | 0.071     | 0.907  | 0.131    |
| SVMG  | Train   | 0.910    | 0.108     | 0.958  | 0.195    |
| SVMG  | Test    | 0.904    | 0.102     | 0.912  | 0.183    |
| DT    | Train   | 0.994    | 0.901     | 0.551  | 0.684    |
| DT    | Test    | 0.993    | 0.867     | 0.515  | 0.646    |
| RF    | Train   | 0.995    | 0.966     | 0.540  | 0.693    |
| RF    | Test    | 0.994    | 0.950     | 0.502  | 0.657    |

Table 3.

Performance metrics are used to evaluate the performance of the top-ranked OML4Py classification models.

The cream color marks the best results on the test data and the green color highlights the best results on the training data.

The performance metrics are based on the confusion matrices described in Figure 10. The results were obtained by scoring the ML models using the test dataset as input.

Figure 10.

Confusion matrices representing predictions vs. actuals on test data for each of the four classification models. (a) GLMR model; (b) SVMG model; (c) DT model; (d) RF model.

7.2 Auto-sklearn practical approach

Auto-sklearn is not supported on the Windows operating system, but a Linux distribution installed under Windows Subsystem for Linux (WSL) can be used. The steps required to obtain the best model are described at https://github.com/rc-iit/AutoML.
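
A minimal Auto-sklearn run under a five-minute budget looks like the sketch below; this is our illustration, with a synthetic dataset standing in for the MovieStream data.

import autosklearn.classification
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=1)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,   # overall five-minute budget, in seconds
    per_run_time_limit=30,         # cap for each individual model fit
)
automl.fit(X_train, y_train)

print(automl.leaderboard())        # models found within the budget
y_pred = automl.predict(X_test)    # predictions of the final ensemble
print(accuracy_score(y_test, y_pred))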

The gain chart and the confusion matrix obtained by applying the ensemble model on the test data set are presented in Figures 11 and 12, respectively.

Figure 11.

Gain chart for the Auto-sklearn model.

Figure 12.

Confusion Matrix on test data for the Auto-sklearn model.

The PipelineProfiler package is a very useful tool for interactive analysis and inspection of the various models created by Auto-sklearn. For each model, the steps and algorithms used in its construction are displayed, and clicking on one draws a flowchart of the pipeline. The code for displaying the interactive chart is shown below:

import PipelineProfiler

# Extract the pipelines evaluated by the fitted Auto-sklearn estimator ("model")
profiler_data = PipelineProfiler.import_autosklearn(model)

# Render the interactive pipeline-matrix visualization
PipelineProfiler.plot_pipeline_matrix(profiler_data)

As can be seen from Figure 13, the ensemble model provided by Auto-sklearn is based on several algorithms (e.g., AdaBoost, random forest, etc.).

Figure 13.

The interactive visualization tool for inspecting the ensemble model provided by Auto-sklearn.

The performance metrics of the best Auto-sklearn model are presented in Table 4.

| Model | Dataset | Accuracy | Precision | Recall | F1 Score |
|-------|---------|----------|-----------|--------|----------|
| OML4Py SVMG | Train | 0.910 | 0.108 | 0.958 | 0.195 |
| OML4Py SVMG | Test  | 0.904 | 0.102 | 0.912 | 0.183 |
| OML4Py RF | Train | 0.995 | 0.966 | 0.540 | 0.693 |
| OML4Py RF | Test  | 0.994 | 0.950 | 0.502 | 0.657 |
| Auto-sklearn ensemble model | Train | 0.997 | 0.984 | 0.749 | 0.851 |
| Auto-sklearn ensemble model | Test  | 0.995 | 0.946 | 0.612 | 0.743 |
| H2O AutoML GBM | Train | 0.999 | 0.956 | 0.975 | 0.966 |
| H2O AutoML GBM | Test  | 0.995 | 0.814 | 0.714 | 0.761 |
| H2O AutoML ensemble model | Train | 0.999 | 0.971 | 0.971 | 0.971 |
| H2O AutoML ensemble model | Test  | 0.994 | 0.874 | 0.705 | 0.780 |

Table 4.

Performance metrics of the best models provided by analyzed AutoML pipelines.

The cream color marks the best results on the test data and the green color highlights the best results on the training data.

7.3 H2O practical approach

H2O’s AutoML module provides classes and functions that perform a large number of modeling-related tasks using a few lines of code. The tool can be utilized for automating the machine learning workflow, encompassing the automatic training and tuning of numerous models within a time limit specified by the user. In the following, we will use H2O AutoML for auto-model selection and tuning. Then, the best model (“leader”) is used to predict the output.

The best classification model provided by H2O AutoML is based on gradient boosting machine (GBM) algorithm. Using the H2O’s stacked ensemble method, the Best of Family ensemble model provides the best results.

The H2O explainability interface is a convenient wrapper around a number of explainability methods and visualizations in H2O. The explain() function produces a list of explanations, each representing an individual unit of explanation, such as a partial dependence plot, a feature importance plot, or a SHapley Additive exPlanations (SHAP) summary of the top tree-based model.

explain_model = aml.leader.explain(htrain)

The feature importance plot (Figure 14) shows the overall importance of the most important features in the model.

Figure 14.

The feature importance plot for the GBM model.

SHAP summary plot (Figure 15) shows the contribution of the features for each instance (row of data).

Figure 15.

SHAP summary plot for the GBM model.

In the SHAP chart, the x-axis represents the SHAP value and the y-axis lists the features. A positive SHAP value means a positive impact on the prediction, leading the model to predict 1 (e.g., churning from streaming services). A negative SHAP value means a negative impact, leading the model to predict 0 (e.g., the customer did not churn from streaming services).

Each point on the chart corresponds to one SHAP value for a prediction and a specific feature. The color scheme indicates the magnitude of the feature value, with red indicating a higher value and blue representing a lower value.

The gain charts and the confusion matrices obtained by applying the best models on the test dataset are presented in Figures 16 and 17, respectively.

Figure 16.

Gain charts for the H2O’s AutoML models. (a) H2O GBM model and (b) H2O ensemble model.

Figure 17.

Confusion matrices on test data for the H2O’s AutoML models. (a) H2O GBM model and (b) H2O ensemble model.

The performance metrics of the H2O leader models are presented in Table 4.

The implementation details for obtaining the results presented in this section can be found at https://github.com/rc-iit/AutoML.


8. Model-agnostic explainability using SHAP framework

Model explanation consists of using specialized tools to examine the model architecture, highlight the significance of features, and interpret the predictions generated by machine learning models. The SHAP framework, introduced in Ref. [41], is designed to provide a thorough explanation of the outcomes generated by any machine learning model. Shapley values can be regarded as a metric indicating the importance of each individual input feature’s contribution to the predicted values of the model.

The global interpretability of the SHAP values enables comprehension of the overall influence of each feature on the predictions made by the model. SHAP charts associated with the concept of global interpretability were presented in Figures 14 and 15, respectively.

By utilizing the set of SHAP values assigned to each instance in the dataset, local interpretability seeks to explain why a specific instance receives a particular prediction. It aims to provide a detailed account of the individual contributions of the predictors in that specific context. This approach significantly improves the transparency of the predictive model and substantially increases confidence in the predictions made.

The waterfall plot offers a visual representation illustrating how various factors and predictors contribute incrementally to the final prediction. Figure 18a illustrates an example of a customer predicted to have a probability of 0.28 for churning from streaming services. Figure 18b presents the impact of the predictors for a customer with a probability of 0.64 for churning from streaming services.

Figure 18.

Waterfall plots for the H2O’s AutoML model (GBM) (a) predicted probability = 0.28 (b) predicted probability = 0.64 for churning from streaming services.

On the x-axis, E[f(x)] denotes the average predicted values across the testing dataset. The bars on the y-axis are arranged in descending order based on the absolute importance of the impact of features on the predicted value. A red bar signifies a positive contribution to the predicted value, while a bar of a different color indicates a negative contribution from the corresponding feature. The label on a bar indicates the deviation (positive or negative) from the baseline model prediction value assigned to that feature.

Decision plots illustrate the model prediction together with the magnitude and direction in which each feature’s value deviates from its reference value.

The direction of deviation also indicates whether the feature positively influences the model’s decision or if it has a negative impact. When the direction of deviation is toward the right, it suggests that the feature is positively influencing the model outcome. Conversely, if the direction of deviation is toward the left, it signifies the negative influence of the feature on the model outcome.

The diagrams in Figure 19 illustrate the use of decision plots to compare the same data instances represented in Figure 18, for providing local explainability.

Figure 19.

Decision plots for the H2O’s AutoML model (GBM) (a) predicted probability = 0.28 (b) predicted probability = 0.64 for churning from streaming services.

For the explanation of models provided by AutoML frameworks, which automatically select the prediction algorithms, it is very important to calculate SHAP values using a model-agnostic method. The SHAP framework incorporates a universal SHAP explainer that uses a kernel-based estimation approach for Shapley values. This functionality is provided by the KernelExplainer class, which can be applied to any machine learning model.
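
A sketch of this model-agnostic use of KernelExplainer is shown below. It assumes the H2O AutoML objects from Section 7 (`aml`) and pandas data (`X_train`, `X_test`, `feature_names`) are already in scope, and the prediction wrapper is our own illustration rather than a SHAP or H2O API.

import h2o
import pandas as pd
import shap

# Wrapper: KernelExplainer only needs a function mapping a NumPy array of
# rows to the positive-class probability predicted by the black-box model
def predict_proba(rows):
    frame = h2o.H2OFrame(pd.DataFrame(rows, columns=feature_names))
    return aml.leader.predict(frame)["p1"].as_data_frame().to_numpy().ravel()

background = shap.sample(X_train, 50)      # small background set keeps runtime low
explainer = shap.KernelExplainer(predict_proba, background)
shap_values = explainer.shap_values(X_test[:10])

# Local explanation for a single instance, in the spirit of Figures 18 and 19
shap.decision_plot(explainer.expected_value, shap_values[0], X_test.iloc[0])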

The SHAP framework plays a pivotal role in enhancing the interpretability of machine learning models, thereby facilitating responsible and ethical use of AI systems [42]. Details related to the mathematical framework underlying the calculation of SHAP values are presented in Ref. [43]. Also, several types of SHAP diagrams are presented and explained, and the necessary code is provided to implement wrappers for H2O and Auto-sklearn models for use with KernelExplainer.

The implementation details related to the explainability of AutoML models using SHAP framework can be found at https://github.com/automl-mets/MetS.


9. Results and discussion

Analyzing the data in Table 4, we notice that there is no model that has the best values for all performance metrics. However, we must point out that this comparison is not the main purpose of this work. A comprehensive comparative analysis of the performance of various AutoML frameworks (TPOT, hpsklearn, Auto-sklearn, Random, ATM, and H2O), can be found in Ref. [10].

It should also be noted that accuracy has been used as a score metric. This metric, used to evaluate the relative performance of optimized models, must be chosen based on the problem being solved and may change the ranking order of the determined top models.

All three analyzed frameworks provided the prediction models in a limited interval of five minutes.

The experiments were carried out on an Intel Core i7-5500U 2.40 GHz configuration (mobile version) with 16 GB RAM. Figure 20 contains information about the time required by each AutoML framework to generate and optimize the ML models.

Figure 20.

The time (in seconds) required for each framework to generate and optimize the prediction models.

Machine learning models are trained using historical data, and when deployed in real-world scenarios, they can become outdated, resulting in a decline in accuracy over time. This phenomenon is known as drift. Model drift refers to the degradation of a machine learning model’s predictive performance over time due to changes in the environment, data distribution, or other external factors. To mitigate model drift, it is crucial to monitor and regularly retrain models. In the AutoML approach, updating the prediction model involves not only retraining but also the potential change of the prediction algorithm itself.

In this perspective, setting a time limit for the generation of prediction models by AutoML frameworks is crucial, particularly in the context of systems operating in near-real-time mode. This feature allows for efficient management of computational resources and ensures timely model deployment.

From Figure 20, it can be deduced that the iteration-free architecture of OML4Py has an important advantage in the speed of generating predictive models. However, due to its distributed architecture, it is expected that H2O will perform better if high computational resources are available.

Certainly, an important question arises: can increasing the time budget lead to the generation of more efficient models? The allocation of additional time could potentially result in better-optimized models, providing an avenue for improved prediction performance. It highlights the trade-off between computational resources and the quality of the models produced within a given timeframe. Balancing computational efficiency with the need for optimal model performance becomes a critical consideration.

Figure 21 presents the comparative results of the performance evaluation of all generated models, using gain charts. It can be observed that the OML4Py DT model, although not included in Table 4, should be taken into account under the gain chart evaluation metric.

Figure 21.

The maximum gain and the percentage of the population for which it is achieved.

Building good ML pipelines is a long and expensive endeavor, requiring advanced expertise provided by data scientists and domain experts. For this reason, practitioners often use a suboptimal default ML pipeline.

In this research we aimed to increase confidence in the use of AutoML frameworks, presenting some architectural details and comparative results, obtained without special optimizations, generally using the default parameters recommended by those systems.

To accelerate successful model deployment, the practical steps required for each framework analyzed were presented at https://github.com/rc-iit/AutoML. Each of the three systems has qualities that recommend it as the selected AutoML framework for a future project.

OML4Py minimizes the movement of data, allowing ML algorithms to transform and process data directly in the database. Using database virtual columns, the trained models can be used for scoring new data from a table on the fly. The iteration-free architecture provides speed to the execution of this framework. In the performed experiment, it consumed only half of the allocated time budget. Nevertheless, additional tests should be conducted to compare the performance of OML4Py with other AutoML frameworks, especially when relaxing time constraints.

H2O is a distributed machine learning platform designed to scale to very large datasets, with APIs in R, Python, and Java. The H2O AutoML interface was first released in June 2017 and is constantly evolving and improving with each new version of H2O. H2O AutoML is also available through a web GUI called H2O Flow. H2O uses in-memory data compression, allowing it to handle billions of rows in memory even with a small cluster. H2O can run as a small cluster on a local desktop or scale across multiple nodes with Hadoop, an Amazon Elastic Compute Cloud (EC2) cluster, or Spark. In the experiment we performed, H2O obtained the most accurate predictions. H2O is expected to exploit distributed or high-performance computing architectures better than the other AutoML solutions.

Auto-sklearn is a robust AutoML framework that uses scikit-learn [30] as its underlying ML framework. Modified versions of Auto-sklearn 1.0 won the first and second ChaLearn AutoML challenges, which evaluated AutoML systems in a systematic way under rigid time and memory constraints (the AutoML systems were required to deliver predictions in less than 20 minutes) [26]. Further development of the system has materialized in Auto-sklearn 2.0, which is still in the experimental stage and for this reason was not used in this work.

A comparative analysis of the three analyzed frameworks is presented in Table 5.

| Criteria | OML4Py | Auto-sklearn | H2O AutoML |
|----------|--------|--------------|------------|
| Free use | Yes (cloud free tier) | Yes (open source) | Yes (open source) |
| Implementation languages | Python, PL/SQL, SQL | Python | Java |
| Support for model explainability | MLX (machine learning explainability) Python module | sklearn.inspection module | H2O explainability interface |
| GUI interface | AutoML UI (Cloud) | No | H2O Flow, a web-based GUI |
| Web-based interactive development environment | ML Notebooks, based on Apache Zeppelin technology | Yes (e.g., Jupyter Notebook) | Yes (e.g., Jupyter Notebook) |
| Tight database integration | Allows saving a model in the database, restoring a model from the database, and accessing a model from native SQL functions | No | No |
| Support for parallelism | The OML4Py pipeline exploits both internode and intranode parallelism | Uses the Dask.distributed library for distributed computing in Python | A distributed, in-memory machine learning platform using the Java Fork/Join framework for multi-threading |
| Model persistence | Allows saving/restoring ML models in/from BLOB (binary large object) fields of database tables | Open Neural Network Exchange (ONNX) format via sklearn-onnx, Predictive Model Markup Language (PMML) via sklearn2pmml, or joblib for simple serialization/de-serialization of Python objects (models) | Allows conversion of ML models to a Plain Old Java Object (POJO) format or a Model ObJect, Optimized (MOJO) format |
| Model deployment as a REST API | The REST API for embedded Python execution allows execution of user-defined Python functions and model deployment through a REST endpoint | No native support; third-party solutions can be used, e.g., Flask-RESTful | The H2O REST API gives a client access to all H2O capabilities |
| Disadvantages | In the Cloud Free Tier, computing resources are limited (1 CPU supporting two simultaneous threads) | May require a very large amount of computation on small datasets, unless a time limit is imposed [44] | Requires a 64-bit Java runtime environment in any usage scenario (R, Python, standalone) |
| Advantages | “Moves the algorithms to the data”: delivers in-database machine learning algorithms for building models and scoring data where the data resides; scoring data with ML models stored in the database using SQL queries is very fast | Based on machine learning models from the scikit-learn library; confirmed results in competitions (won the first and second ChaLearn AutoML challenges) | Highly scalable, distributed engine written in Java; data are read in parallel and distributed across the H2O cluster for processing by ML algorithms implemented on top of H2O’s distributed map/reduce framework; H2O Flow, H2O’s web user interface, is a versatile tool that can effectively exploit the H2O AutoML capabilities |

Table 5. A brief comparison of OML4Py, Auto-sklearn, and H2O AutoML.
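
To illustrate the model-persistence row of Table 5, the following is a hedged sketch (reusing the `automl` and `aml` objects from the earlier sketches) of the scikit-learn-style and H2O-style approaches:

```python
import joblib
import h2o

# Auto-sklearn / scikit-learn style: plain Python object serialization.
joblib.dump(automl, "automl_model.joblib")
restored = joblib.load("automl_model.joblib")

# H2O style: export the leader as a self-contained MOJO artifact that can
# later be scored from Java without a running H2O cluster.
mojo_path = aml.leader.download_mojo(path="./models")

# Alternatively, persist the model in H2O's native binary format.
binary_path = h2o.save_model(model=aml.leader, path="./models")
```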

Of course, there are many AutoML frameworks with outstanding results, which were not considered in this article. A brief presentation of some of them can be found in Ref. [45].

However, the use of an AutoML pipeline in production should be analyzed with care. For instance, AutoGluon-Tabular is an AutoML framework that implements robust data processing (to provide end-to-end AutoML capabilities), is based on deep learning, and accomplishes powerful model ensembling using multi-layer stacking and repeated k-fold bagging. Although the results presented in Ref. [45] underline the outstanding performance of this pipeline under the respective conditions, at least two aspects must be considered.

First, AutoGluon implicitly uses only six standard models (neural networks, LightGBM and CatBoost boosted trees, random forests, extremely randomized trees, and k-nearest neighbors) for training, tuning, and ensembling. Adding another model to the set of models used in the AutoGluon pipeline must be done manually by the user.

Second, AutoGluon-Tabular trains models sequentially and thus does not efficiently exploit parallel architectures, which can affect the quality of predictions under limited time budgets.
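
For context, a minimal AutoGluon-Tabular sketch (the dataset path, label column, and time limit are illustrative):

```python
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset("train.csv")  # placeholder dataset

# The 'best_quality' preset enables the multi-layer stacking and bagging
# described above; tighter time_limit values may reduce ensemble quality.
predictor = TabularPredictor(label="target").fit(
    train_data,
    time_limit=600,
    presets="best_quality")

print(predictor.leaderboard())  # all trained models and the stacked ensembles
```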

In conclusion, complex AutoML systems generally operate as “black boxes”: the techniques they use to build ML models are hidden from users, and decision-makers may therefore not trust the results they provide.

In this research, we have addressed these issues by presenting architectural and implementation details of different AutoML frameworks, as well as various tools for explaining, evaluating, and inspecting the generated models, with the aim of helping stakeholders gain confidence in AutoML predictive models by finding appropriate answers to the questions Q1, Q2, and Q3 stated above.

From a practical standpoint, in a business environment the repeatability of AutoML output given the same set of inputs is often critical for global compliance, confidence, and explainability. This is an important criterion that we will consider in future analysis and development of AutoML frameworks.
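
As a hedged illustration, both H2O AutoML and Auto-sklearn expose a `seed` parameter, although a fixed seed alone does not guarantee bit-identical results on multi-threaded or distributed backends:

```python
from h2o.automl import H2OAutoML
import autosklearn.classification

# For H2O, reproducibility additionally requires a model-count budget rather
# than a wall-clock budget, and excluding the (non-reproducible) deep
# learning models.
aml = H2OAutoML(max_models=20, seed=42, exclude_algos=["DeepLearning"])

# Auto-sklearn accepts a seed for its Bayesian-optimization search.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=600, seed=42)
```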

Acknowledgments

This work is financed by Lucian Blaga University of Sibiu through the research grant LBUS-IRG-2023.

References

1. Shafiabady N, Hadjinicolaou N, Din FU, Bhandari B, Wu RMX, Vakilian J. Using artificial intelligence (AI) to predict organizational agility. PLoS One. 2023;18(5):e0283066. DOI: 10.1371/journal.pone.0283066
2. Ganaie MA, Minghui H, Malik AK, Tanveer M, Suganthan PN. Ensemble deep learning: A review. Engineering Applications of Artificial Intelligence. 2022;115:105151. DOI: 10.1016/j.engappai.2022.105151
3. O’Neil C. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. New York, USA: Crown Publishing Group; 2016. 272 p
4. Ross C, Swetliz I. IBM’s Watson supercomputer recommended “unsafe and incorrect” cancer treatments, internal documents show. In STAT; 25 July 2018 [Internet]. Available from: https://www.statnews.com/2018/07/25/ibm-watson-recommended-unsafe-incorrect-treatments/ [Accessed: January 26, 2024]
5. Salzberg S. Why Google Flu Is a Failure [Internet]. 2014. Available from: https://www.forbes.com/sites/stevensalzberg/2014/03/23/why-google-flu-is-a-failure/ [Accessed: January 26, 2024]
6. Lazer D, Kennedy R, King G, Vespignani A. The parable of Google flu: Traps in big data analysis. Science. 2014;343(6176):1203-1205. DOI: 10.1126/science.1248506
7. Dastin J. Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women. In Reuters [Internet]. 2018. Available from: https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G [Accessed: January 26, 2024]
8. Goodman B, Flaxman S. European Union regulations on algorithmic decision-making and a “right to explanation”. AI Magazine. 2017;38(3):50-57. DOI: 10.1609/aimag.v38i3.2741
9. Research and Markets. Global Automated Machine Learning (AutoML) Market Report [Internet]. 2023. Available from: https://www.researchandmarkets.com/reports/5896115/global-automated-machine-learning-automl [Accessed: January 26, 2024]
10. Zoller M, Huber MF. Benchmark and survey of automated machine learning frameworks. Journal of Artificial Intelligence Research. 2021;70:409-474. DOI: 10.1613/jair.1.11854
11. Carlsson K. Your Friendly Neighborhood AutoML-Empowered Data Scientist [Internet]. 2020. Available from: https://www.forrester.com/blogs/your-friendly-neighborhood-automl-empowered-data-scientist [Accessed: January 26, 2024]
12. Business process model and notation™ (BPMN™) Version 2.0. The Object Management Group (OMG) [Internet]. 2024. Available from: https://www.omg.org/spec/BPMN [Accessed: January 26, 2024]
13. Imrie F, Cebere B, McKinney EF, van der Schaar M. AutoPrognosis 2.0: Democratizing diagnostic and prognostic modeling in healthcare with automated machine learning. PLOS Digital Health. 2023;2(6):e0000276. DOI: 10.1371/journal.pdig.0000276
14. Paladino LM, Hughes A, Perera A, Topsakal O, Akinci TC. Evaluating the performance of automated machine learning (AutoML) tools for heart disease diagnosis and prediction. AI. 2023;4(4):1036-1058. DOI: 10.3390/ai4040053
15. Musigmann M, Akkurt BH, Krähling H, Nacul NG, Remonda L, Sartoretti T, et al. Testing the applicability and performance of auto ML for potential applications in diagnostic neuroradiology. Scientific Reports. 2022;12(1):13648. DOI: 10.1038/s41598-022-18028-8
16. Musigmann M, Nacul NG, Kasap DN, Heindel W, Mannil M. Use test of automated machine learning in cancer diagnostics. Diagnostics. 2023;13(14):2315. DOI: 10.3390/diagnostics13142315
17. Zhuhadar LP, Lytras MD. The application of AutoML techniques in diabetes diagnosis: Current approaches, performance, and future directions. Sustainability. 2023;15(18):13484. DOI: 10.3390/su151813484
18. Krauß J, Pacheco BM, Zang HM, Schmitt RH. Automated machine learning for predictive quality in production. Procedia CIRP. 2020;93:443-448. DOI: 10.1016/j.procir.2020.04.039
19. Schmitt M. Automated machine learning: AI-driven decision making in business analytics. Intelligent Systems with Applications. 2023;18:200188. DOI: 10.1016/j.iswa.2023.200188
20. Estevez-Velarde S, Gutiérrez Y, Montoyo A, Almeida-Cruz Y. AutoML strategy based on grammatical evolution: A case study about knowledge discovery from text. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics; 2019. pp. 4356-4365
21. Angarita-Zapata JS, Maestre-Gongora G, Fajardo Calderín J. A case study of AutoML for supervised crash severity prediction. In: Joint Proceedings of the 19th World Congress of the International Fuzzy Systems Association (IFSA), the 12th Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT), and the 11th International Summer School on Aggregation Operators (AGOP). Atlantis Press; 2021. pp. 187-194. DOI: 10.2991/asum.k.210827.026
22. Arrieta AB, Díaz-Rodríguez N, Del Ser J, Bennetot A, Tabik S, Barbado A, et al. Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion. 2020;58:82-115. DOI: 10.1016/j.inffus.2019.12.012
23. Xin D, Wu EY, Lee DJL, Salehi N, Parameswaran A. Whither AutoML? Understanding the role of automation in machine learning workflows. In: CHI Conference on Human Factors in Computing Systems (CHI '21), 8–13 May 2021; Yokohama, Japan. New York, NY, USA: ACM; 2021. p. 16. DOI: 10.1145/3411764.3445306
24. Yakovlev A, Moghadam HF, Moharrer A, Cai K, Chavoshi N, Varadarajan V, et al. Oracle AutoML: A fast and predictive AutoML pipeline. Proceedings of the VLDB Endowment. 2020;13(12):3166-3180. DOI: 10.14778/3415478.3415542
25. Feurer M, Klein A, Eggensperger K, Springenberg JT, Blum M, Hutter F. Efficient and robust automated machine learning. In: Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS'15); December 2015. Vol. 2. Cambridge, MA, USA: MIT Press; 2015. pp. 2755-2763
26. Feurer M, Eggensperger K, Falkner S, Lindauer M, Hutter F. Auto-sklearn 2.0: Hands-free AutoML via meta-learning. Journal of Machine Learning Research. 2022;23(1):11936-11996. DOI: 10.5555/3586589.3586850
27. LeDell E, Poirier S. H2O AutoML: Scalable automatic machine learning. In: 7th ICML Workshop on Automated Machine Learning (ICML 2020) [Internet]. Vienna, Austria: International Conference on Machine Learning; 12-18 July 2020. Available from: https://www.automl.org/wp-content/uploads/2020/07/AutoML_2020_paper_61.pdf [Accessed: January 26, 2024]
28. Kotthoff L, Thornton C, Hoos HH, Hutter F, Leyton-Brown K. Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. Journal of Machine Learning Research. 2017;18(25):1-5
29. Olson RS, Bartley N, Urbanowicz RJ, Moore JH. Evaluation of a tree-based pipeline optimization tool for automating data science. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO’16); 20–24 July 2016. NY, USA: ACM; 2016. pp. 485-492
30. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research. 2011;12(85):2825-2830
31. Hutter F, Hoos HH, Leyton-Brown K. Sequential model-based optimization for general algorithm configuration. In: Coello CAC, editor. Learning and Intelligent Optimization. LION 2011. Lecture Notes in Computer Science. Vol. 6683. Berlin, Heidelberg: Springer; 2011. pp. 507-523. DOI: 10.1007/978-3-642-25566-3_40
32. Shahriari B, Swersky K, Wang Z, Adams RP, De Freitas N. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE. 2016;104(1):148-175. DOI: 10.1109/JPROC.2015.2494218
33. Vanschoren J, van Rijn JN, Bischl B, Torgo L. OpenML: Networked science in machine learning. SIGKDD Explorations. 2014;15(2):49-60. DOI: 10.1145/2641190.2641198
34. Caruana R, Niculescu-Mizil A, Crew G, Ksikes A. Ensemble selection from libraries of models. In: Proceedings of the Twenty-First International Conference on Machine Learning (ICML '04); 4–8 July 2004; Banff, Alberta, Canada. New York: Association for Computing Machinery; 2004. p. 18
35. Karnin Z, Koren T, Somekh O. Almost optimal exploration in multi-armed bandits. Proceedings of Machine Learning Research. 2013;28(3):1238-1246
36. H2O Stacked Ensembles [Internet]. 2023. Available from: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/stacked-ensembles.html [Accessed: January 26, 2024]
37. van der Laan MJ, Polley EC, Hubbard AE. Super learner. Statistical Applications in Genetics and Molecular Biology. 2007;6(1):25. DOI: 10.2202/1544-6115.1309
38. Integrate, Analyze and Act on All Data using Autonomous Database [Internet]. 2024. Available from: https://apexapps.oracle.com/pls/apex/r/dbpm/livelabs/view-workshop?wid=889 [Accessed: January 26, 2024]
39. Saini V. Model Evaluation Using Lift and Gain Analysis – Lift and Gain Charts [Internet]. 2022. Available from: https://varshasaini.in/model-evaluation-using-lift-and-gain-analysis-lift-and-gain-charts/ [Accessed: January 26, 2024]
40. OML4Py – AutoML – An Example [Internet]. 2021. Available from: https://oralytics.com/2021/03/15/oml4py-automl-an-example/ [Accessed: January 26, 2024]
41. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17); 4–9 December 2017; Long Beach, California, USA: Curran Associates Inc; 2017. pp. 4768-4777
42. Vishwarupe V, Joshi PM, Mathias N, Maheshwari S, Mhaisalkar S, Pawar V. Explainable AI and interpretable machine learning: A case study in perspective. Procedia Computer Science. 2022;204:869-876. DOI: 10.1016/j.procs.2022.08.105
43. Boitor O, Stoica F, Mihăilă R, Stoica LF, Stef L. Automated machine learning to develop predictive models of metabolic syndrome in patients with periodontal disease. Diagnostics (Basel). 2023;13(24):3631. DOI: 10.3390/diagnostics13243631
44. Auto-sklearn API [Internet]. 2022. Available from: https://automl.github.io/auto-sklearn/master/api.html [Accessed: January 26, 2024]
45. Erickson N, Mueller J, Shirkov A, Zhang H, Larroy P, Li M, et al. AutoGluon-tabular: Robust and accurate AutoML for structured data. In: 7th ICML Workshop on Automated Machine Learning (ICML 2020). Vienna, Austria: International Conference on Machine Learning; 2020. Available from: https://www.automl.org/wp-content/uploads/2020/07/AutoML_2020_paper_7.pdf [Accessed: January 26, 2024]
