Open access peer-reviewed chapter - ONLINE FIRST

Perspective Chapter: Lattice Function-Based Support Vector Machine for Shape-Constrained Classification

Written By

Geng Deng, Yaoguo Xie, Xindong Wang and Qiang Fu

Submitted: 02 March 2024 Reviewed: 10 March 2024 Published: 17 June 2024

DOI: 10.5772/intechopen.1005541


From the Edited Volume

Support Vector Machines - Algorithms, Optimizations, and Real-World Applications [Working Title]

Dr. Robertas Damaševičius


Abstract

Shape-constrained classification is an important and evolving topic within machine learning, offering insights into enhancing model accuracy and interpretability through the integration of shape information from input features. In this paper, we present a novel Lattice Support Vector Machine (Lattice-SVM) classifier, which accommodates user-defined shape constraints, including monotonicity and convexity/concavity. Lattice-SVM constructs a nonparametric nonlinear discriminant hyperplane by integrating lattice functions. We optimize the model parameters using the Pegasos algorithm for SVM, which incorporates stepwise projections to ensure the feasibility of the shape constraints. Through a series of simulation studies and real-world examples, we illustrate how Lattice-SVM enhances classification performance and effectively captures nonlinear effects by leveraging the shape information of input features.

Keywords

  • lattice regression
  • Pegasos SGD algorithm
  • support vector machine
  • shape-constrained classification
  • lattice-SVM

1. Introduction

In the field of machine learning, classification stands out as a widely studied task that uses an object’s features to identify the class to which the object belongs. Real-world classification problems frequently exhibit shape-restricted relationships. This shape information includes characteristics like monotonicity and convexity/concavity between input features and the target variable. For instance, the default probability of a loan typically decreases as a borrower’s credit score (FICO) rises; the incidence of certain diseases may increase with age; cost functions might exhibit concavity with rising prices; and the utility functions of risk-averse agents could be concave. Conventional classifiers, such as Linear Discriminant Analysis (LDA), logistic regression, and Support Vector Machine (SVM), fall short in leveraging this prior shape information to enhance model prediction accuracy, and they can also pose challenges to model interpretability. Recent research on classification problems with monotonicity has gained notable attention. For instance, a comprehensive review of contemporary monotonic classifiers, encompassing monotonic tree methods, SVM, monotonic neural networks (NN), and so on, is provided in [1].

SVM, a fundamental supervised learning tool for classification and regression analysis [2], was initially introduced by Cortes and Vapnik [3]. It continues to hold significance as a robust and widely embraced method, standing as a cornerstone in contemporary machine learning practices. Recent applications of SVM include the integration of the Inception V3 deep convolutional neural network with SVM for human posture detection [4] and the prediction of Chinese tax revenue using least-squares SVM [5]. SVM, in its conventional form, constructs a hyperplane to effectively separate distinct classes within the data. While the traditional SVM employs a linear separating hyperplane, advancements in the field have introduced parametric kernel transformations that create nonlinear hyperplanes. Nevertheless, extant SVM models are confined to parametric formulations and do not account for tailored shape constraints. This limitation, namely the absence of domain-specific shape information, can lead to model overfitting. Thus, a more refined methodology, capable of incorporating monotonic or more generic shape information, is preferred in practical applications.

The concept of monotonic classification has been extensively explored in various research, spanning diverse machine learning methodologies, including k-Nearest Neighbor (k-NN) [6], tree methods [7], SVM [8], Extreme Learning Machine (ELM) [9], and neural networks (NN) [10]. Pioneering this exploration, Best and Chakravarti [11] introduced isotonic regression, as an early instance of shape-restricted inference. Since its inception, researchers have developed numerous methods and algorithms for incorporating monotonic constraints into machine learning models. For example, Gutierrez et al. [12] proposed an ordinal classification/regression to handle target variables with ordered categorical values. Zhu et al. [9] introduced a generalized form of ELM to address monotonic classification problems. Hu et al. [13] contributed a rank entropy-based decision tree method accommodating monotonicity constraints. Additionally, in the realm of finance, Potharst and Feelders [14] presented a monotonically constrained tree method to model house pricing problems. Chen and Samworth [15] discussed incorporating shape-restricted constraints into Generalized Additive Models (GAM). Bartley et al. [16] introduced Partially Monotone SVM (PM-SVM), which provided a more general way of generating monotonicity constraints. Xu et al. [17] enhanced the shape-constrained GAM model using quantile-based knots in mortgage loan default modeling, featuring automatic spline knot selection.

In this paper, we propose a novel Lattice-SVM model, inspired by the widely adopted lattice approach. The proposed Lattice-SVM leverages a lattice-based discriminant hyperplane, accommodating shape constraints such as monotonicity and convexity/concavity tailored for unimodal input data. We formulate a nonlinear discriminant hyperplane using an additive model structure (1-D lattice function) to effectively handle the main feature effects. Optimization of the target function is based on the Pegasos Stochastic Gradient Descent (SGD) algorithm [18, 19]. Furthermore, we employ a sequential projection algorithm to map the solution to shape-constrained feasible regions.

The lattice-based model represents a nonparametric approach, with the lattice being entirely defined by the lattice grid and the values at the grid points. The inclusion of shape constraints serves as a regularization mechanism, augmenting the robustness of the model fitting process. Garcia and Gupta [20] originally introduced lattice regression, a statistical method that has found application across diverse domains [21]. Notably, Google introduced the TensorFlow Lattice (TFL) library [22], a lattice-based predictive model that incorporates shape-constrained features, complementing the renowned TensorFlow library. In our research, the Pegasos algorithm, combined with the sequential projection technique, is motivated by the same conceptual approach as TFL. Deng et al. [23] extended this idea, applying the 1-D lattice transformation to develop a Lattice-LDA classifier capable of handling various monotonicity and convexity/concavity shape information.

The traditional method of incorporating monotonic constraints into classifiers varies depending on the model being used. Tree-based classifiers such as XGBoost [24] and LightGBM [25] maintain monotonic constraints by refraining from creating new splits on features that violate these constraints. Monotonic Extreme Learning Machine (ELM) [9], on the other hand, directly enforces the gradient of the model output with respect to input features to be either positive or negative during the model parameter optimization process. However, many classifiers marketed as “monotonic” do not offer the capability to model second-order convex or concave shape constraints.

Classifiers that incorporate second-order constraints are often categorized as shape-constrained classifiers rather than monotonic classifiers. For instance, the shape-constrained Generalized Additive Model (GAM) [8] utilizes active-set optimization by splining each input feature. However, this method comes with a rigid optimization framework and is limited to accommodating only a few shape types, such as monotonic and convex/concave, while neglecting other more generic shapes like s-shape or mixtures of convex and concave shapes. In contrast, the lattice-based approach (TFL [22], Lattice LDA [23], Lattice-SVM) emerges as highly versatile due to its ability to flexibly model complex shape constraints and interaction effects (via two-dimensional and high-dimensional lattices). Nevertheless, a notable drawback of this method is its computational expense, especially when dealing with large lattice grids. Additionally, specifying the knot locations requires prior knowledge and expertise from the modeler.

We present a summary outlining the pros and cons of various shape-constrained classifiers in Table 1 below.

Classifiers | Pros | Cons
Tree-based methods (XGBoost, LightGBM) | Powerful classification models | Do not model second-order shape constraints
Monotonic ELM | Monotonic classifier based on efficient ELM | Does not model second-order shape constraints
Shape-constrained GAM | 1. Computationally efficient algorithm based on GAM; 2. Automatic knot selection; 3. Models second-order shape constraints | Does not model complex shape constraints such as s-shapes
Lattice-based classifiers (TFL, Lattice LDA, Lattice SVM) | 1. Flexible to model complex shape constraints; 2. Flexible to model interaction effects | 1. Computationally expensive; 2. Requires specifying the knot set

Table 1.

Comparison of shape-constrained classifiers.

The subsequent sections of this paper are structured as follows. Section 2 provides an introduction to the standard SVM theory. Following that, Section 3 delves into a comprehensive discussion of the Lattice-SVM methodology, with the application of the Pegasos algorithm to solve the optimization problem, as detailed in Section 3.3. Sections 4 and 5 present simulation and real-data studies, respectively, demonstrating that the Lattice-SVM methodology produces superior classification results when compared to other classifiers, including the standard SVM and tree methods. Finally, in Section 6, we present the conclusion of this research, along with a discussion of future research directions.


2. Standard linear SVM

In this section, we provide a brief overview of notations and algorithms for the standard linear SVM solver.

The training dataset is denoted as $\{(x_l, y_l)\}$, $l = 1, 2, \dots, N$. Each object $x_l$ includes $d$ features, $x_l \in \mathbb{R}^d$, and each label $y_l$ is selected from a two-class set, $y_l \in \{-1, 1\}$. The standard SVM aims to construct a linear separating hyperplane

$$\{x : f(x) = x^T\beta + \beta_0 = 0\} \tag{E1}$$

with the objective to separate the data into two classes. The coefficients $\beta$ and $\beta_0$ represent the model parameters to be estimated, and the label of a new point $x_0$ is predicted as $\hat{y} = \operatorname{sign}(f(x_0))$.

Formally, the linear SVM is formulated as an optimization problem that maximizes the half-width $1/\|\beta\|$ of the band region defined by the two shifted hyperplanes $\{x : f_{\pm}(x) = x^T\beta + \beta_0 = \pm 1\}$. Refer to Figure 1 for an illustration of the separating band of a linear SVM classifier. In situations where the data are non-separable, SVM introduces slack variables $\xi = (\xi_1, \xi_2, \dots, \xi_N)$ and a penalty hyperparameter $C$ that weights a new penalizing error term in the objective function. Notably, $\xi_l$ becomes positive whenever the data point $x_l$ violates the margin, as depicted in Figure 1.

Figure 1.

Illustration of linear SVM.

$$\min_{\beta, \beta_0} \; \frac{1}{2}\|\beta\|^2 + C\sum_{l=1}^{N}\xi_l \tag{E2}$$
$$\text{subject to:} \quad \xi_l \geq 0, \quad y_l(x_l^T\beta + \beta_0) \geq 1 - \xi_l, \quad \forall l$$

SVM is typically solved efficiently through its Lagrange dual problem, which is effectively a constrained quadratic programming problem. Alternatively, another class of methods directly optimizes the primal problem using gradient descent type algorithms. In this case, the primal problem requires a slight modification to transform it into an unconstrained optimization problem with a hinge loss function

$$\min_{\beta, \beta_0} F(\beta, \beta_0) = \frac{\lambda}{2}\|\beta\|^2 + \frac{1}{N}\sum_{l=1}^{N}\max\left(0,\, 1 - y_l(x_l^T\beta + \beta_0)\right) \tag{E3}$$

Here, the new parameter $\lambda = 1/(NC)$ is a transformation of $C$ and likewise serves as a regularization parameter for misclassifications. In our study, the primary algorithm is a subgradient descent method, the Pegasos algorithm [18, 19]. Since the hinge loss function is non-differentiable, each iteration of the Pegasos algorithm utilizes a subgradient of the primal objective function.
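For illustration, a minimal Python sketch of the plain Pegasos update for the primal objective (E3) is given below, omitting the bias term and the shape constraints introduced later; the function and parameter names are illustrative, and this is not the MATLAB implementation used in this chapter.

```python
import numpy as np

def pegasos_linear_svm(X, y, lam=0.01, batch_size=32, n_iter=1000, seed=0):
    """Plain Pegasos: stochastic subgradient descent on the hinge-loss primal objective (E3)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    beta = np.zeros(d)
    for t in range(1, n_iter + 1):
        idx = rng.choice(N, size=batch_size, replace=False)      # mini-batch A_t
        Xb, yb = X[idx], y[idx]
        viol = yb * (Xb @ beta) < 1                               # points with a nonzero hinge-loss subgradient
        grad = lam * beta - (yb[viol][:, None] * Xb[viol]).sum(axis=0) / batch_size
        beta -= (1.0 / (lam * t)) * grad                          # step size gamma_t = 1 / (lambda * t)
    return beta
```

A new observation x0 would then be classified as sign(x0 @ beta).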


3. Lattice-SVM for shape-constrained classification

We propose an innovative Lattice-SVM methodology that extends the conventional linear hyperplane to a lattice-based hyperplane, accommodating the shape constraints. The fundamental concept involves constructing the hyperplane through additive modules comprised of 1-D lattice functions.

Recall that in the standard SVM, the linear separating hyperplane is defined as:

$$\left\{x : f(x) = x^T\beta + \beta_0 = \sum_{i=1}^{d} x_i\beta_i + \beta_0 = 0\right\}, \tag{E4}$$

where d is the number of features.

In Lattice-SVM, we introduce a nonlinear function, f, defined as an additive form of the main effects, akin to the GAMI-Net concept in [26, 27], where the authors proposed a complex xNN structure and lattice modules for the main effects.

In the new Lattice-SVM model, the discriminant hyperplane is given by:

$$f(x) = \sum_{i=1}^{d} r_i(x_i) + \beta_0 \tag{E5}$$

where $r_i(x_i)$ is the main effect component of feature $x_i$, and $\beta_0$ is the bias term (intercept). For model identification purposes, each $r_i$ is assumed to have zero mean over the empirical distribution of the training data.

In this study, we primarily investigate five types of shape constraints, including linear, monotone increasing/decreasing, and convex/concave, as summarized in Table 2. In Lattice-SVM, the model is component-wise shape-restricted, implying that each main effect module $r_i(x_i)$ adheres to a specific shape constraint. These five shapes represent simple and fundamental types, readily expandable to more complex combinations such as convex increasing/decreasing or S-shapes.

Shape # | Shape type | Shape label
1 | Linear | l
2 | Monotone increasing | in
3 | Monotone decreasing | de
4 | Convex | cvx
5 | Concave | ccv

Table 2.

Supported shape constraints.

In the following subsections, we will elaborate on the process of constructing the 1-D function $r_i(x_i)$ by applying the lattice transformation. Following this, we will present the formulation of the Lattice-SVM optimization problem and solve it using the Pegasos algorithm.

3.1 Lattice transformation

The concept of lattice transformation was originally introduced in a sequence of papers on Lattice regression [20]. The lattice regression model is essentially interpolating a parametric function on a regular grid of knots. Once the function values on the grid knots are defined, the function within the lattice cell is calculated simply by taking the weighted average of the vertex values of the cell. The lattice transformation maps the original data to a higher dimensional sparse matrix of weights. Instead of directly processing the source data, the transformed weight matrix effectively becomes the new model input. Google’s TFL also adopts the lattice transformation and the 1-D lattice function as a fundamental model component.

The lattice transformation is designed to address generic high-dimensional problems. It is important to note that constructing a high-dimensional lattice over $(x_1, x_2, \dots, x_d)$ is computationally expensive. In our Lattice-SVM model, we mitigate this challenge by applying a 1-D lattice transformation to each function $r_i(x_i)$ in Eq. (E5).

3.1.1 1-D lattice transformation of $r_i(x_i)$

For the main effect $r_i(x_i)$, it is assumed there are $K_i$ knots in the $x_i$ dimension, where the knots are defined as

$$X_{1,i} < X_{2,i} < \cdots < X_{K_i,i} \tag{E6}$$

Here $X_{1,i}$ and $X_{K_i,i}$ represent the minimum and maximum bounds of the domain of $x_i$. Although the placement of the knots is customizable, it is typically set at (equally spaced) quantiles of the data and remains unchanged throughout the model estimation. Consequently, each 1-D lattice of $x_i$ is fully defined by $K_i$ knot values, or $K_i$ parameters.
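For instance, a quantile-based knot set for one feature could be constructed as in the following sketch (the function name is illustrative; the 0.1-0.9 percentiles plus the min/max bounds mirror the configuration used later in the real-data examples):

```python
import numpy as np

def quantile_knots(x, n_inner=9):
    """Place knots at equally spaced quantiles of x, plus the domain bounds."""
    inner = np.quantile(x, np.linspace(0.1, 0.9, n_inner))   # equally spaced percentiles
    knots = np.concatenate(([x.min()], inner, [x.max()]))
    return np.unique(knots)                                   # sort and drop duplicate knots
```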

The 1-D lattice is essentially a piecewise linear function.1 Without loss of generality, we omit the subscript $i$ in this subsection, simplifying the function form to $r(x)$. We describe how to approximate the function $r(x)$ using an interpolation of look-up table values on the knots. Figure 2 provides an illustration of the piecewise linear, or 1-D lattice, approximation.

Figure 2.

Piecewise linear construction of $r(x)$ (equivalent to a one-dimensional lattice).

The predetermined $K$ knots $X_1, X_2, \dots, X_K$ define $K - 1$ buckets over the entire domain of $x$. Suppose $x$ is a data point in the bucket $[X_k, X_{k+1}]$; it can be expressed as the weighted average of the preceding and subsequent knots

$$x = w_k X_k + w_{k+1} X_{k+1} \tag{E7}$$

where the linear weights are simply computed as

$$w_k = \frac{X_{k+1} - x}{X_{k+1} - X_k} \quad \text{and} \quad w_{k+1} = 1 - w_k \tag{E8}$$

Let the parameter vector $\beta$ hold the knot values of the piecewise linear function, that is, $\beta_k = r(X_k)$. The values of $\beta$ regulate the shape of the piecewise linear function. The function value $r(x)$ is therefore the weighted average of the knot values

$$r(x) = w_k\beta_k + w_{k+1}\beta_{k+1} \tag{E9}$$

To ensure consistency across all buckets, we introduce a weight vector $\phi(x)$ of length $K$ for any given $x$, with zero weights in the other buckets

$$\phi(x) = (0, \dots, 0, w_k, w_{k+1}, 0, \dots, 0)^T \tag{E10}$$

such that

$$x = \phi(x)^T (X_1, X_2, \dots, X_K)^T \tag{E11}$$

and define the piecewise linear function $r(x)$ over the entire domain as

$$r(x) = \phi(x)^T\beta \tag{E12}$$

The weight vector $\phi(x)$ is the lattice transformation, or mapping function.
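As a concrete illustration, the following sketch computes the 1-D lattice transformation $\phi(x)$ for a given knot set and evaluates $r(x) = \phi(x)^T\beta$ as in Eq. (E12). The function and variable names are illustrative, inputs are assumed to lie within (or be clipped to) the knot range, and this is not the chapter's MATLAB implementation.

```python
import numpy as np

def lattice_transform(x, knots):
    """Map a scalar x to the sparse weight vector phi(x) over the knots (E10)."""
    knots = np.asarray(knots, dtype=float)
    K = len(knots)
    x = float(np.clip(x, knots[0], knots[-1]))            # keep x within the lattice domain
    k = int(np.searchsorted(knots, x, side="right")) - 1  # bucket index for [X_k, X_{k+1}]
    k = min(k, K - 2)                                      # x equal to the last knot falls in the last bucket
    w_k = (knots[k + 1] - x) / (knots[k + 1] - knots[k])   # linear weight (E8)
    phi = np.zeros(K)
    phi[k], phi[k + 1] = w_k, 1.0 - w_k
    return phi

# Example: evaluate r(x) = phi(x)^T beta (E12) for a small knot set and look-up values beta
knots = [0.0, 0.25, 0.5, 0.75, 1.0]
beta = np.array([0.0, 0.1, 0.4, 0.7, 1.0])   # monotone increasing knot values
r_x = lattice_transform(0.6, knots) @ beta   # piecewise linear interpolation at x = 0.6
```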

Note that the vector $\beta$ defines the look-up table values at each knot, $r(X_k) = \beta_k$. Making the function $r(x)$ satisfy domain-knowledge shape constraints is therefore realized by restricting the values of $\beta$. For instance, ensuring a monotone increasing piecewise linear function $r(x)$ necessitates that $\beta$ be monotone increasing, expressed as:

$$\beta_k \leq \beta_{k+1}, \quad k = 1, 2, \dots, K-1 \tag{E13}$$

The ordering constraint on $\beta$ ensures monotonicity of the piecewise linear function. Similar adjustments to the values in $\beta$ can be made to satisfy more intricate shape constraints, such as convex or concave shapes. The utilization of the knot values in $\beta$ to control the function's shape is a distinctive feature of the lattice transformation.

3.2 Solving the optimization problem

This subsection describes the methodology to address the optimization problem. Let the parameter set $\{\beta_1, \dots, \beta_d\}$ collect the parameters of the modules $i = 1, 2, \dots, d$, and let $\beta_0$ be the bias term (intercept). In the proposed Lattice-SVM model, the discriminant hyperplane is defined as:

$$f(x; \beta) = \sum_{i=1}^{d} r_i(x_i; \beta_i) + \beta_0 \tag{E14}$$

Applying the lattice transformation to the input $x$ yields $\phi(x)$, a sparse vector of weights (a matrix when stacked over observations), so the original input with $1 + d$ dimensions is transformed into a much larger dimension: $1 + \sum_{i=1}^{d} K_i$ covariates. Given the hyperplane defined by $\{x : f(x; \beta, \beta_0) = 0\}$, the Lattice-SVM objective function is the primal objective $F(\beta, \beta_0)$ in Eq. (E3), and Lattice-SVM requires the additional shape constraints $g_i(\beta_i) \leq 0$, $i = 1, \dots, d$.

The optimization problem setup of the new shape-constrained Lattice-SVM is therefore formulated as a new constrained optimization problem:

$$\arg\min_{\beta, \beta_0} F(\beta, \beta_0) \tag{E15}$$
$$\text{subject to:} \quad g_i(\beta_i) \leq 0, \quad i = 1, \dots, d$$

3.3 The Pegasos algorithm: stochastic subgradient descent for SVM

In this section, we provide a brief overview of the main structure of the Pegasos algorithm and delve into the implementation details.

For illustration purposes, we first omit the bias term $\beta_0$. At iteration $t$ of the algorithm, one selects a random training sample $(x_{l_t}, y_{l_t})$ by picking an index $l_t \in \{1, \dots, N\}$, or a random mini-batch of indexes $A_t$. The objective function (E3) is then approximated as

$$F(\beta; A_t) = \frac{\lambda}{2}\|\beta\|^2 + \frac{1}{k}\sum_{l \in A_t}\max\left(0,\, 1 - y_l x_l^T\beta\right) \tag{E16}$$

where the $k$ indexes are selected at each iteration as a random subset $A_t \subseteq \{1, \dots, N\}$ with $|A_t| = k$. As noted, $k = N$ makes the optimization equivalent to the original problem (E3). The subsamples in this case are referred to as mini-batches.

Given the subsample set $A_t$, the subgradient of the objective function is derived as

$$\nabla F_t = \lambda\beta - \frac{1}{k}\sum_{l \in A_t}\mathbb{1}\left[y_l x_l^T\beta < 1\right]\, y_l x_l, \tag{E17}$$

where $\mathbb{1}[\cdot]$ is the indicator function.

As mentioned earlier, the original SVM problem (E3) typically includes a bias (intercept) term, $\beta_0$, which is a scalar parameter. The bias term is essential when the distribution of the target $y$ is uneven, a common occurrence in practice. However, simply adding the bias parameter $\beta_0$ to the model makes the objective function non-strictly convex, diminishing the efficiency of the algorithm.

Several approaches have been proposed in the original paper to treat the additional bias parameter $\beta_0$. One method involves introducing an optimization subproblem for $\beta_0$ at each iteration, given $\beta$. This effectively redefines the objective function as

$$\min_{\beta} F(\beta; A_t) = \frac{\lambda}{2}\|\beta\|^2 + G(\beta; S), \tag{E18}$$

where, denoting the full training sample by $S$,

$$G(\beta; S) = \min_{\beta_0} \frac{1}{N}\sum_{l=1}^{N}\max\left(0,\, 1 - y_l(x_l^T\beta + \beta_0)\right) \tag{E19}$$

At each iteration, a subgradient of $F(\beta; A_t)$ needs to be found, and a steepest descent step on $\beta$ is performed. Solving the optimization subproblem for $\beta_0$ is a generalized weighted median problem, but this adaptation is effective only for $k = N$.

We outline the main structure of the Pegasos algorithm as follows.

Algorithm 1: Pegasos algorithm with projection

Initialize $\beta^{(1)} = 0 \in \mathbb{R}^d$

for each iteration $t = 1, \dots, T$ do

  Select $A_t \subseteq \{1, \dots, N\}$, where $|A_t| = k$, uniformly at random.

  Optimize $\beta_0$ in the optimization subproblem (E19) given $\beta^{(t)}$.

  Set $\gamma_t = \frac{1}{\lambda t}$ and $A_t^+ = \{l \in A_t : y_l x_l^T\beta^{(t)} < 1\}$. Update the parameter vector by applying the projection algorithm

$$\beta^{(t+1)} = \operatorname{proj}\left((1 - \gamma_t\lambda)\,\beta^{(t)} + \frac{\gamma_t}{k}\sum_{l \in A_t^+} y_l x_l\right), \quad \text{s.t. } g_i(\beta_i) \leq 0 \tag{E20}$$

  Stop if convergence is obtained, or the maximum number of iterations $T$ is reached.

end for

The constraints $g_i(\beta_i) \leq 0$ are shape constraints specifying the shape of each individual feature. We introduce a stepwise projection function $\operatorname{proj}$ in the Pegasos algorithm to accommodate the shape constraints of the Lattice-SVM problem (E15). As mentioned before, Pegasos with projection is also motivated by the algorithm implemented in TFL [22].
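To make the projected update (E20) concrete, the following sketch performs a single Pegasos step on lattice-transformed features, assuming the lattice weights of all features are stacked into one design matrix and that a generic per-feature project routine (for instance the monotone or convex projections described below) is available. The names are illustrative, the bias term is omitted, and this is an assumed arrangement of the computation rather than the authors' code.

```python
import numpy as np

def pegasos_step(beta, X_phi, y, batch_idx, t, lam, blocks, project):
    """One projected Pegasos update (E20) on lattice-transformed features.

    X_phi   : (N, sum(K_i)) matrix of concatenated lattice weights phi_i(x_i)
    blocks  : list of ((start, end), shape) pairs, one column range per feature
    project : callable mapping a feature's knot-value block to its shape-feasible version
    """
    gamma_t = 1.0 / (lam * t)
    Xb, yb = X_phi[batch_idx], y[batch_idx]
    viol = yb * (Xb @ beta) < 1                                # A_t^+: margin violators in the mini-batch
    grad_term = (yb[viol][:, None] * Xb[viol]).sum(axis=0) / len(batch_idx)
    beta_new = (1 - gamma_t * lam) * beta + gamma_t * grad_term
    for (start, end), shape in blocks:                         # project each feature's knot values
        beta_new[start:end] = project(beta_new[start:end], shape)
    return beta_new
```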

In this research, we consider the following two basic types of projection algorithms designed to handle monotone shapes and convex/concave shapes:

  • Monotone increasing or decreasing shapes

The details of the algorithm are presented below. Figure 3 illustrates an example of the projection for a monotone increasing shape constraint, obtained by adjusting the step differences so that $\Delta\beta_k \geq 0$.

Figure 3.

Projection for a monotone increasing solution.

Algorithm 2: Monotone projection algorithm $\hat{\beta} = \operatorname{proj}(\beta)$.

Given a vector $\beta = (\beta_1, \beta_2, \dots, \beta_K)$.

Compute the step differences between the values

$$\Delta\beta_k = \beta_{k+1} - \beta_k, \quad k = 1, 2, \dots, K-1 \tag{E21}$$

Update and output $\hat{\beta}$

$$\hat{\beta}_{k+1} = \beta_1 + \sum_{i=1}^{k} \min/\max(\Delta\beta_i, 0), \quad k = 1, 2, \dots, K-1 \tag{E22}$$

with $\hat{\beta}_1 = \beta_1$; "max" is applied to the monotone increasing shape and "min" to the monotone decreasing shape.

  • Convex or concave shapes

The algorithm for projecting a vector onto a convex (or concave) shape is based on Dykstra's alternating projections algorithm. See Algorithm 3 for detailed steps.

Algorithm 3: Convex and concave projection algorithm $\hat{\beta} = \operatorname{proj}(\beta)$.

Given a vector $\beta = (\beta_1, \beta_2, \dots, \beta_K)$

Compute the step differences between the values

$$\Delta\beta_k = \beta_{k+1} - \beta_k, \quad k = 1, 2, \dots, K-1 \tag{E23}$$

Compute the step differences of the knot locations

$$\Delta X_k = X_{k+1} - X_k, \quad k = 1, 2, \dots, K-1 \tag{E24}$$

Update $\Delta\hat{\beta}$ in two steps

$$\Delta\hat{\beta}_k = \min/\max\left(\Delta\beta_k,\; \Delta X_k\,\frac{\Delta\beta_k + \Delta\beta_{k+1}}{\Delta X_k + \Delta X_{k+1}}\right), \quad k = 1, 3, \dots \tag{E25}$$
$$\Delta\hat{\beta}_k = \max/\min\left(\Delta\beta_k,\; \Delta X_{k+1}\,\frac{\Delta\beta_k + \Delta\beta_{k+1}}{\Delta X_k + \Delta X_{k+1}}\right), \quad k = 2, 4, \dots \tag{E26}$$

where "min and max" are applied for the convex shape, and "max and min" for the concave shape.

Output $\hat{\beta}$

$$\hat{\beta}_{k+1} = \beta_1 + \sum_{i=1}^{k} \Delta\hat{\beta}_i, \quad k = 1, 2, \dots, K-1 \tag{E27}$$

In practice, when applying Dykstra's alternating projection algorithm, multiple update rounds are required at each iteration. As noted by the TFL authors [22], while Dykstra's algorithm provides the proper projection with respect to the L2 norm, it approaches the feasible set from the "wrong" side, so approximate projections strictly into the feasible space are needed. To ensure the values fall into the feasible region defined by the constraints, the authors set the default number of projection rounds to eight.

In our experience, about 10 rounds of Dykstra's updates provide good approximations for convex/concave shapes. In the subsequent simulation and real-data analyses, we use a large default value (50) as the maximum number of update rounds.
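In Python, the two projections could be sketched as follows: monotone_project follows Algorithm 2 directly, while convex_project applies repeated rounds of pairwise slope pooling in the spirit of Algorithm 3 (an illustrative approximation, not Dykstra's algorithm with correction terms and not the authors' exact implementation). In the Pegasos sketch above, the project callable would dispatch to one of these routines depending on the declared shape of each feature.

```python
import numpy as np

def monotone_project(beta, increasing=True):
    """Algorithm 2: clip the step differences at zero and rebuild the knot values."""
    beta = np.asarray(beta, dtype=float)
    d = np.diff(beta)
    d = np.maximum(d, 0.0) if increasing else np.minimum(d, 0.0)
    return np.concatenate(([beta[0]], beta[0] + np.cumsum(d)))

def convex_project(beta, knots, convex=True, rounds=50):
    """Pool adjacent slopes until they are nondecreasing (convex) or nonincreasing
    (concave), alternating over odd- and even-indexed pairs of buckets."""
    beta = np.asarray(beta, dtype=float)
    dx = np.diff(np.asarray(knots, dtype=float))
    d = np.diff(beta)
    for _ in range(rounds):
        for start in (0, 1):                                   # odd pairs, then even pairs
            for k in range(start, len(d) - 1, 2):
                s_k, s_k1 = d[k] / dx[k], d[k + 1] / dx[k + 1]
                violated = s_k > s_k1 if convex else s_k < s_k1
                if violated:                                   # replace both slopes by their pooled value
                    s = (d[k] + d[k + 1]) / (dx[k] + dx[k + 1])
                    d[k], d[k + 1] = s * dx[k], s * dx[k + 1]
    return np.concatenate(([beta[0]], beta[0] + np.cumsum(d)))
```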


4. Simulation studies

We created two simulation examples to demonstrate the performance of Lattice-SVM. In both examples, the target labels $y \in \{-1, 1\}$ were simulated based on two-dimensional nonlinear separating planes $f(x_1, x_2)$, subject to complex shape constraints. The Lattice-SVM classifier was implemented in MATLAB, and the simulation tests were conducted on a computer equipped with an Intel i7-10850H processor and 32 GB of RAM.

  1. The first example applies a convex function on the domain $x_1, x_2 \in [0, 1]$. The separating plane is defined as

    $$f_1(x_1, x_2) = r_1(x_1) + r_2(x_2) = (x_1 - 0.5)^2 + (x_2 - 0.5)^2, \tag{E28}$$

    which is "convex" in both $x_1$ and $x_2$.

  2. The second example uses the same domain $x_1, x_2 \in [0, 1]$ and defines the separating plane as

    $$f_2(x_1, x_2) = r_1(x_1) + r_2(x_2) = \log(4x_1 + 0.05) + \log(x_2 + 0.05), \tag{E29}$$

which is "concave increasing" in both $x_1$ and $x_2$; the value 0.05 is added inside the log function to prevent negative infinity when either $x_1$ or $x_2$ is 0.

When configuring Lattice-SVM, the knot set consists of equally spaced percentiles. For the first example, we used 99 knots, spanning percentiles 0.01-0.99, and for the second example, we used 19 knots, covering percentiles 0.05-0.95, alongside the minimum and maximum points of each feature. We generated 5000 samples, drawing each label from a binomial (Bernoulli) distribution whose success probability is derived from the underlying function $f(x_1, x_2)$, and set the parameter $\lambda$ to 0.01.
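As an illustration of how such data could be generated for the first example, the sketch below rescales $f_1$ to $[0, 1]$ and uses it as a Bernoulli success probability; the exact normalization is an assumption on our part, since the chapter does not spell it out.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x1, x2 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)

f = (x1 - 0.5) ** 2 + (x2 - 0.5) ** 2            # separating function f1 (E28)
p = (f - f.min()) / (f.max() - f.min())          # assumed rescaling of f1 to a probability in [0, 1]
y = np.where(rng.uniform(0, 1, n) < p, 1, -1)    # binomial (Bernoulli) draw mapped to {-1, +1} labels
X = np.column_stack([x1, x2])                    # model inputs for Lattice-SVM
```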

Figures 4 and 5 present the true underlying functions compared with the lattice functions recovered by Lattice-SVM. In both examples, we normalized the probabilities $f_1$ and $f_2$ to the range $[0, 1]$ and scaled the Lattice-SVM-fitted curves to match the range of the actual functions $r_1$ and $r_2$.

Figure 4.

(a) Theoretical form of the separating plane $f_1(x_1, x_2) = (x_1 - 0.5)^2 + (x_2 - 0.5)^2$; (b) 2-D Lattice-SVM spline approximation of $f_1$.

Figure 5.

(a) Theoretical form of the separating plane $f_2(x_1, x_2) = \log(4x_1 + 0.05) + \log(x_2 + 0.05)$; (b) 2-D Lattice-SVM spline approximation of $f_2$.

From the 2-D plots in both examples, it is evident that Lattice-SVM can accurately capture the underlying function and its shape, outperforming conventional linear classifiers. Further enhancement of model fit can be achieved by increasing the sample size or employing denser knot sets, albeit at the expense of additional computational iterations. However, this approach may also lead to overfitting with an excessive number of knots. Drawing from our simulation and real-data analyses, we find that employing approximately 10 to 20 knots provides Lattice-SVM with satisfactory performance.


5. Real world data analysis

Next, we applied Lattice-SVM to real-world data to evaluate its performance. We compared the prediction accuracy of Lattice-SVM against several commonly used classifiers, including SVM (linear), SVM (Gaussian kernel), a classification tree, a classification tree with Adaptive Boosting [28], and PM-SVM [16].

The datasets were sourced from the UCI Machine Learning Repository [29]. We selected four datasets: the Wisconsin Breast Cancer (WBC), the Abalone, the Auto MPG, and the Car Evaluation data. The data statistics are summarized in Table 3, with all entries containing missing values removed. To configure the classification problem, we restricted the target to two classes (1 and −1) for each dataset. Specifically, for the WBC data, the target class "malignant" was set to 1 and "benign" to −1. For the Abalone data, the target "Rings" is a continuous variable ranging from 1 to 29; we split it at a threshold of 10, where Rings ≥ 10 was set to 1 and Rings < 10 to −1. For Auto MPG, we set the target mpg ≥ 20 to 1 and mpg < 20 to −1. For the Car Evaluation data, the target class "unacc" (unacceptable) was set to −1, and the other three classes, "acc" (acceptable), "good," and "vgood" (very good), to 1. It is also noted that when we applied PM-SVM, the constraint set was generated based on the Randomized Conjunctive (CJ1) algorithm [8].

Dataset | # Obs | # Features | # Original Classes | Source
Wisc Breast Cancer (WBC) | 683 | 9 | 2 | UCI
Abalone | 4177 | 7 | Continuous | UCI
Auto MPG | 392 | 7 | Continuous | UCI
Car | 1728 | 6 | 4 | UCI

Table 3.

Data description in the real-world examples.

In our analysis, each example underwent 10-fold cross-validation. Each validation round utilized 90% of the source data as the training set and the remaining 10% as the test set. The same training and testing data were used for each cross-validation partition when evaluating other classifiers. It is noted that when employing a tree with Adaptive Boosting, we selected 100 learning cycles for each training dataset. To ensure a fair comparison among input features, we standardized each feature to have a zero mean and a standard deviation of one before optimizing Lattice-SVM. Subsequently, we transformed the results back to their original scale. Regarding knot selection, we utilized nine knots for all examples, spanning percentiles 0.1 to 0.9, in addition to the minimum and maximum points for each feature.

We evaluated model performance by computing the average accuracy across the test datasets obtained from cross-validation. In each iteration, accuracy was determined as the proportion of correctly classified observations (true positives plus true negatives over the total), calculated as one minus the mean misclassification rate under the 0/1 loss function. The results, including both the average accuracy and its standard deviation, are presented in Table 4.

Dataset | Lattice-SVM | SVM | SVM (Gaussian) | Tree | Tree (AdaBoost) | PM-SVM
WBC | 0.9723 ± 0.0210 | 0.9664 ± 0.0206 | 0.9664 ± 0.0181 | 0.9253 ± 0.0357 | 0.9605 ± 0.0217 | 0.9605 ± 0.0265
Abalone | 0.7556 ± 0.0180 | 0.7529 ± 0.0198 | 0.6818 ± 0.0479 | 0.7046 ± 0.0147 | 0.7539 ± 0.0182 | 0.7548 ± 0.0188
Auto MPG | 0.9158 ± 0.0380 | 0.9132 ± 0.0367 | 0.9106 ± 0.0423 | 0.8928 ± 0.0266 | 0.9031 ± 0.0411 | 0.9055 ± 0.0384
Car | 0.8941 ± 0.0273 | 0.8860 ± 0.0177 | 0.9769 ± 0.0105 | 0.9798 ± 0.0106 | 0.9514 ± 0.0132 | 0.9693 ± 0.0086

Table 4.

Classification performance based on average accuracy and standard deviation of 0/1 loss.

For the WBC example, we assigned 1 (orange) to the target "class = malignant" and −1 (blue) to "class = benign." The data contained approximately 35% of records with the attribute = 1. In this example, we selected Clump Thickness to be "convex increasing," Cell Shape Uniformity to be "monotone increasing," Bland Chromatin to be "concave increasing," and the remaining features to have "linear" shapes. The fitted results in Table 4 indicate that the Lattice-SVM model achieves an average prediction accuracy of approximately 0.9723, outperforming popular nonlinear classifiers such as SVM (with Gaussian kernel), PM-SVM, and decision trees. Moreover, the bar plot of each feature and its marginal approximation function $\hat{r}_i(x_i)$ are shown in Figures 6 and 7, respectively. Notably, we observe that for Clump Thickness, the slope becomes steeper when it exceeds 8, while for Bland Chromatin, the slope gets steeper when it falls below 4, offering enhanced fitting and insights into the relationship between the probability of breast cancer and the input variables.

Figure 6.

Bar plot of each feature in the WBC example: Orange: “malignant” (1), Blue: “benign” (−1).

Figure 7.

Plot of each marginal approximation function $\hat{r}_i(x_i)$ in the WBC example.

For the Abalone example, the target “Rings” (Rings plus 1.5 gives the age in years) indicates the age of abalone. We assigned 1 (orange) to the target “Rings ≥10” and − 1 (blue) to “Rings <10;” about 50% of the records have the attribute = 1. The dataset contains seven input features, including length (longest shell measurement), diameter (perpendicular to length), height (with meat in shell), whole weight (whole abalone), shucked weight (weight of the meat), viscera weight (gut weight after bleeding), and shell weight (after being dried). We notice that several input features exhibit high correlations, potentially causing multicollinearity issues in the additive model structure. To mitigate this, we removed input features with correlation greater than 0.9, retaining only three features: height, viscera weight, and shell weight. In this example, we set shell weight to be “concave increasing” and the other two to be “linear.” As shown in Table 4, specifying the shape constraint solely for shell weight markedly enhances predictive accuracy, enabling the Lattice-SVM model to outperform SVM, PM-SVM, and decision tree methods. Moreover, from Figures 8 and 9, it is evident that each input feature exhibits a positive correlation with age (Rings), with the slope for shell weight steepening when it falls below 0.3 and leveling off when it exceeds 0.5.

Figure 8.

Bar plot of each feature in the Abalone example: Orange: “Rings ≥10” (1), Blue: “Rings <10” (−1).

Figure 9.

Plot of each marginal approximation function $\hat{r}_i(x_i)$ in the Abalone example.

For the Auto MPG example, as mentioned above, the target "mpg" was split at 20. The dataset contains seven input features: cylinders, displacement, horsepower, weight, acceleration, model year, and origin. Figures 10 and 11 show the distribution of each feature and its marginal approximation function, respectively. From the plots, one can easily identify the shape information; in this analysis, we set cylinders, weight, and model year to have "linear" shapes; displacement and horsepower to be "convex decreasing;" and acceleration and origin to be "monotone increasing." From Table 4, the Lattice-SVM model has an average prediction accuracy of approximately 0.9158 and outperforms the other classifiers. Moreover, it can be seen that for displacement, the slope is steeper when it is less than 200, and similarly for horsepower, the slope is steeper when the value is less than 150. For acceleration, the slope is steeper between 13 and 17 than outside this range.

Figure 10.

Distribution of each feature in the Auto MPG example.

The fourth example is the Car Evaluation data. The target is to evaluate a car's condition as unacceptable, acceptable, good, or very good. We note that in this example, even with appropriate shape information, the Lattice-SVM method reaches an average prediction accuracy of 0.8941 and only performs better than linear SVM, while SVM with the Gaussian kernel and the tree methods have much higher accuracy. Given the sub-optimal performance of Lattice-SVM in this case, the marginal distributions and the approximation functions are not presented. One potential reason is that Lattice-SVM still follows the GAM structure and may not capture nonlinear relationships involving interaction effects or other complex structures as well as trees or neural networks can. This points to some limitations of the Lattice-SVM method and the GAM structure, which will be discussed further.

Figure 11.

Plot of each marginal approximation function $\hat{r}_i(x_i)$ in the Auto MPG example.

An advantage of Lattice-SVM observed in the real-data analysis is its interpretability. Upon specifying the shape constraints, users can readily visualize the fitted marginal effect of each input feature through plots, facilitating a more intuitive understanding of the relationship between inputs and targets. It is worth noting that while employing more knots may yield more granular fitted results, it demands greater computational resources and may lead to overfitting. Based on our experience, we recommend using 10 to 20 knots for the Lattice-SVM model to achieve both higher accuracy and computational efficiency. Additionally, it is also possible to add interaction effects, modeled by a high-dimensional lattice. However, this could substantially increase the number of parameters to optimize, making the optimization solver computationally expensive.

While the choice between convex and concave may seem intriguing at first, incorrectly selecting the opposite type (e.g., using a convex constraint for concave data) can cause the algorithm to degenerate to a straight line, with no effective transformation at any knot (provided a sufficient amount of data). Therefore, careful consideration and understanding of the data's characteristics are essential when choosing the appropriate shape constraint.

Pegasos is a stochastic subgradient algorithm, and its behavior depends on the stochastic index-set selection and the accuracy of the subgradient calculation. Generally speaking, its computational efficiency is not as good as that of the solvers for traditional linear and kernel-based SVM, which apply quadratic programming algorithms to the SVM dual problem. This is observed even for the standard Pegasos algorithm. As for the projection algorithms embedded in the main Pegasos loop, they typically produce robust shape-constrained mappings in our experiments.


6. Conclusions

In this paper, we have introduced a new Lattice Support Vector Machine (Lattice-SVM) classifier, designed to incorporate user-defined shape constraints such as monotonicity and convexity/concavity, thereby enhancing model accuracy and interpretability. Employing an additive form of lattice functions, the Lattice-SVM method effectively addresses nonlinear problems. Model optimization utilizes the Pegasos algorithm for SVM, integrating customized stepwise projections to enforce the specified shape constraints.

By incorporating appropriate shape constraints, users can refine model fitting and reduce prediction bias. Our research demonstrates the efficacy and robustness of the Lattice-SVM model through simulation studies and real-world examples. Notably, this method exhibits greater stability in the presence of random errors or noise in the data while maintaining model interpretability compared to other machine learning tools such as SVM or decision trees. The incorporation of shape information into the model fitting process facilitates a better understanding of the underlying relationship between specific inputs and the target variable.

Upon modification, Lattice-SVM can be easily extended to handle multi-class problems and regression with shape constraints. Let $M$ be the number of classes; one common technique involves creating $M$ one-versus-rest classifiers and selecting the class whose classifier separates the data with the largest margin. Alternatively, a set of one-versus-one classifiers can be built, and the class with the highest number of votes is chosen. The latter approach involves constructing $M(M-1)/2$ classifiers.
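As a small illustration of the one-versus-rest extension, assuming one fitted Lattice-SVM decision function per class (the scorers list below is a hypothetical helper, not part of the chapter's implementation):

```python
import numpy as np

def predict_one_vs_rest(x, scorers):
    """Pick the class whose one-versus-rest classifier yields the largest margin f_m(x)."""
    scores = np.array([f_m(x) for f_m in scorers])   # one decision value per class
    return int(np.argmax(scores))
```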

There are several directions to enhance the Lattice-SVM model. Utilizing higher-dimensional lattices could enable capturing interactions or more complex joint effects among input features. Additionally, developing an automatic procedure to determine the optimal number of knots on each feature by incorporating penalty terms could further refine the model. Given that Lattice-SVM optimization relies on the Pegasos algorithm, which is a stochastic gradient descent method, future research could focus on optimizing the selection of regularization parameters and stopping rules to strike a balance between model accuracy and computational efficiency.


Conflict of interest

The authors declare no conflict of interest.


Disclaimer

The opinions in this paper are strictly those of the authors and do not represent the views of Wells Fargo & Company, or any of their subsidiaries or affiliates.

References

  1. Cano J-R, Gutiérrez PA, Krawczyk B, Woźniak M, García S. Monotonic classification: An overview on algorithms, performance measures and data sets. Neurocomputing. 2019;341:168-182
  2. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. New York, NY: Springer; 2009
  3. Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995;20(3):273-297
  4. Ogundokun RO, Maskeliūnas R, Misra S, Damasevicius R. Hybrid InceptionV3-SVM-based approach for human posture detection in health monitoring systems. Algorithms. 2022;15(11):410
  5. Zhang D, Zheng S, Fu W. Research on the prediction model of Chinese tax revenue based on GM (1,1) and LSSVM. Information Technology and Control. 2023;52(4):811-818
  6. Duivesteijn W, Feelders A. Nearest neighbour classification with monotonicity constraints. ECML/PKDD Lecture Notes in Computer Science. 2008;5211:301-316
  7. Qian Y, Xu H, Liang J, Wang J, Liu B. Fusing monotonic decision trees. IEEE Transactions on Knowledge and Data Engineering. 2015;27(10):2717-2728
  8. Chen C, Li ST. Credit rating with a monotonicity-constrained support vector machine model. Expert Systems with Applications. 2014;41(16):7235-7247
  9. Zhu H, Tsang E, Wang XZ, Ashfaq R. Monotonic classification extreme learning machine. Neurocomputing. 2017;225:205-213
  10. Daniels H, Velikova M. Monotone and partially monotone neural networks. IEEE Transactions on Neural Networks. 2010;21(6):906-917
  11. Best MJ, Chakravarti N. Active set algorithms for isotonic regression; a unifying framework. Mathematical Programming. 1990;47:425-439
  12. Gutierrez P, Perez-Ortiz M, Sanchez-Monedero J, Fernandez-Navarro F, Hervas-Martinez C. Ordinal regression methods: Survey and experimental study. IEEE Transactions on Knowledge and Data Engineering. 2015;28(1):127-146
  13. Hu Q, Che X, Zhang L, Zhang D, Guo M, Yu D. Rank entropy-based decision trees for monotonic classification. IEEE Transactions on Knowledge and Data Engineering. 2011;24(11):2052-2064
  14. Potharst R, Feelders AJ. Classification trees for problems with monotonicity constraints. ACM SIGKDD Explorations Newsletter. 2002;4(1):1-10
  15. Chen Y, Samworth RJ. Generalized additive and index models with shape constraints. Journal of the Royal Statistical Society, Series B. 2016;78(4):729-754
  16. Bartley C, Liu W, Reynolds M. Effective monotone knowledge integration in kernel support vector machine. In: Proceedings of the 12th International Conference on Advanced Data Mining and Applications. Cham: Springer; 2016. pp. 3-18
  17. Xu G, Deng G, Wang X, Fu K. Automatic spline knot selection in modeling mortgage loan default using shape constrained regression. The Journal of Structured Finance. 2021;27(3):18-36
  18. Shalev-Shwartz S, Singer Y, Srebro N. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming. 2011;127(1):3-30
  19. Spall JC. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. Hoboken, NJ: John Wiley & Sons; 2005
  20. Garcia E, Gupta M. Lattice regression. Advances in Neural Information Processing Systems (NIPS). 2009;22:1-9
  21. Garcia EK, Arora R, Gupta MR. Optimized regression for efficient function evaluation. IEEE Transactions on Image Processing. 2012;21(9):4128-4140
  22. Gupta M, Cotter A, Pfeifer J, Voevodski K, Canini K, Mangylov A, et al. Monotonic calibrated interpolated look-up tables. Journal of Machine Learning Research. 2016;17(109):1-47
  23. Deng G, Xie Y, Wang X, Fu Q. Lattice linear discriminant analysis for shape constrained classification. In: Fuzzy Systems and Data Mining VIII: Proceedings of FSDM 2022. 2022;358:80-96
  24. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'16. San Francisco: ACM; 2016
  25. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems. 2017;30:3146-3154
  26. Deng G, Xu G, Yang Z, Liang Y, Wang X, Fu Q, et al. The anatomy of mortgage default using shape-constrained explainable machine learning model. The Journal of Financial Data Science. 2023;5(4):66-85
  27. Yang Z, Zhang A, Sudjianto A. GAMI-Net: An explainable neural network based on generalized additive models with structured interactions. Pattern Recognition. 2021;120:108192
  28. Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences. 1997;55:119-139
  29. Dua D, Graff C. UCI Machine Learning Repository; 2017

Notes

  • In TensorFlow Lattice, the piecewise linear function is referred to as a calibrator.
