
Perspective Chapter: Linear Regression and Logistic Regression Models

Written By

Dilip Kumar Ghosh

Submitted: 21 September 2023 Reviewed: 23 September 2023 Published: 22 May 2024

DOI: 10.5772/intechopen.1003183

From the Edited Volume

Recent Advances in Biostatistics

B. Santhosh Kumar


Abstract

In this chapter, we discuss the concepts of simple linear regression and logistic regression analysis in detail. We further describe the procedures for computing regression coefficients, standard errors, the t test, the Z test, p values, and 95% confidence intervals for simple linear regression and logistic regression analysis. We also explain that the simple linear regression coefficient is tested with a t test, whereas the logistic regression coefficient is tested with a Z test. Several examples on medical data are considered, and the related statistics are computed manually, with the RStudio package, and with Jamovi.

Keywords

  • regression model
  • logistic function
  • odds ratio
  • scatter diagram
  • regression coefficients
  • estimators
  • predicted value

1. Introduction

The method of linear regression is used to predict the value of one variable based on the value of another variable. The variable to be predicted is known as the dependent variable, whereas the variable used to predict it is known as the independent variable. The method may contain one or more explanatory variables: a model with only one explanatory variable is called simple linear regression; otherwise, it is a multiple linear regression model. The regression coefficients involved in the linear equation are estimated using the least squares method of estimation. Once the regression coefficients are estimated, you can fit a model that predicts the value of the dependent variable. Linear regression is used to establish a linear relation between the response and explanatory variables in the biological, behavioral, environmental and social sciences, business, etc. Montgomery et al. [1], Rencher and Schaalje [2], Swaminathan [3] and Lane [4] discussed regression analysis with several examples. Further, Noce and McKeown [5] discussed a logistic modeling of factors influencing internet use, while Seo et al. [6] discussed the relations between physical activity and behavioral and perceptual correlates.

Logistic regression is a statistical method used to establish a relationship between one dependent variable and one or more explanatory variables, where the dependent variable is dichotomous.


2. Simple linear regression

Suppose a random sample $(x_{i1}, x_{i2}, \ldots, x_{ik}, y_i)$, where $i = 1, 2, \ldots, n$, of size n is drawn from a population. The random variables $(x_1, x_2, \ldots, x_k)$ are generally known as predictor variables. However, depending upon the situation and from a practical point of view, these random variables are also known by different names: independent variables, covariates, regressors and explanatory variables. The variable y is called the response variable; sometimes y is also known as the dependent variable or outcome variable. Suppose we wish to establish a linear regression between the response variable y and the explanatory variables $(x_1, x_2, \ldots, x_k)$; then it can be represented by the model

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + e$  (1)

where $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are parameters known as regression coefficients, and e is a random error which is normally distributed with mean zero and variance $\sigma^2$.

For example, the effect of age, weight, height and walking habit on systolic blood pressure. This model is called a multiple regression model, as there is more than one predictor variable.

Suppose we are interested in a bivariate regression model, where Y is the response variable and X is the predictor variable. This model is called the simple linear regression model. It is represented by

$Y = \beta_0 + \beta_1 X + e$  (2)

where Y is the response variable, X is the predictor variable, $\beta_0$ is the intercept, $\beta_1$ is the regression coefficient, and e is a random error which is normally distributed with mean zero and variance $\sigma^2$. For example, the effect of age on systolic blood pressure.


3. Scatter diagram

A scatter diagram is a two-dimensional graph of the response variable (Y) against the predictor variable (X). A scatter diagram provides a rough idea about the relationship between the response and predictor variables. The following are the steps for drawing a scatter diagram:

  1. Select the horizontal axis (X) and vertical axis (Y).

  2. Take the response variable on the Y-axis and the predictor variable on the X-axis.

  3. Plot a point at the position corresponding to each pair (X, Y).

In the scatter diagram, if the observations are scattered approximately around a straight line, this shows a linear relationship between the response and predictor variables. Once the relationship is established, one can use the simple linear regression model to quantify the relationship between the variables. However, if the observations are not scattered around a straight line, there is no linear relationship between the two variables. In such a situation, one can use a transformation or a non-linear regression method to find the best fitted regression model.

Example 1: A sample of 15 men in the age group 30–70 was collected to investigate the effect of the patients' weight on the blood sugar level of diabetic patients. The data on the blood sugar level (mg/dl) and weight (in kg) of the 15 men are shown in Table 1.

S. no. | Blood sugar level (mg/dl) | Weight of the patient (kg)
1 | 146 | 50
2 | 145 | 48
3 | 141 | 46
4 | 168 | 69
5 | 132 | 40
6 | 190 | 80
7 | 180 | 70
8 | 130 | 38
9 | 181 | 75
10 | 148 | 59
11 | 110 | 30
12 | 147 | 54
13 | 146 | 51
14 | 120 | 35
15 | 155 | 65

Table 1.

Blood sugar level and weight of 15 men.

Draw the scatter diagram and give your interpretation about the data.

From Figure 1, it is clear that approximately all 15 observations are scattered around a straight line. Hence, there is a linear relationship between blood sugar level and weight of the person.

Figure 1.

Scatter diagram.
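The scatter diagram of Figure 1 can be reproduced with a few lines of R. This is a minimal sketch assuming the Table 1 data are typed in by hand; the variable names sugar and weight are our own.

# Blood sugar level (mg/dl) and weight (kg) of the 15 men in Table 1
sugar <- c(146, 145, 141, 168, 132, 190, 180, 130, 181, 148, 110, 147, 146, 120, 155)
weight <- c(50, 48, 46, 69, 40, 80, 70, 38, 75, 59, 30, 54, 51, 35, 65)

# Scatter diagram with the predictor on the X-axis and the response on the Y-axis
plot(weight, sugar, xlab = "Weight (kg)", ylab = "Blood sugar level (mg/dl)")
abline(lm(sugar ~ weight))  # overlay the least squares line for reference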

3.1 Assumptions underlying linear regression model

Before applying any statistical method, we should first study its underlying assumptions. The assumptions of the simple linear regression model are the following:

  1. The regression model is assumed to be linear in parameters.

  2. The error term e is assumed to be normally distributed with mean 0 and variance $\sigma^2$, i.e., $e \sim N(0, \sigma^2)$.

3.2 Estimation of parameters

For regression model (2), $\beta_0$ and $\beta_1$ are unknown constant parameters, which can be estimated using the least squares method of estimation. Model (2) can be rewritten as

$y_i = \beta_0 + \beta_1 x_i + e_i, \quad i = 1, 2, \ldots, n.$  (3)

$E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$  (4)

On differentiating the error sum of squares (E) with respect to β0 and β1 and then equating them to zero, we can obtain the least squares estimator of β0 and β1.

$\dfrac{\partial E}{\partial \beta_0} = -2\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i) = 0$  (5)

$\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i) = 0$  (6)

So, $\sum_{i=1}^{n} y_i = n\beta_0 + \beta_1 \sum_{i=1}^{n} x_i$  (7)

Similarly, $\dfrac{\partial E}{\partial \beta_1} = -2\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)x_i = 0$  (8)

$\sum_{i=1}^{n}(y_i x_i - \beta_0 x_i - \beta_1 x_i^2) = 0$  (9)

So, $\sum_{i=1}^{n} y_i x_i = \beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2$  (10)

On solving (7) and (10), we get

$\hat\beta_1 \left( n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 \right) = n\sum_{i=1}^{n} y_i x_i - \sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i$  (11)

Hence, $\hat\beta_1 = \dfrac{n\sum_{i=1}^{n} y_i x_i - \sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$  (12)

Or, $\hat\beta_1 = \dfrac{\sum_{i=1}^{n} y_i x_i - n\bar y \bar x}{\sum_{i=1}^{n} x_i^2 - n\bar x^2}$  (13)

Eq. (12) can easily be written as.

$\hat\beta_1 = \dfrac{\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n}(x_i - \bar x)^2}$  (14)

On dividing the numerator and denominator of the above equation by n, we have,

$\hat\beta_1 = \dfrac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}$  (15)

From (7), we have $n\bar y = n\hat\beta_0 + n\hat\beta_1 \bar x$.

So, $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$  (16)

Here, $\hat\beta_0$ and $\hat\beta_1$ are the least squares estimators of the intercept $\beta_0$ and the slope $\beta_1$. In a regression model, the slope is called the regression coefficient; henceforth, $\beta_1$ will be called the regression coefficient. Thus, for the linear regression method, the fitted regression model is given by

$\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$  (17)
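As an illustration, the closed-form estimators (12) and (16) can be computed directly in R. This is a sketch using the blood pressure data of Example 2 below; the names beta0 and beta1 are ours.

y <- c(125, 123, 120, 181, 105, 190, 185, 118, 175, 168, 110, 130, 128, 116, 179)
x <- c(50, 48, 46, 73, 40, 80, 75, 45, 74, 69, 43, 54, 51, 44, 72)
n <- length(y)

# Slope from Eq. (12) and intercept from Eq. (16)
beta1 <- (n * sum(x * y) - sum(y) * sum(x)) / (n * sum(x^2) - sum(x)^2)
beta0 <- mean(y) - beta1 * mean(x)
c(beta0, beta1)  # 17.037634 and 2.196106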

In the matrix notation, the linear model can be written as

$Y = X\beta + e$  (18)

where Y is an n × 1 vector of observations, X is an n × 2 matrix, β is a 2 × 1 vector of parameters, and e is an n × 1 vector of random errors.

Using the least squares method of estimation, the normal equation is obtained as

$X'Y = X'X\beta$  (19)

On multiplying both sides of (19) by $(X'X)^{-1}$, we have

$(X'X)^{-1}X'Y = (X'X)^{-1}X'X\beta$  (20)

Hence, $\hat\beta = (X'X)^{-1}X'Y$  (21)

where $\hat\beta$ is the estimate of the regression coefficient vector β.

In that case, the fitted regression model is given by

$\hat Y = \hat\beta_0 + \hat\beta_1 X$  (22)

Example 2: A sample of 15 men in the age group 30–70 was collected to investigate the effect of weight (in kg) on the blood pressure level of diabetic patients. The data on the blood pressure level (mm/Hg) and weight (in kg) of the 15 men are shown in Table 2; the corresponding computations are shown in Table 3.

S. no. | Blood pressure level (mm/Hg) | Weight (kg)
1 | 125 | 50
2 | 123 | 48
3 | 120 | 46
4 | 181 | 73
5 | 105 | 40
6 | 190 | 80
7 | 185 | 75
8 | 118 | 45
9 | 175 | 74
10 | 168 | 69
11 | 110 | 43
12 | 130 | 54
13 | 128 | 51
14 | 116 | 44
15 | 179 | 72

Table 2.

Blood pressure level and weight of 15 men.

y | x | y² | x² | xy | Predicted value of y | Residual | Squared residual
125 | 50 | 15625 | 2500 | 6250 | 126.8429 | −1.8429 | 3.396390985
123 | 48 | 15129 | 2304 | 5904 | 122.4507 | 0.549282 | 0.301710716
120 | 46 | 14400 | 2116 | 5520 | 118.0585 | 1.941494 | 3.769398952
181 | 73 | 32761 | 5329 | 13213 | 177.3534 | 3.646632 | 13.29792494
105 | 40 | 11025 | 1600 | 4200 | 104.8819 | 0.11813 | 0.013954697
190 | 80 | 36100 | 6400 | 15200 | 192.7261 | −2.72611 | 7.431675732
185 | 75 | 34225 | 5625 | 13875 | 181.7456 | 3.25442 | 10.59124954
118 | 45 | 13924 | 2025 | 5310 | 115.8624 | 2.1376 | 4.56933376
175 | 74 | 30625 | 5476 | 12950 | 179.5495 | −4.54947 | 20.69771368
168 | 69 | 28224 | 4761 | 11592 | 168.5689 | −0.56894 | 0.323697275
110 | 43 | 12100 | 1849 | 4730 | 111.4702 | −1.47019 | 2.161452755
130 | 54 | 16900 | 2916 | 7020 | 135.6274 | −5.62735 | 31.66711304
128 | 51 | 16384 | 2601 | 6528 | 129.039 | −1.03904 | 1.079595809
116 | 44 | 13456 | 1936 | 5104 | 113.6663 | 2.333706 | 5.446183694
179 | 72 | 32041 | 5184 | 12888 | 175.1573 | 3.842738 | 14.76663534
Sum: 2153 | 864 | 322919 | 52622 | 130284 | 2153 | | 119.5140309

Table 3.

Predicted values and residuals.

where residual = (y − predicted value of y).

Regression coefficient: $\hat\beta_1 = \dfrac{n\sum y_i x_i - \sum y_i \sum x_i}{n\sum x_i^2 - (\sum x_i)^2} = \dfrac{15 \times 130284 - 2153 \times 864}{15 \times 52622 - 864^2} = \dfrac{1954260 - 1860192}{789330 - 746496} = \dfrac{94068}{42834}$  (23)

Hence, $\hat\beta_1 = 2.196106$.

$\hat\beta_0 = \bar y - \hat\beta_1 \bar x = 143.5333 - 2.196106 \times 57.6 = 17.03763.$

The fitted regression model is given by

$\hat Y = 17.03763 + 2.196106\,X$  (24)

r = 0.996 and R2 = 0.991.

From the model coefficients (Table 4), it is obvious that the regression coefficient of the predictor weight is highly significant, as p < 0.001. Also, R² is very close to 1. This indicates that as the weight of the patient increases, blood pressure increases; that is, when weight is kept under control, blood pressure stays normal or near normal. In this example, when the weight ranges from 40 to 54 kg, the blood pressure ranges from 105 to 130 mm/Hg, whereas when the weight ranges from 69 to 80 kg, the blood pressure ranges from 168 to 190 mm/Hg. Thus, the higher the weight, the higher the blood pressure.

Predictor | Estimate | SE | t | P
Intercept | 17.03763 | 3.3607 | 5.07 | <0.001
Weight (β1) | 2.196106 | 0.0567 | 38.70 | <0.001

Table 4.

Model coefficients.

3.3 Regression coefficient using R studio package

> y <- c(125, 123, 120, 181, 105, 190, 185, 118, 175, 168, 110, 130, 128, 116, 179)
> x <- c(50, 48, 46, 73, 40, 80, 75, 45, 74, 69, 43, 54, 51, 44, 72)
> result <- data.frame(y, x)
> z <- lm(y ~ x, data = result)
> summary(z)


4. Forecasting or predicted value of Y

Using the fitted model (24), we can easily forecast the blood pressure level corresponding to a given weight. Suppose the weight of a patient is 82 kg; then the predicted blood pressure is $\hat Y = 17.03763 + 2.196106 \times 82 = 197.1183$.
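In R, the same forecast can be obtained with predict(), assuming z is the fitted object from Section 3.3; the new weight is supplied as a one-row data frame.

# Assuming z <- lm(y ~ x, data = result) from Section 3.3
predict(z, newdata = data.frame(x = 82))  # about 197.12 mm/Hg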

Residuals: We can obtain the residuals of all 15 patients using $(Y - \hat Y)$. These values are shown in Table 3.

R squared: We can compute R² using the residual sum of squares from the expression

$R^2 = 1 - \dfrac{\text{residual sum of squares}}{\text{total sum of squares of } y}$  (25)

where residual sum of squares $= \sum (Y - \hat Y)^2 = 119.5140309$, as shown in Table 3.

Total sum of squares of y $= \sum_{i=1}^{n} y_i^2 - \dfrac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = 322919 - \dfrac{2153^2}{15} = 13891.73333.$

So, $R^2 = 1 - \dfrac{119.5140309}{13891.73333} = 0.9914.$
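A quick R check of this computation, again assuming z is the fitted model from Section 3.3:

rss <- sum(residuals(z)^2)   # residual sum of squares, 119.514
tss <- sum((y - mean(y))^2)  # total sum of squares of y, 13891.73
1 - rss / tss                # 0.9914, matching summary(z)$r.squared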

We can also obtain the regression coefficients using the matrix form as follows:

$X' = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 50 & 48 & 46 & 73 & 40 & 80 & 75 & 45 & 74 & 69 & 43 & 54 & 51 & 44 & 72 \end{pmatrix}$

$X'X = \begin{pmatrix} 15 & 864 \\ 864 & 52622 \end{pmatrix}, \quad X'Y = \begin{pmatrix} 2153 \\ 130284 \end{pmatrix}, \quad (X'X)^{-1} = \begin{pmatrix} 1.22851006 & -0.0201708923 \\ -0.0201708923 & 0.0003501891 \end{pmatrix}$

$\hat\beta = (X'X)^{-1}X'Y = \begin{pmatrix} 17.037634 \\ 2.196106 \end{pmatrix}$, so $\hat\beta_0 = 17.037634$ and $\hat\beta_1 = 2.196106$.

4.1 R studio program to obtain the estimate of regression coefficients

> X <- matrix(c(rep(1, 15), 50, 48, 46, 73, 40, 80, 75, 45, 74, 69, 43, 54, 51, 44, 72), ncol = 2)
> Y <- c(125, 123, 120, 181, 105, 190, 185, 118, 175, 168, 110, 130, 128, 116, 179)
> XtX <- t(X) %*% X           # X'X
> XtY <- t(X) %*% Y           # X'Y
> beta <- solve(XtX) %*% XtY  # (X'X)^(-1) X'Y
> beta


5. Regression lines

Let the regression model be denoted by Y = a + bX; then the two types of regression lines are the following:

  1. Regression line of Y on X, with slope denoted by $b_{YX}$

  2. Regression line of X on Y, with slope denoted by $b_{XY}$

$b_{YX} = \dfrac{n\sum_{i=1}^{n} y_i x_i - \sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} = 2.196106$  (26)

The regression line of X on Y has slope $b_{XY} = \dfrac{n\sum_{i=1}^{n} y_i x_i - \sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i}{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2} = 0.4514$  (27)
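Both slopes share the same numerator and can be computed together in R; a sketch reusing the x and y vectors of Example 2:

n <- length(y)
Sxy <- n * sum(x * y) - sum(x) * sum(y)  # common numerator of (26) and (27)
bYX <- Sxy / (n * sum(x^2) - sum(x)^2)   # 2.196106
bXY <- Sxy / (n * sum(y^2) - sum(y)^2)   # 0.4514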

6. Expectation and variance of estimators

$E(\hat\beta_0) = \beta_0$ and $E(\hat\beta_1) = \beta_1$  (28)

This shows that the least squares estimators $\hat\beta_0$ and $\hat\beta_1$ are unbiased estimators of $\beta_0$ and $\beta_1$, respectively.

$V(\hat\beta_0) = \sigma^2\left(\dfrac{1}{n} + \dfrac{\bar x^2}{\text{sum of squares of } x}\right)$  (29)

and $V(\hat\beta_1) = (X'X)^{-1}\sigma^2 = \dfrac{\sigma^2}{\text{sum of squares of } x}.$  (30)

Here, $\sigma^2$ is unknown. It is estimated from the given data as

$\hat\sigma^2 = \dfrac{\sum (y_i - \hat y_i)^2}{n-2}$, and the sum of squares of x is obtained as

Sum of squares of x $= \sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$, where n is the number of observations.

Thus, when $\sigma^2$ is unknown, $V(\hat\beta_0)$ and $V(\hat\beta_1)$ are determined as

$V(\hat\beta_0) = \hat\sigma^2\left(\dfrac{1}{n} + \dfrac{\bar x^2}{\text{sum of squares of } x}\right)$ and  (31)

$V(\hat\beta_1) = \dfrac{\hat\sigma^2}{\text{sum of squares of } x}.$  (32)

Again, when $\sigma^2$ is unknown, the standard errors of $\hat\beta_0$ and $\hat\beta_1$ are determined as

$SE(\hat\beta_0) = \sqrt{V(\hat\beta_0)} = \sqrt{\hat\sigma^2\left(\dfrac{1}{n} + \dfrac{\bar x^2}{\text{sum of squares of } x}\right)}$ and  (33)

$SE(\hat\beta_1) = \sqrt{V(\hat\beta_1)} = \sqrt{\dfrac{\hat\sigma^2}{\text{sum of squares of } x}}.$  (34)

Example 3: Consider example 2. Compute the estimates of $V(\hat\beta_0)$ and $V(\hat\beta_1)$ and their standard errors.

Solution: From example 2, we have $\hat\beta_0 = 17.037634$ and $\hat\beta_1 = 2.196106$. Sum of squares of x $= \sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = 52622 - \dfrac{864^2}{15} = 52622 - 49766.4 = 2855.6.$

$\hat\sigma^2 = \dfrac{\sum_{i=1}^{n} (y_i - \hat y_i)^2}{n-2} = \dfrac{119.5140}{13} = 9.1934.$

$\bar x = \dfrac{\sum_{i=1}^{n} x_i}{n} = \dfrac{864}{15} = 57.6.$

Estimates of $V(\hat\beta_0)$ and $V(\hat\beta_1)$ can be determined from

$V(\hat\beta_0) = \hat\sigma^2\left(\dfrac{1}{n} + \dfrac{\bar x^2}{\text{sum of squares of } x}\right) = 9.1934\left(\dfrac{1}{15} + \dfrac{57.6^2}{2855.6}\right)$  (35)

So, $V(\hat\beta_0) = 11.2941844.$

$V(\hat\beta_1) = \dfrac{\hat\sigma^2}{\text{sum of squares of } x} = \dfrac{9.1934}{2855.6} = 0.003219428$  (36)

$SE(\hat\beta_0) = \sqrt{V(\hat\beta_0)} = \sqrt{11.2941844} = 3.360682133,$

$SE(\hat\beta_1) = \sqrt{V(\hat\beta_1)} = \sqrt{0.003219428} = 0.056740004.$
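These computations can be verified in R; a minimal sketch using the Example 2 data already stored in x and y:

fit <- lm(y ~ x)
sigma2 <- sum(residuals(fit)^2) / (length(y) - 2)  # estimate of sigma^2, 9.1934
Sxx <- sum(x^2) - sum(x)^2 / length(x)             # sum of squares of x, 2855.6
V_b0 <- sigma2 * (1 / length(x) + mean(x)^2 / Sxx) # Eq. (31)
V_b1 <- sigma2 / Sxx                               # Eq. (32)
sqrt(c(V_b0, V_b1))                                # standard errors 3.3607 and 0.05674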

7. Testing of hypothesis for the estimated regression coefficients $\hat\beta_0$ and $\hat\beta_1$

We test the hypothesis that the sample comes from a population in which the value of $\beta_0$ is equal to 0. That is,

our null hypothesis is $H_0: \beta_0 = 0$

against the alternative hypothesis $H_1: \beta_0 \neq 0$.

Under the null hypothesis, for testing $H_0: \beta_0 = 0$, we define the test statistic

$t = \dfrac{\hat\beta_0 - E(\hat\beta_0)}{SE(\hat\beta_0)} = \dfrac{\hat\beta_0 - \beta_0}{SE(\hat\beta_0)} = \dfrac{\hat\beta_0 - 0}{SE(\hat\beta_0)}$  (37)

$= \dfrac{\hat\beta_0}{\sqrt{\hat\sigma^2\left(\frac{1}{n} + \frac{\bar x^2}{\text{sum of squares of } x}\right)}}$, where $\sigma^2$ is unknown.  (38)

Similarly, we test the hypothesis that the sample comes from a population in which the value of $\beta_1$ is equal to 0. That is,

our null hypothesis is $H_0: \beta_1 = 0$

against the alternative hypothesis $H_1: \beta_1 \neq 0$.

Thus, under the null hypothesis, for testing $H_0: \beta_1 = 0$, we define the test statistic

$t = \dfrac{\hat\beta_1 - E(\hat\beta_1)}{SE(\hat\beta_1)} = \dfrac{\hat\beta_1 - \beta_1}{SE(\hat\beta_1)} = \dfrac{\hat\beta_1 - 0}{SE(\hat\beta_1)} = \dfrac{\hat\beta_1}{\sqrt{\hat\sigma^2 / \text{sum of squares of } x}}$  (39)

where $\sigma^2$ is unknown, and the statistic t follows Student's t distribution with (n − 2) degrees of freedom.

Example 4: Consider the data given in example 2 on the effect of weight on the blood pressure of 15 patients. Test the null hypotheses for the significance of the regression coefficients $\hat\beta_0$ and $\hat\beta_1$ at the 5% level of significance.

Solution: From example 2, we have,

$\hat\beta_0 = 17.037634$, $SE(\hat\beta_0) = 3.360682133$, and

$\hat\beta_1 = 2.196106$, $SE(\hat\beta_1) = 0.056740004$.

Under the null hypothesis, for testing H0:β0=0, we have,

$t = \dfrac{\hat\beta_0 - E(\hat\beta_0)}{SE(\hat\beta_0)} = \dfrac{\hat\beta_0 - \beta_0}{SE(\hat\beta_0)} = \dfrac{\hat\beta_0 - 0}{SE(\hat\beta_0)}$  (40)

where $\sigma^2$ is unknown.

$t = \dfrac{17.037634}{3.360682133} = 5.069695.$

Under the null hypothesis, for testing H0:β1=0, we have

$t = \dfrac{\hat\beta_1 - E(\hat\beta_1)}{SE(\hat\beta_1)} = \dfrac{\hat\beta_1 - \beta_1}{SE(\hat\beta_1)} = \dfrac{\hat\beta_1 - 0}{SE(\hat\beta_1)}$, where $\sigma^2$ is unknown.  (41)

$t = \dfrac{2.196106}{0.056740004} = 38.70472.$

With α = 0.05 for a two-sided test, the tabulated value of t at 13 degrees of freedom is 2.160. In the case of $H_0: \beta_0 = 0$, the calculated value of t (= 5.069695) is greater than the tabulated value of t, so the test is significant. That is, we reject the null hypothesis. Hence, we may conclude that the value of $\beta_0$ is not equal to zero.

Similarly, in the case of $H_0: \beta_1 = 0$, the calculated value of t (= 38.70472) is greater than the tabulated value of t, so the test is highly significant. That is, we reject the null hypothesis. Hence, we may conclude that the value of $\beta_1$ is not equal to zero. Thus, the fitted simple regression model is highly significant. In other words, as the weight of the patient increases, blood pressure tends to increase.
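The two t tests can be reproduced in R, with two-sided p values in place of the tabulated cut-off; a sketch using the estimates and standard errors above:

t0 <- 17.037634 / 3.360682133  # 5.069695
t1 <- 2.196106 / 0.056740004   # 38.70472
2 * pt(-abs(c(t0, t1)), df = 13)  # both two-sided p values are below 0.001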

Alternatively, we can also test the significance of regression coefficient β1 using the analysis of variance Table 5.

Sources of variation | Degrees of freedom | Sum of squares | Mean squares | F-ratio | P-value
Regression | 1 | $\sum(\hat y_i - \bar y)^2$ | SSR/df = MSR | MSR/MSE |
Error | n − 2 | $\sum(y_i - \hat y_i)^2$ | SSE/df = MSE | |
Total | n − 1 | $\sum(y_i - \bar y)^2$ | | |

Table 5.

Analysis of variance.

Example 5: Consider the data given in example 2 on the effect of weight on the blood pressure of 15 patients. Test the null hypothesis for the significance of the regression coefficient $\hat\beta_1$ at the 5% level of significance using the analysis of variance layout of Table 5.

Using Table 3, we have Table 6.

y | x | Predicted value of y ($\hat y_i$) | $(\hat y_i - \bar y)^2$ | $(y_i - \hat y_i)^2$ | $(y_i - \bar y)^2$
125 | 50 | 126.8429 | 278.56845 | 3.396390985 | 343.4832089
123 | 48 | 122.4507 | 444.475264 | 0.301710716 | 421.6164089
120 | 46 | 118.0585 | 648.965129 | 3.769398952 | 553.8162089
181 | 73 | 177.3534 | 1143.797 | 13.29792494 | 1403.753609
105 | 40 | 104.8819 | 1493.93304 | 0.013954697 | 1484.815209
190 | 80 | 192.7261 | 2419.93256 | 7.431675732 | 2159.154209
185 | 75 | 181.7456 | 1460.17834 | 10.59124954 | 1719.487209
118 | 45 | 115.8624 | 765.678707 | 4.56933376 | 651.9494089
175 | 74 | 179.5495 | 1297.16479 | 20.69771368 | 990.1532089
168 | 69 | 168.5689 | 626.78347 | 0.323697275 | 598.6194089
110 | 43 | 111.4702 | 1028.04315 | 2.161452755 | 1124.482209
130 | 54 | 135.6274 | 62.50398223 | 31.66711304 | 183.1502089
128 | 51 | 129.039 | 210.083689 | 1.079595809 | 241.2834089
116 | 44 | 113.6663 | 892.038047 | 5.446183694 | 758.0826089
179 | 72 | 175.1573 | 1000.07497 | 14.76663534 | 1257.886809
Sum: 2153 | 864 | 2153 | 13772.2206 | 119.5140309 | 13891.73333

Table 6.

Regression, error and total sum of squares.

Using Table 6, we can obtain the ANOVA table as shown in Table 7.

Sources of variation | Degrees of freedom | Sum of squares | Mean squares | F-ratio | P-value
Regression | 1 | 13772.2206 | 13772.2206 | 1498.057 | <0.001
Error | 13 | 119.5140309 | 9.193387 | |
Total | 14 | 13891.73333 | | |

Table 7.

Analysis of variance.

From Table 7, we observe that the p value for the regression coefficient is less than 0.001 and the calculated value of F is very large; it is greater than the tabulated value of F with (1, 13) degrees of freedom at the 5% level of significance, which is 4.67. This shows that the test is significant, and hence we reject the null hypothesis.

7.1 R studio program for obtaining ANOVA table

> y <- c(125, 123, 120, 181, 105, 190, 185, 118, 175, 168, 110, 130, 128, 116, 179)
> x <- c(50, 48, 46, 73, 40, 80, 75, 45, 74, 69, 43, 54, 51, 44, 72)
> result <- data.frame(y, x)
> av <- aov(y ~ x, data = result)
> summary(av)


8. Confidence interval of estimated regression coefficients

Now we discuss how to obtain a 100(1 − α)% confidence interval for the estimated regression coefficients. In fact, we are interested in the confidence interval for the regression coefficient $\beta_1$ only. We generally compute a confidence interval to determine a range: in the case of $\beta_1$, we wish to determine its lower and upper limits. We can obtain the 100(1 − α)% confidence interval for $\beta_1$ as

Confidence interval for $\beta_1$ = $\hat\beta_1 \pm t_{n-2,\,\alpha/2}\, SE(\hat\beta_1)$ for a two-tailed test.

Example 6: Consider the data given in example 2 on the effect of weight on the blood pressure of 15 patients. Obtain the 95% confidence interval for the regression coefficient $\hat\beta_1$.

Solution: For this data from example 2, we have,

$\hat\beta_1 = 2.196106$, $SE(\hat\beta_1) = 0.056740004$, and $t_{n-2,\,\alpha/2} = 2.160$.

Confidence interval for $\beta_1$ = $2.196106 \pm 2.160 \times 0.056740004 = 2.196106 \pm 0.122558409$.

So, lower confidence limit of β1 is 2.196106 − 0.122558409 = 2.073547489.

Upper confidence limit of β1 is 2.196106 + 0.122558409 = 2.318664306.

Thus, the confidence interval for $\beta_1$ ranges from 2.073547489 to 2.318664306.

If one is interested in determining the confidence limits of $\beta_0$, the same procedure can be used.
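In R, confint() applied to the fitted object z of Section 3.3 reproduces these limits directly:

confint(z, level = 0.95)  # the row for x gives (2.0735, 2.3187)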


9. Logistic regression

Logistic regression is used to obtain odds ratios in the presence of more than one explanatory variable. The procedure is quite similar to multiple linear regression, with the exception that the response variable is binary. The result is the impact of each variable on the odds ratio of the observed event of interest.

We can also say that logistic regression is used for predicting binary outcomes on the basis of one or more predictor variables. The concept of logistic regression is similar to that of ordinary multiple linear regression: we wish to fit a best model that determines the relationship between a response variable and one or more explanatory variables. As in ordinary linear regression, the model is linear with respect to the regression parameters. The only difference between the two regressions is the following: in logistic regression the response variable is binary (also called dichotomous), whereas in ordinary linear regression it is continuous. Logistic regression can thus be viewed as a predictive algorithm in which explanatory variables are used to predict the dependent variable, just as in linear regression, with the simple difference that the dependent variable in logistic regression is a categorical variable.


10. Logistic function

To understand logistic regression, we must first determine the logistic function.

Let us consider the equation of the best fit model in the simple linear regression as

$y = \beta_0 + \beta_1 x$  (42)

where y is the response variable and x is the explanatory variable.

Let us replace y by probability P which is given as

$P = \beta_0 + \beta_1 x$  (43)

In (43), the value of P may be negative in some cases and greater than one in others. However, P must range from 0 to 1 only: a contradiction. To overcome this problem, we can take the odds of P instead of the probability. The odds are defined as odds = $\dfrac{P}{1-P}$, that is, the ratio of the probability of success to the probability of failure.

So, from (43), we have

$\dfrac{P}{1-P} = \beta_0 + \beta_1 x.$  (44)

As we are aware, the odds are always positive, ranging from 0 to infinity. Again, to overcome this problem, we take the log transformation, which ranges from −infinity to +infinity.

$\log\left(\dfrac{P}{1-P}\right) = \beta_0 + \beta_1 x$  (45)

On taking exponential on both sides of (45), we have

$\exp\left(\log\dfrac{P}{1-P}\right) = \exp(\beta_0 + \beta_1 x)$  (46)

That is,

$\dfrac{P}{1-P} = e^{\beta_0 + \beta_1 x}$  (47)

$P = (1 - P)\,e^{\beta_0 + \beta_1 x}$

$P + P e^{\beta_0 + \beta_1 x} = e^{\beta_0 + \beta_1 x}$

$P\left(1 + e^{\beta_0 + \beta_1 x}\right) = e^{\beta_0 + \beta_1 x}$

So, $P = \dfrac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$  (48)

On dividing the numerator and denominator of (48) by eβ0+β1x, we have

$P = \dfrac{e^{\beta_0+\beta_1 x}/e^{\beta_0+\beta_1 x}}{\left(1 + e^{\beta_0+\beta_1 x}\right)/e^{\beta_0+\beta_1 x}} = \dfrac{1}{e^{-(\beta_0+\beta_1 x)} + 1}$  (49)

Thus, $P = \dfrac{1}{1 + e^{-(\beta_0+\beta_1 x)}}$  (50)

We call (50) the logistic function.

If we consider only one explanatory variable, then the graph of simple linear regression is a straight line, whereas the graph of logistic regression is S-shaped. This is shown in Figure 2.

Figure 2.

Shape of linear and logistic regression.
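The S shape can be sketched in R with the built-in logistic distribution function plogis(t) = 1/(1 + e^(−t)); the values β0 = 0 and β1 = 1 below are illustrative choices of ours.

beta0 <- 0; beta1 <- 1  # illustrative parameter values
x <- seq(-6, 6, by = 0.1)
plot(x, plogis(beta0 + beta1 * x), type = "l",
     xlab = "x", ylab = "P")  # S-shaped logistic curve of Eq. (50)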


Example 7: A random sample of 300 women was selected, and each woman was asked whether or not she suffers from cancer. It was found that 225 women responded yes. The number of yes responses out of a sample of size n follows a binomial distribution with parameters n and p. Obtain the odds.

Solution: Out of 300 women, 225 responded yes for cancer. So the sample proportion is

$\hat P = \dfrac{225}{300} = 0.75.$

The sample proportion cannot be used directly in logistic regression; hence, we need the odds, the ratio of the proportions for the two outcomes. One outcome is "yes" and the other is "no". The proportion of yes is 0.75; hence, the proportion of no is $1 - \hat P$ = 1 − 0.75 = 0.25.

Odds of yes to no for a woman having cancer $= \dfrac{\hat P}{1-\hat P} = \dfrac{0.75}{0.25} = 3.$

Hence, the odds are 3 to 1 that a woman answers yes rather than no. Similarly, the odds of no to yes are 1 to 3.

Example 8: The sample proportion of women who were detected as cancer patient is 65%, whereas the sample proportion of men detected as cancer patient is 45%.

In this sample of young adults, it can be observed that the sample proportion of women detected as cancer patients is 20% higher than the sample proportion of men. We now wish to analyze these data using logistic regression. In this example, the predictor variable is sex, which is a categorical variable, so we need a numeric code. A convenient choice is an indicator of whether the adult is a woman. The indicator function is defined as

$x = \begin{cases} 1, & \text{if the person is a woman} \\ 0, & \text{if the person is a man} \end{cases}$  (51)

Since the response is given as a proportion, we transform it into odds. There are two odds: one for women and the other for men.

Odds for women: $\dfrac{\hat P}{1-\hat P} = \dfrac{0.65}{1-0.65} = \dfrac{0.65}{0.35} = 1.8571.$

Similarly, odds for men: $\dfrac{\hat P}{1-\hat P} = \dfrac{0.45}{1-0.45} = \dfrac{0.45}{0.55} = 0.8182.$

Now we can build the logistic regression model by considering log(odds) as the linear function of the explanatory variable. Hence, logistics model is defined as

$\log\left(\dfrac{\hat P}{1-\hat P}\right) = \beta_0 + \beta_1 x,$  (52)

where x is the explanatory variable, P is the binomial proportion, and $\beta_0$, $\beta_1$ are the parameters of the logistic regression model.

Here, there are only two values of x, and hence we write two equations: one for women and the other for men.

For women: $\log\left(\dfrac{\hat P}{1-\hat P}\right) = \beta_0 + \beta_1 \times 1$  (53)

For men: $\log\left(\dfrac{\hat P}{1-\hat P}\right) = \beta_0 + \beta_1 \times 0$  (54)

The coefficient $\beta_1$ appears in the equation for women because x = 1, and is absent from the equation for men because x = 0.

Therefore, the logistic regression equations for women and men are the following:

$\log(1.8571) = \beta_0 + \beta_1$  (55)

$\log(0.8182) = \beta_0$

$\beta_0 + \beta_1 = 0.6190$  (56)

$\beta_0 = -0.20065$  (57)

On solving (56) and (57), we have

$-0.20065 + \beta_1 = 0.6190$  (58)

Hence, $\beta_1 = 0.6190 + 0.20065 = 0.81965$.

Now the fitted logistic regression model is given by

$\log(\text{odds}_{\text{women}}) = \beta_0 + \beta_1$  (59)

So, $\text{odds}_{\text{women}} = e^{\beta_0 + \beta_1}$  (60)

Similarly, $\text{odds}_{\text{men}} = e^{\beta_0}$  (61)

$\dfrac{\text{odds}_{\text{women}}}{\text{odds}_{\text{men}}} = \dfrac{e^{\beta_0 + \beta_1}}{e^{\beta_0}} = e^{\beta_1} = e^{0.81965} = 2.269705.$

$\text{odds}_{\text{women}} = 2.269705 \times \text{odds}_{\text{men}}.$

That is, we can say that the odds for women are 2.269705 times the odds for men.
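The arithmetic of Eqs. (55)–(61) takes only a few lines of R; the variable names are ours.

odds_women <- 0.65 / 0.35        # 1.8571
odds_men   <- 0.45 / 0.55        # 0.8182
beta0 <- log(odds_men)           # -0.20065, Eq. (57)
beta1 <- log(odds_women) - beta0 #  0.81965, from Eq. (56)
exp(beta1)                       #  2.269705, the odds ratio of women to men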

Note that if we instead define the indicator function as

$x = \begin{cases} 0, & \text{if the person is a woman} \\ 1, & \text{if the person is a man} \end{cases}$  (62)

then the sign of $\beta_1$ will be negative. That is,

$\log(\text{odds}_{\text{women}}) = \beta_0.$ So, $\text{odds}_{\text{women}} = e^{\beta_0}$  (63)

Similarly, $\text{odds}_{\text{men}} = e^{\beta_0 + \beta_1}$  (64)

$\dfrac{\text{odds}_{\text{women}}}{\text{odds}_{\text{men}}} = \dfrac{e^{\beta_0}}{e^{\beta_0 + \beta_1}} = e^{-\beta_1} = e^{-0.81965} = 0.440586$

$\text{odds}_{\text{women}} = 0.440586 \times \text{odds}_{\text{men}}.$  (65)

Therefore, we can say the odds for women are 0.440586 times the odds for men.

Example 9: The hemoglobin content of 20 patients, together with their ages, was collected at a hospital to study the relationship between hemoglobin and age. The collected observations are shown in Table 8.

Hb (g/dl) | Age (years) | Anemic (1 = yes, 0 = no)
11.2 | 15 | 1
11.3 | 21 | 1
11.5 | 23 | 1
16.3 | 25 | 0
16.5 | 26 | 0
10.1 | 28 | 1
9.9 | 30 | 1
17.1 | 32 | 0
17.2 | 34 | 0
17.9 | 36 | 0
10.1 | 38 | 1
11.6 | 40 | 1
18.3 | 43 | 0
18.6 | 46 | 0
18.9 | 54 | 0
19.2 | 56 | 0
19.6 | 58 | 0
19.9 | 60 | 0
16.9 | 62 | 0
17.2 | 69 | 0

Table 8.

Level of hemoglobin and its corresponding age.

If we use simple linear regression to find the effect of age on the response variable hemoglobin, we obtain the following statistics using software (Table 9).

Predictor | Estimate | SE | t | P
Intercept | 9.382 | 1.7875 | 5.25 | <0.001
Age | 0.153 | 0.0420 | 3.64 | 0.002

Table 9.

Model coefficients.

Here the regression coefficient is significant; that is, there is a significant effect of age on hemoglobin.

As we are aware, the amount of hemoglobin in whole blood is expressed in grams per deciliter (g/dl). The normal Hb level for males is 14 to 18 g/dl, and for females 12 to 16 g/dl. When the hemoglobin level is low, the patient has anemia.

If we are interested in knowing whether a patient suffers from anemia, then we have to use the logistic regression method. For this, we transform the hemoglobin data into presence or absence of anemia. Since the data belong to women patients, a value less than 12 is coded 1 (the woman has anemia), while a value of 12 or more is coded 0 (no anemia). This is shown in column 3 of Table 8. We now fit a logistic regression between presence/absence of anemia and age using the software Jamovi. The following statistics are obtained (Table 10).

Predictor | Estimate | SE | Z | Odds ratio | P | 95% CI (lower) | 95% CI (upper)
Intercept | 3.993 | 2.1239 | 1.88 | 54.216 | 0.060 | 0.844 | 3483.577
Age | −0.130 | 0.0629 | −2.06 | 0.878 | 0.039 | 0.776 | 0.994

Table 10.

Model coefficients and 95% confidence intervals.

Since the p value of the regression coefficient (Age) is less than 0.05, the test is significant. That is, as age increases, the chance of anemia decreases.
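The same model can be fitted in R with glm(); a minimal sketch assuming the Table 8 data are entered by hand. It should reproduce the Jamovi estimates of Table 10 up to rounding; confint.default() gives the Wald interval used there.

hb <- c(11.2, 11.3, 11.5, 16.3, 16.5, 10.1, 9.9, 17.1, 17.2, 17.9,
        10.1, 11.6, 18.3, 18.6, 18.9, 19.2, 19.6, 19.9, 16.9, 17.2)
age <- c(15, 21, 23, 25, 26, 28, 30, 32, 34, 36, 38, 40, 43, 46, 54, 56, 58, 60, 62, 69)
anemic <- as.numeric(hb < 12)  # 1 = anemia, 0 = no anemia, as in Table 8

fit <- glm(anemic ~ age, family = binomial)
summary(fit)                        # Wald Z test of the Age coefficient
exp(coef(fit)["age"])               # odds ratio, about 0.878
exp(confint.default(fit)["age", ])  # Wald 95% CI, about (0.776, 0.994)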

11. Testing of hypothesis and confidence intervals of logistic regression coefficient

We test the hypothesis that the sample comes from a population in which the logistic regression coefficient $\beta_1$ equals 0. That is, our null hypothesis is $H_0: \beta_1 = 0$ against the alternative hypothesis $H_1: \beta_1 \neq 0$.

Under the null hypothesis, for testing $H_0: \beta_1 = 0$, we define the test statistic

$Z = \dfrac{\hat\beta_1 - E(\hat\beta_1)}{SE(\hat\beta_1)} = \dfrac{\hat\beta_1 - \beta_1}{SE(\hat\beta_1)} = \dfrac{\hat\beta_1 - 0}{SE(\hat\beta_1)} = \dfrac{\hat\beta_1}{SE(\hat\beta_1)} = \dfrac{-0.130}{0.0629} = -2.06.$  (66)

We can obtain the 95% confidence interval of the odds ratio $e^{\beta_1}$ as

$e^{\hat\beta_1 \pm Z\, SE(\hat\beta_1)}$ with Z = 1.96; that is, the lower limit is $e^{\hat\beta_1 - 1.96\, SE(\hat\beta_1)}$ and the upper limit is $e^{\hat\beta_1 + 1.96\, SE(\hat\beta_1)}$.

For Example 9, the lower limit is $e^{-0.130 - 1.96 \times 0.0629} = 0.776$ and the upper limit is $e^{-0.130 + 1.96 \times 0.0629} = 0.993$.

12. Conclusions

The main objective of this chapter is to discuss simple linear regression and logistic regression analysis. We have explained how to estimate regression coefficients and their standard errors, how to test hypotheses about the regression coefficients, and how to obtain 95% confidence intervals for the simple linear regression and logistic regression models. All statistics calculated manually were verified using the RStudio package and Jamovi.

Acknowledgments

The author is thankful to the management committee members of IntechOpen, the referee, and the editor of this edited book for providing an opportunity to share this work.

References

  1. Montgomery DC, Peck EA, Vining GG. Introduction to Linear Regression Analysis. Wiley
  2. Rencher AC, Schaalje GB. Linear Models in Statistics. New Jersey: John Wiley; 2008
  3. Swaminathan S. Regression Detailed View. Published in Towards Data Science; 2018. Available from: https://scholar.google.com.vn/citations?view_op=view_citation&hl=vi&user=K8vtbzAAAAAJ&citation_for_view=K8vtbzAAAAAJ:d1gkVwhDpl0C
  4. Lane DM. Introduction to Linear Regression, Chapter 14 Regression. Available from: https://onlinestatbook.com/2/regression/intro.html
  5. Noce AA, McKeown L. A new benchmark for internet use: A logistic modeling of factors influencing internet use in Canada, 2005. Government Information Quarterly. 2008;25:462-476
  6. Seo D-C et al. Relations between physical activity and behavioral and perceptual correlates among Midwestern college students. Journal of American College Health. 2007;56:187-197
