
Perspective Chapter: Linear Regression and Logistic Regression Models

Written By

Dilip Kumar Ghosh

Submitted: 21 September 2023 Reviewed: 23 September 2023 Published: 22 May 2024

DOI: 10.5772/intechopen.1003183

From the Edited Volume

Recent Advances in Biostatistics

B. Santhosh Kumar


Abstract

In this chapter, we discuss the concepts of simple linear regression and logistic regression analysis in detail. We further describe the procedures for computing regression coefficients, standard errors, the t test, the Z test, p values, and 95% confidence intervals for simple linear regression and logistic regression analysis. We also explain that the simple linear regression coefficient is tested with a t test, whereas the logistic regression coefficient is tested with a Z test. Several examples on medical data are considered, and the related statistics are computed manually, with the RStudio package, and with Jamovi.

Keywords

  • regression model
  • logistic function
  • odds ratio
  • scatter diagram
  • regression coefficients
  • estimators
  • predicted value

1. Introduction

The method of linear regression is used to predict the value of one variable based on the value of another variable. The variable to be predicted is known as the dependent variable, whereas the variable used to predict it is known as the independent variable. The method may contain one or more explanatory variables: a model with only one explanatory variable is called simple linear regression; otherwise, it is a multiple linear regression model. The regression coefficients involved in the linear equation are estimated using the least squares method of estimation. Once the regression coefficients are estimated, you can fit a model that predicts the value of the dependent variable. Linear regression is used to establish a linear relation between the response and explanatory variables in the biological, behavioral, environmental and social sciences, business, etc. Montgomery et al. [1], Rencher and Schaalje [2], Swaminathan [3] and Lane [4] discussed regression analysis with several examples. Further, Noce and McKeown [5] discussed a logistic modeling of factors influencing internet use, while Seo et al. [6] discussed the relations between physical activity and behavioral and perceptual correlates.

Logistic regression is a statistical method used to establish a relationship between one dependent variable and one or more explanatory variables, where the dependent variable is dichotomous.


2. Simple linear regression

Suppose a random sample $(x_{i1}, x_{i2}, \ldots, x_{ik}, y_i)$, where $i = 1, 2, \ldots, n$, of size n is drawn from a population. The random variables $(x_1, x_2, \ldots, x_k)$ are generally known as predictor variables. However, depending upon the situation and from a practical point of view, these random variables are also known by different names: independent variables, covariates, regressors and explanatory variables. The variable y is called the response variable; sometimes y is also known as the dependent variable or outcome variable. Suppose we wish to establish a linear regression between the response variable y and the explanatory variables $(x_1, x_2, \ldots, x_k)$; then it can be represented by the model

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + e$  (1)

where $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are parameters known as regression coefficients, and e is a random error which is normally distributed with mean zero and variance $\sigma^2$.

For example, the effect of age, weight, height and walking habit on systolic blood pressure. This model is called a multiple regression model, as there is more than one predictor variable.

Suppose we are interested in a bivariate regression model, where Y is the response variable and X is the predictor variable. This model is called the simple linear regression model. It is represented by

$Y = \beta_0 + \beta_1 X + e$  (2)

where Y is the response variable, X is the predictor variable, $\beta_0$ is the intercept, $\beta_1$ is the regression coefficient, and e is a random error which is normally distributed with mean zero and variance $\sigma^2$. For example, the effect of age on systolic blood pressure.


3. Scatter diagram

A scatter diagram is a two-dimensional graph of the response variable (Y) against the predictor variable (X). A scatter diagram provides a rough idea about the relationship between the response and predictor variables. The following are the steps for drawing a scatter diagram:

  1. Select the horizontal axis (X) and vertical axis (Y).

  2. Take the response variable on the Y-axis and the predictor variable on the X-axis.

  3. Plot a point at the position corresponding to each pair (X, Y).

In the scatter diagram, if the observations are scattered approximately around a straight line, this shows a linear relationship between the response and predictor variables. Once the relationship is established, one can use the simple linear regression model to quantify the relationship between the variables. However, if the observations are not scattered around a straight line, there is no linear relationship between the two variables. In such a situation, one can use a transformation or a non-linear regression method to find the best fitted regression model.

Example 1: A sample of 15 men in the age group 30–70 was collected to investigate the effect of the patients' weight on the blood sugar level of diabetic patients. The data on the blood sugar level (mg/dl) and weight (in kg) of the 15 men are shown in Table 1.

S. no. | Blood sugar level (mg/dl) | Weight of the patient (kg)
1 | 146 | 50
2 | 145 | 48
3 | 141 | 46
4 | 168 | 69
5 | 132 | 40
6 | 190 | 80
7 | 180 | 70
8 | 130 | 38
9 | 181 | 75
10 | 148 | 59
11 | 110 | 30
12 | 147 | 54
13 | 146 | 51
14 | 120 | 35
15 | 155 | 65

Table 1.

Blood sugar level and weight of 15 men.

Draw the scatter diagram and give your interpretation about the data.

From Figure 1, it is clear that approximately all 15 observations are scattered around a straight line. Hence, there is a linear relationship between blood sugar level and weight of the person.

Figure 1.

Scatter diagram.
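The scatter diagram of Figure 1 can be reproduced with a few lines of R. This is a minimal sketch assuming the Table 1 data are typed in by hand; the variable names sugar and weight are our own.

# Blood sugar level (mg/dl) and weight (kg) of the 15 men in Table 1
sugar <- c(146, 145, 141, 168, 132, 190, 180, 130, 181, 148, 110, 147, 146, 120, 155)
weight <- c(50, 48, 46, 69, 40, 80, 70, 38, 75, 59, 30, 54, 51, 35, 65)

# Scatter diagram with the predictor on the X-axis and the response on the Y-axis
plot(weight, sugar, xlab = "Weight (kg)", ylab = "Blood sugar level (mg/dl)")
abline(lm(sugar ~ weight))  # overlay the least squares line for reference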

3.1 Assumptions underlying linear regression model

Before applying any statistical method, we should first study its underlying assumptions. The assumptions of the simple linear regression model are the following:

  1. The regression model is assumed to be linear in parameters.

  2. The error term e is assumed to be normally distributed with mean 0 and variance $\sigma^2$, i.e., $e \sim N(0, \sigma^2)$.

3.2 Estimation of parameters

For regression model (2), $\beta_0$ and $\beta_1$ are unknown constant parameters, which can be estimated using the least squares method of estimation. Model (2) can be rewritten as

$y_i = \beta_0 + \beta_1 x_i + e_i, \quad i = 1, 2, \ldots, n.$  (3)

$E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$  (4)

On differentiating the error sum of squares (E) with respect to β0 and β1 and then equating them to zero, we can obtain the least squares estimator of β0 and β1.

$\dfrac{\partial E}{\partial \beta_0} = -2\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i) = 0$  (5)

$\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i) = 0$  (6)

So, $\sum_{i=1}^{n} y_i = n\beta_0 + \beta_1 \sum_{i=1}^{n} x_i$  (7)

Similarly, $\dfrac{\partial E}{\partial \beta_1} = -2\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)x_i = 0$  (8)

$\sum_{i=1}^{n}(y_i x_i - \beta_0 x_i - \beta_1 x_i^2) = 0$  (9)

So, $\sum_{i=1}^{n} y_i x_i = \beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2$  (10)

On solving (7) and (10), we get

$\hat\beta_1 \left( n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 \right) = n\sum_{i=1}^{n} y_i x_i - \sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i$  (11)

Hence, $\hat\beta_1 = \dfrac{n\sum_{i=1}^{n} y_i x_i - \sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$  (12)

Or, $\hat\beta_1 = \dfrac{\sum_{i=1}^{n} y_i x_i - n\bar y \bar x}{\sum_{i=1}^{n} x_i^2 - n\bar x^2}$  (13)

Eq. (12) can easily be written as.

$\hat\beta_1 = \dfrac{\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n}(x_i - \bar x)^2}$  (14)

On dividing the numerator and denominator of the above equation by n, we have,

$\hat\beta_1 = \dfrac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}$  (15)

From (7), we have $n\bar y = n\hat\beta_0 + n\hat\beta_1 \bar x$.

So, $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$  (16)

Here, $\hat\beta_0$ and $\hat\beta_1$ are the least squares estimators of the intercept $\beta_0$ and the slope $\beta_1$. In a regression model, the slope is called the regression coefficient; henceforth, $\beta_1$ will be called the regression coefficient. Thus, for the linear regression method, the fitted regression model is given by

$\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$  (17)
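As an illustration, the closed-form estimators (12) and (16) can be computed directly in R. This is a sketch using the blood pressure data of Example 2 below; the names beta0 and beta1 are ours.

y <- c(125, 123, 120, 181, 105, 190, 185, 118, 175, 168, 110, 130, 128, 116, 179)
x <- c(50, 48, 46, 73, 40, 80, 75, 45, 74, 69, 43, 54, 51, 44, 72)
n <- length(y)

# Slope from Eq. (12) and intercept from Eq. (16)
beta1 <- (n * sum(x * y) - sum(y) * sum(x)) / (n * sum(x^2) - sum(x)^2)
beta0 <- mean(y) - beta1 * mean(x)
c(beta0, beta1)  # 17.037634 and 2.196106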

In the matrix notation, the linear model can be written as

$Y = X\beta + e$  (18)

where Y is an n × 1 vector of observations, X is an n × 2 matrix, β is a 2 × 1 vector of parameters, and e is an n × 1 vector of random errors.

Using the least squares method of estimation, the normal equation is obtained as

$X'Y = X'X\beta$  (19)

On multiplying both sides of (19) by $(X'X)^{-1}$, we have

$(X'X)^{-1}X'Y = (X'X)^{-1}X'X\beta$  (20)

Hence, $\hat\beta = (X'X)^{-1}X'Y$  (21)

where $\hat\beta$ is the estimate of the regression coefficient vector β.

In that case, the fitted regression model is given by

$\hat Y = \hat\beta_0 + \hat\beta_1 X$  (22)

Example 2: A sample of 15 men in the age group 30–70 was collected to investigate the effect of weight (in kg) on the blood pressure level of diabetic patients. The data on the blood pressure level (mm/Hg) and weight (in kg) of the 15 men are shown in Table 2; the corresponding computations are shown in Table 3.

S. no. | Blood pressure level (mm/Hg) | Weight (kg)
1 | 125 | 50
2 | 123 | 48
3 | 120 | 46
4 | 181 | 73
5 | 105 | 40
6 | 190 | 80
7 | 185 | 75
8 | 118 | 45
9 | 175 | 74
10 | 168 | 69
11 | 110 | 43
12 | 130 | 54
13 | 128 | 51
14 | 116 | 44
15 | 179 | 72

Table 2.

Blood pressure level and weight of 15 men.

y | x | y² | x² | xy | Predicted value of y | Residual | Squared residual
125 | 50 | 15625 | 2500 | 6250 | 126.8429 | −1.8429 | 3.396390985
123 | 48 | 15129 | 2304 | 5904 | 122.4507 | 0.549282 | 0.301710716
120 | 46 | 14400 | 2116 | 5520 | 118.0585 | 1.941494 | 3.769398952
181 | 73 | 32761 | 5329 | 13213 | 177.3534 | 3.646632 | 13.29792494
105 | 40 | 11025 | 1600 | 4200 | 104.8819 | 0.11813 | 0.013954697
190 | 80 | 36100 | 6400 | 15200 | 192.7261 | −2.72611 | 7.431675732
185 | 75 | 34225 | 5625 | 13875 | 181.7456 | 3.25442 | 10.59124954
118 | 45 | 13924 | 2025 | 5310 | 115.8624 | 2.1376 | 4.56933376
175 | 74 | 30625 | 5476 | 12950 | 179.5495 | −4.54947 | 20.69771368
168 | 69 | 28224 | 4761 | 11592 | 168.5689 | −0.56894 | 0.323697275
110 | 43 | 12100 | 1849 | 4730 | 111.4702 | −1.47019 | 2.161452755
130 | 54 | 16900 | 2916 | 7020 | 135.6274 | −5.62735 | 31.66711304
128 | 51 | 16384 | 2601 | 6528 | 129.039 | −1.03904 | 1.079595809
116 | 44 | 13456 | 1936 | 5104 | 113.6663 | 2.333706 | 5.446183694
179 | 72 | 32041 | 5184 | 12888 | 175.1573 | 3.842738 | 14.76663534
Sum: 2153 | 864 | 322919 | 52622 | 130284 | 2153 | | 119.5140309

Table 3.

Predicted values and residuals.

where residual = (y − predicted value of y).

Regression coefficient: $\hat\beta_1 = \dfrac{n\sum y_i x_i - \sum y_i \sum x_i}{n\sum x_i^2 - (\sum x_i)^2} = \dfrac{15 \times 130284 - 2153 \times 864}{15 \times 52622 - 864^2} = \dfrac{1954260 - 1860192}{789330 - 746496} = \dfrac{94068}{42834}$  (23)

Hence, $\hat\beta_1 = 2.196106$.

$\hat\beta_0 = \bar y - \hat\beta_1 \bar x = 143.5333 - 2.196106 \times 57.6 = 17.03763.$

The fitted regression model is given by

$\hat Y = 17.03763 + 2.196106\,X$  (24)

r = 0.996 and R2 = 0.991.

From the model coefficients (Table 4), it is obvious that the regression coefficient of the predictor weight is highly significant, as p < 0.001. Also, R² is very close to 1. This indicates that as the weight of the patient increases, blood pressure increases; that is, when weight is kept under control, blood pressure stays normal or near normal. In this example, when the weight ranges from 40 to 54 kg, the blood pressure ranges from 105 to 130 mm/Hg, whereas when the weight ranges from 69 to 80 kg, the blood pressure ranges from 168 to 190 mm/Hg. Thus, the higher the weight, the higher the blood pressure.

Predictor | Estimate | SE | t | P
Intercept | 17.03763 | 3.3607 | 5.07 | <0.001
Weight (β1) | 2.196106 | 0.0567 | 38.70 | <0.001

Table 4.

Model coefficients.

3.3 Regression coefficient using R studio package

> y <- c(125, 123, 120, 181, 105, 190, 185, 118, 175, 168, 110, 130, 128, 116, 179)
> x <- c(50, 48, 46, 73, 40, 80, 75, 45, 74, 69, 43, 54, 51, 44, 72)
> result <- data.frame(y, x)
> z <- lm(y ~ x, data = result)
> summary(z)


4. Forecasting or predicted value of Y

Using the fitted model (24), we can easily forecast the blood pressure level corresponding to a given weight. Suppose the weight of a patient is 82 kg; then the predicted blood pressure is $\hat Y = 17.03763 + 2.196106 \times 82 = 197.1183$.
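In R, the same forecast can be obtained with predict(), assuming z is the fitted object from Section 3.3; the new weight is supplied as a one-row data frame.

# Assuming z <- lm(y ~ x, data = result) from Section 3.3
predict(z, newdata = data.frame(x = 82))  # about 197.12 mm/Hg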

Residuals: We can obtain the residuals of all 15 patients using $(Y - \hat Y)$. These values are shown in Table 3.

R squared: We can compute R² using the residual sum of squares from the expression

$R^2 = 1 - \dfrac{\text{residual sum of squares}}{\text{total sum of squares of } y}$  (25)

where residual sum of squares $= \sum (Y - \hat Y)^2 = 119.5140309$, as shown in Table 3.

Total sum of squares of y $= \sum_{i=1}^{n} y_i^2 - \dfrac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = 322919 - \dfrac{2153^2}{15} = 13891.73333.$

So, $R^2 = 1 - \dfrac{119.5140309}{13891.73333} = 0.9914.$
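A quick R check of this computation, again assuming z is the fitted model from Section 3.3:

rss <- sum(residuals(z)^2)   # residual sum of squares, 119.514
tss <- sum((y - mean(y))^2)  # total sum of squares of y, 13891.73
1 - rss / tss                # 0.9914, matching summary(z)$r.squared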

We can also obtain the regression coefficients using the matrix form as follows:

$X' = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 50 & 48 & 46 & 73 & 40 & 80 & 75 & 45 & 74 & 69 & 43 & 54 & 51 & 44 & 72 \end{pmatrix}$

$X'X = \begin{pmatrix} 15 & 864 \\ 864 & 52622 \end{pmatrix}, \quad X'Y = \begin{pmatrix} 2153 \\ 130284 \end{pmatrix}, \quad (X'X)^{-1} = \begin{pmatrix} 1.22851006 & -0.0201708923 \\ -0.0201708923 & 0.0003501891 \end{pmatrix}$

$\hat\beta = (X'X)^{-1}X'Y = \begin{pmatrix} 17.037634 \\ 2.196106 \end{pmatrix}$, so $\hat\beta_0 = 17.037634$ and $\hat\beta_1 = 2.196106$.

4.1 R studio program to obtain the estimate of regression coefficients

> X <- matrix(c(rep(1, 15), 50, 48, 46, 73, 40, 80, 75, 45, 74, 69, 43, 54, 51, 44, 72), ncol = 2)
> Y <- c(125, 123, 120, 181, 105, 190, 185, 118, 175, 168, 110, 130, 128, 116, 179)
> XtX <- t(X) %*% X           # X'X
> XtY <- t(X) %*% Y           # X'Y
> beta <- solve(XtX) %*% XtY  # (X'X)^(-1) X'Y
> beta


5. Regression lines

Let the regression model be denoted by Y = a + bX; then the two types of regression lines are the following:

  1. Regression line of Y on X, with slope denoted by $b_{YX}$

  2. Regression line of X on Y, with slope denoted by $b_{XY}$

$b_{YX} = \dfrac{n\sum_{i=1}^{n} y_i x_i - \sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} = 2.196106$  (26)

The regression line of X on Y has slope $b_{XY} = \dfrac{n\sum_{i=1}^{n} y_i x_i - \sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i}{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2} = 0.4514$  (27)
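Both slopes share the same numerator and can be computed together in R; a sketch reusing the x and y vectors of Example 2:

n <- length(y)
Sxy <- n * sum(x * y) - sum(x) * sum(y)  # common numerator of (26) and (27)
bYX <- Sxy / (n * sum(x^2) - sum(x)^2)   # 2.196106
bXY <- Sxy / (n * sum(y^2) - sum(y)^2)   # 0.4514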

6. Expectation and variance of estimators

$E(\hat\beta_0) = \beta_0$ and $E(\hat\beta_1) = \beta_1$  (28)

This shows that the least squares estimators $\hat\beta_0$ and $\hat\beta_1$ are unbiased estimators of $\beta_0$ and $\beta_1$, respectively.

$V(\hat\beta_0) = \sigma^2\left(\dfrac{1}{n} + \dfrac{\bar x^2}{\text{sum of squares of } x}\right)$  (29)

and $V(\hat\beta_1) = (X'X)^{-1}\sigma^2 = \dfrac{\sigma^2}{\text{sum of squares of } x}.$  (30)

Here, $\sigma^2$ is unknown. It is estimated from the given data as

$\hat\sigma^2 = \dfrac{\sum (y_i - \hat y_i)^2}{n-2}$, and the sum of squares of x is obtained as

Sum of squares of x $= \sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$, where n is the number of observations.

Thus, when $\sigma^2$ is unknown, $V(\hat\beta_0)$ and $V(\hat\beta_1)$ are determined as

$V(\hat\beta_0) = \hat\sigma^2\left(\dfrac{1}{n} + \dfrac{\bar x^2}{\text{sum of squares of } x}\right)$ and  (31)

$V(\hat\beta_1) = \dfrac{\hat\sigma^2}{\text{sum of squares of } x}.$  (32)

Again, when $\sigma^2$ is unknown, the standard errors of $\hat\beta_0$ and $\hat\beta_1$ are determined as

$SE(\hat\beta_0) = \sqrt{V(\hat\beta_0)} = \sqrt{\hat\sigma^2\left(\dfrac{1}{n} + \dfrac{\bar x^2}{\text{sum of squares of } x}\right)}$ and  (33)

$SE(\hat\beta_1) = \sqrt{V(\hat\beta_1)} = \sqrt{\dfrac{\hat\sigma^2}{\text{sum of squares of } x}}.$  (34)

Example 3: Consider example 2. Compute the estimates of $V(\hat\beta_0)$ and $V(\hat\beta_1)$ and their standard errors.

Solution: From example 2, we have $\hat\beta_0 = 17.037634$ and $\hat\beta_1 = 2.196106$. Sum of squares of x $= \sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = 52622 - \dfrac{864^2}{15} = 52622 - 49766.4 = 2855.6.$

$\hat\sigma^2 = \dfrac{\sum_{i=1}^{n} (y_i - \hat y_i)^2}{n-2} = \dfrac{119.5140}{13} = 9.1934.$

$\bar x = \dfrac{\sum_{i=1}^{n} x_i}{n} = \dfrac{864}{15} = 57.6.$

Estimates of $V(\hat\beta_0)$ and $V(\hat\beta_1)$ can be determined from

$V(\hat\beta_0) = \hat\sigma^2\left(\dfrac{1}{n} + \dfrac{\bar x^2}{\text{sum of squares of } x}\right) = 9.1934\left(\dfrac{1}{15} + \dfrac{57.6^2}{2855.6}\right)$  (35)

So, $V(\hat\beta_0) = 11.2941844.$

$V(\hat\beta_1) = \dfrac{\hat\sigma^2}{\text{sum of squares of } x} = \dfrac{9.1934}{2855.6} = 0.003219428$  (36)

$SE(\hat\beta_0) = \sqrt{V(\hat\beta_0)} = \sqrt{11.2941844} = 3.360682133,$

$SE(\hat\beta_1) = \sqrt{V(\hat\beta_1)} = \sqrt{0.003219428} = 0.056740004.$
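These computations can be verified in R; a minimal sketch using the Example 2 data already stored in x and y:

fit <- lm(y ~ x)
sigma2 <- sum(residuals(fit)^2) / (length(y) - 2)  # estimate of sigma^2, 9.1934
Sxx <- sum(x^2) - sum(x)^2 / length(x)             # sum of squares of x, 2855.6
V_b0 <- sigma2 * (1 / length(x) + mean(x)^2 / Sxx) # Eq. (31)
V_b1 <- sigma2 / Sxx                               # Eq. (32)
sqrt(c(V_b0, V_b1))                                # standard errors 3.3607 and 0.05674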

7. Testing of hypothesis for the estimated regression coefficients $\hat\beta_0$ and $\hat\beta_1$

We test the hypothesis that the sample comes from a population in which the value of $\beta_0$ is equal to 0. That is,

our null hypothesis is $H_0: \beta_0 = 0$

against the alternative hypothesis $H_1: \beta_0 \neq 0$.

Under the null hypothesis, for testing $H_0: \beta_0 = 0$, we define the test statistic

$t = \dfrac{\hat\beta_0 - E(\hat\beta_0)}{SE(\hat\beta_0)} = \dfrac{\hat\beta_0 - \beta_0}{SE(\hat\beta_0)} = \dfrac{\hat\beta_0 - 0}{SE(\hat\beta_0)}$  (37)

$= \dfrac{\hat\beta_0}{\sqrt{\hat\sigma^2\left(\frac{1}{n} + \frac{\bar x^2}{\text{sum of squares of } x}\right)}}$, where $\sigma^2$ is unknown.  (38)

Similarly, we test the hypothesis that the sample comes from a population in which the value of $\beta_1$ is equal to 0. That is,

our null hypothesis is $H_0: \beta_1 = 0$

against the alternative hypothesis $H_1: \beta_1 \neq 0$.

Thus, under the null hypothesis, for testing $H_0: \beta_1 = 0$, we define the test statistic

$t = \dfrac{\hat\beta_1 - E(\hat\beta_1)}{SE(\hat\beta_1)} = \dfrac{\hat\beta_1 - \beta_1}{SE(\hat\beta_1)} = \dfrac{\hat\beta_1 - 0}{SE(\hat\beta_1)} = \dfrac{\hat\beta_1}{\sqrt{\hat\sigma^2 / \text{sum of squares of } x}}$  (39)

where $\sigma^2$ is unknown, and the statistic t follows Student's t distribution with (n − 2) degrees of freedom.

Example 4: Consider the data given in example 2 on the effect of weight on the blood pressure of 15 patients. Test the null hypotheses for the significance of the regression coefficients $\hat\beta_0$ and $\hat\beta_1$ at the 5% level of significance.

Solution: From example 2, we have,

$\hat\beta_0 = 17.037634$, $SE(\hat\beta_0) = 3.360682133$, and

$\hat\beta_1 = 2.196106$, $SE(\hat\beta_1) = 0.056740004$.

Under the null hypothesis, for testing H0:β0=0, we have,

$t = \dfrac{\hat\beta_0 - E(\hat\beta_0)}{SE(\hat\beta_0)} = \dfrac{\hat\beta_0 - \beta_0}{SE(\hat\beta_0)} = \dfrac{\hat\beta_0 - 0}{SE(\hat\beta_0)}$  (40)

where $\sigma^2$ is unknown.

$t = \dfrac{17.037634}{3.360682133} = 5.069695.$

Under the null hypothesis, for testing H0:β1=0, we have

$t = \dfrac{\hat\beta_1 - E(\hat\beta_1)}{SE(\hat\beta_1)} = \dfrac{\hat\beta_1 - \beta_1}{SE(\hat\beta_1)} = \dfrac{\hat\beta_1 - 0}{SE(\hat\beta_1)}$, where $\sigma^2$ is unknown.  (41)

$t = \dfrac{2.196106}{0.056740004} = 38.70472.$

With α = 0.05 for a two-sided test, the tabulated value of t at 13 degrees of freedom is 2.160. In the case of $H_0: \beta_0 = 0$, the calculated value of t (= 5.069695) is greater than the tabulated value of t, so the test is significant. That is, we reject the null hypothesis. Hence, we may conclude that the value of $\beta_0$ is not equal to zero.

Similarly, in the case of $H_0: \beta_1 = 0$, the calculated value of t (= 38.70472) is greater than the tabulated value of t, so the test is highly significant. That is, we reject the null hypothesis. Hence, we may conclude that the value of $\beta_1$ is not equal to zero. Thus, the fitted simple regression model is highly significant. In other words, as the weight of the patient increases, blood pressure tends to increase.
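The two t tests can be reproduced in R, with two-sided p values in place of the tabulated cut-off; a sketch using the estimates and standard errors above:

t0 <- 17.037634 / 3.360682133  # 5.069695
t1 <- 2.196106 / 0.056740004   # 38.70472
2 * pt(-abs(c(t0, t1)), df = 13)  # both two-sided p values are below 0.001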

Alternatively, we can also test the significance of regression coefficient β1 using the analysis of variance Table 5.

Sources of variation | Degrees of freedom | Sum of squares | Mean squares | F-ratio | P-value
Regression | 1 | $\sum(\hat y_i - \bar y)^2$ | SSR/df = MSR | MSR/MSE |
Error | n − 2 | $\sum(y_i - \hat y_i)^2$ | SSE/df = MSE | |
Total | n − 1 | $\sum(y_i - \bar y)^2$ | | |

Table 5.

Analysis of variance.

Example 5: Consider the data given in example 2 on the effect of weight on the blood pressure of 15 patients. Test the null hypothesis for the significance of the regression coefficient $\hat\beta_1$ at the 5% level of significance using the analysis of variance layout of Table 5.

Using Table 3, we have Table 6.

y | x | Predicted value of y ($\hat y_i$) | $(\hat y_i - \bar y)^2$ | $(y_i - \hat y_i)^2$ | $(y_i - \bar y)^2$
125 | 50 | 126.8429 | 278.56845 | 3.396390985 | 343.4832089
123 | 48 | 122.4507 | 444.475264 | 0.301710716 | 421.6164089
120 | 46 | 118.0585 | 648.965129 | 3.769398952 | 553.8162089
181 | 73 | 177.3534 | 1143.797 | 13.29792494 | 1403.753609
105 | 40 | 104.8819 | 1493.93304 | 0.013954697 | 1484.815209
190 | 80 | 192.7261 | 2419.93256 | 7.431675732 | 2159.154209
185 | 75 | 181.7456 | 1460.17834 | 10.59124954 | 1719.487209
118 | 45 | 115.8624 | 765.678707 | 4.56933376 | 651.9494089
175 | 74 | 179.5495 | 1297.16479 | 20.69771368 | 990.1532089
168 | 69 | 168.5689 | 626.78347 | 0.323697275 | 598.6194089
110 | 43 | 111.4702 | 1028.04315 | 2.161452755 | 1124.482209
130 | 54 | 135.6274 | 62.50398223 | 31.66711304 | 183.1502089
128 | 51 | 129.039 | 210.083689 | 1.079595809 | 241.2834089
116 | 44 | 113.6663 | 892.038047 | 5.446183694 | 758.0826089
179 | 72 | 175.1573 | 1000.07497 | 14.76663534 | 1257.886809
Sum: 2153 | 864 | 2153 | 13772.2206 | 119.5140309 | 13891.73333

Table 6.

Regression, error and total sum of squares.

Using Table 6, we can obtain the ANOVA table as shown in Table 7.

Sources of variation | Degrees of freedom | Sum of squares | Mean squares | F-ratio | P-value
Regression | 1 | 13772.2206 | 13772.2206 | 1498.057 | <0.001
Error | 13 | 119.5140309 | 9.193387 | |
Total | 14 | 13891.73333 | | |

Table 7.

Analysis of variance.

From Table 7, we observe that the p value for the regression coefficient is less than 0.001 and the calculated value of F is very large; it is greater than the tabulated value of F with (1, 13) degrees of freedom at the 5% level of significance, which is 4.67. This shows that the test is significant, and hence we reject the null hypothesis.

7.1 R studio program for obtaining ANOVA table

> y <- c(125, 123, 120, 181, 105, 190, 185, 118, 175, 168, 110, 130, 128, 116, 179)
> x <- c(50, 48, 46, 73, 40, 80, 75, 45, 74, 69, 43, 54, 51, 44, 72)
> result <- data.frame(y, x)
> av <- aov(y ~ x, data = result)
> summary(av)


8. Confidence interval of estimated regression coefficients

Now we discuss how to obtain a 100(1 − α)% confidence interval for the estimated regression coefficients. In fact, we are interested in the confidence interval for the regression coefficient $\beta_1$ only. We generally compute a confidence interval to determine a range: in the case of $\beta_1$, we wish to determine its lower and upper limits. We can obtain the 100(1 − α)% confidence interval for $\beta_1$ as

Confidence interval for $\beta_1$ = $\hat\beta_1 \pm t_{n-2,\,\alpha/2}\, SE(\hat\beta_1)$ for a two-tailed test.

Example 6: Consider the data given in example 2 on the effect of weight on the blood pressure of 15 patients. Obtain the 95% confidence interval for the regression coefficient $\hat\beta_1$.

Solution: For this data from example 2, we have,

$\hat\beta_1 = 2.196106$, $SE(\hat\beta_1) = 0.056740004$, and $t_{n-2,\,\alpha/2} = 2.160$.

Confidence interval for $\beta_1$ = $2.196106 \pm 2.160 \times 0.056740004 = 2.196106 \pm 0.122558409$.

So, lower confidence limit of β1 is 2.196106 − 0.122558409 = 2.073547489.

Upper confidence limit of β1 is 2.196106 + 0.122558409 = 2.318664306.

Thus, the confidence interval for $\beta_1$ ranges from 2.073547489 to 2.318664306.

If one is interested in determining the confidence limits of $\beta_0$, the same procedure can be used.
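In R, confint() applied to the fitted object z of Section 3.3 reproduces these limits directly:

confint(z, level = 0.95)  # the row for x gives (2.0735, 2.3187)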


9. Logistic regression

Logistic regression is used to obtain odds ratios in the presence of more than one explanatory variable. The procedure is quite similar to multiple linear regression, with the exception that the response variable is binary. The result is the impact of each variable on the odds ratio of the observed event of interest.

We can also say that logistic regression is used for predicting binary outcomes on the basis of one or more predictor variables. The concept of logistic regression is similar to that of ordinary multiple linear regression: we wish to fit a best model that determines the relationship between a response variable and one or more explanatory variables. As in ordinary linear regression, the model is linear with respect to the regression parameters. The only difference between the two regressions is the following: in logistic regression the response variable is binary (also called dichotomous), whereas in ordinary linear regression it is continuous. Logistic regression can thus be viewed as a predictive algorithm in which explanatory variables are used to predict the dependent variable, just as in linear regression, with the simple difference that the dependent variable in logistic regression is a categorical variable.


10. Logistic function

To understand logistic regression, we must first determine the logistic function.

Let us consider the equation of the best fit model in the simple linear regression as

$y = \beta_0 + \beta_1 x$  (42)

where y is the response variable and x is the explanatory variable.

Let us replace y by probability P which is given as

$P = \beta_0 + \beta_1 x$  (43)

In (43), the value of P may be negative in some cases and greater than one in others. However, P must range from 0 to 1 only: a contradiction. To overcome this problem, we can take the odds of P instead of the probability. The odds are defined as odds = $\dfrac{P}{1-P}$, that is, the ratio of the probability of success to the probability of failure.

So, from (43), we have

$\dfrac{P}{1-P} = \beta_0 + \beta_1 x.$  (44)

As we are aware, the odds are always positive, ranging from 0 to infinity. Again, to overcome this problem, we take the log transformation, which ranges from −infinity to +infinity.

$\log\left(\dfrac{P}{1-P}\right) = \beta_0 + \beta_1 x$  (45)

On taking exponential on both sides of (45), we have

$\exp\left(\log\dfrac{P}{1-P}\right) = \exp(\beta_0 + \beta_1 x)$  (46)

That is,

$\dfrac{P}{1-P} = e^{\beta_0 + \beta_1 x}$  (47)

$P = (1 - P)\,e^{\beta_0 + \beta_1 x}$

$P + P e^{\beta_0 + \beta_1 x} = e^{\beta_0 + \beta_1 x}$

$P\left(1 + e^{\beta_0 + \beta_1 x}\right) = e^{\beta_0 + \beta_1 x}$

So, $P = \dfrac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$  (48)

On dividing the numerator and denominator of (48) by eβ0+β1x, we have

$P = \dfrac{e^{\beta_0+\beta_1 x}/e^{\beta_0+\beta_1 x}}{\left(1 + e^{\beta_0+\beta_1 x}\right)/e^{\beta_0+\beta_1 x}} = \dfrac{1}{e^{-(\beta_0+\beta_1 x)} + 1}$  (49)

Thus, $P = \dfrac{1}{1 + e^{-(\beta_0+\beta_1 x)}}$  (50)

We call (50) the logistic function.

If we consider only one explanatory variable, then the graph of simple linear regression is a straight line, whereas the graph of logistic regression is S-shaped. This is shown in Figure 2.

Figure 2.

Shape of linear and logistic regression.
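The S shape can be sketched in R with the built-in logistic distribution function plogis(t) = 1/(1 + e^(−t)); the values β0 = 0 and β1 = 1 below are illustrative choices of ours.

beta0 <- 0; beta1 <- 1  # illustrative parameter values
x <- seq(-6, 6, by = 0.1)
plot(x, plogis(beta0 + beta1 * x), type = "l",
     xlab = "x", ylab = "P")  # S-shaped logistic curve of Eq. (50)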


Example 7: A random sample of 300 women was selected, and each woman was asked whether or not she suffers from cancer. It was found that 225 women responded yes. The number of yes responses out of a sample of size n follows a binomial distribution with parameters n and p. Obtain the odds.

Solution: Out of 300 women, 225 responded yes for cancer. So the sample proportion is

$\hat P = \dfrac{225}{300} = 0.75.$

The sample proportion cannot be used directly in logistic regression; hence, we need the odds, the ratio of the proportions for the two outcomes. One outcome is "yes" and the other is "no". The proportion of yes is 0.75; hence, the proportion of no is $1 - \hat P$ = 1 − 0.75 = 0.25.

Odds of yes to no for a woman having cancer $= \dfrac{\hat P}{1-\hat P} = \dfrac{0.75}{0.25} = 3.$

Hence, the odds are 3 to 1 that a woman answers yes rather than no. Similarly, the odds of no to yes are 1 to 3.

Example 8: The sample proportion of women who were detected as cancer patient is 65%, whereas the sample proportion of men detected as cancer patient is 45%.

In this sample of young adults, it can be observed that the sample proportion of women detected as cancer patients is 20% higher than the sample proportion of men. We now wish to analyze these data using logistic regression. In this example, the predictor variable is sex, which is a categorical variable, so we need a numeric code. A convenient choice is an indicator of whether the adult is a woman. The indicator function is defined as

$x = \begin{cases} 1, & \text{if the person is a woman} \\ 0, & \text{if the person is a man} \end{cases}$  (51)

Since the response is given as a proportion, we transform it into odds. There are two odds: one for women and the other for men.

Odds for women: $\dfrac{\hat P}{1-\hat P} = \dfrac{0.65}{1-0.65} = \dfrac{0.65}{0.35} = 1.8571.$

Similarly, odds for men: $\dfrac{\hat P}{1-\hat P} = \dfrac{0.45}{1-0.45} = \dfrac{0.45}{0.55} = 0.8182.$

Now we can build the logistic regression model by considering log(odds) as the linear function of the explanatory variable. Hence, logistics model is defined as

$\log\left(\dfrac{\hat P}{1-\hat P}\right) = \beta_0 + \beta_1 x,$  (52)

where x is the explanatory variable, P is the binomial proportion, and $\beta_0$, $\beta_1$ are the parameters of the logistic regression model.

Here, there are only two values of x, and hence we write two equations: one for women and the other for men.

For women: $\log\left(\dfrac{\hat P}{1-\hat P}\right) = \beta_0 + \beta_1 \times 1$  (53)

For men: $\log\left(\dfrac{\hat P}{1-\hat P}\right) = \beta_0 + \beta_1 \times 0$  (54)

The coefficient $\beta_1$ appears in the equation for women because x = 1, and is absent from the equation for men because x = 0.

Therefore, the logistic regression equations for women and men are the following:

$\log(1.8571) = \beta_0 + \beta_1$  (55)

$\log(0.8182) = \beta_0$

$\beta_0 + \beta_1 = 0.6190$  (56)

$\beta_0 = -0.20065$  (57)

On solving (56) and (57), we have

$-0.20065 + \beta_1 = 0.6190$  (58)

Hence, $\beta_1 = 0.6190 + 0.20065 = 0.81965$.

Now the fitted logistic regression model is given by

$\log(\text{odds}_{\text{women}}) = \beta_0 + \beta_1$  (59)

So, $\text{odds}_{\text{women}} = e^{\beta_0 + \beta_1}$  (60)

Similarly, $\text{odds}_{\text{men}} = e^{\beta_0}$  (61)

$\dfrac{\text{odds}_{\text{women}}}{\text{odds}_{\text{men}}} = \dfrac{e^{\beta_0 + \beta_1}}{e^{\beta_0}} = e^{\beta_1} = e^{0.81965} = 2.269705.$

$\text{odds}_{\text{women}} = 2.269705 \times \text{odds}_{\text{men}}.$

That is, we can say that the odds for women are 2.269705 times the odds for men.
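The arithmetic of Eqs. (55)–(61) takes only a few lines of R; the variable names are ours.

odds_women <- 0.65 / 0.35        # 1.8571
odds_men   <- 0.45 / 0.55        # 0.8182
beta0 <- log(odds_men)           # -0.20065, Eq. (57)
beta1 <- log(odds_women) - beta0 #  0.81965, from Eq. (56)
exp(beta1)                       #  2.269705, the odds ratio of women to men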

Note that if we instead define the indicator function as

$x = \begin{cases} 0, & \text{if the person is a woman} \\ 1, & \text{if the person is a man} \end{cases}$  (62)

then the sign of $\beta_1$ will be negative. That is,

$\log(\text{odds}_{\text{women}}) = \beta_0.$ So, $\text{odds}_{\text{women}} = e^{\beta_0}$  (63)

Similarly, $\text{odds}_{\text{men}} = e^{\beta_0 + \beta_1}$  (64)

$\dfrac{\text{odds}_{\text{women}}}{\text{odds}_{\text{men}}} = \dfrac{e^{\beta_0}}{e^{\beta_0 + \beta_1}} = e^{-\beta_1} = e^{-0.81965} = 0.440586$

$\text{odds}_{\text{women}} = 0.440586 \times \text{odds}_{\text{men}}.$  (65)

Therefore, we can say the odds for women are 0.440586 times the odds for men.

Example 9: The hemoglobin content of 20 patients, together with their ages, was collected at a hospital to study the relationship between hemoglobin and age. The collected observations are shown in Table 8.

Hb (g/dl) | Age (years) | Anemic (1 = yes, 0 = no)
11.2 | 15 | 1
11.3 | 21 | 1
11.5 | 23 | 1
16.3 | 25 | 0
16.5 | 26 | 0
10.1 | 28 | 1
9.9 | 30 | 1
17.1 | 32 | 0
17.2 | 34 | 0
17.9 | 36 | 0
10.1 | 38 | 1
11.6 | 40 | 1
18.3 | 43 | 0
18.6 | 46 | 0
18.9 | 54 | 0
19.2 | 56 | 0
19.6 | 58 | 0
19.9 | 60 | 0
16.9 | 62 | 0
17.2 | 69 | 0

Table 8.

Level of hemoglobin and its corresponding age.

If we use simple linear regression to find the effect of age on the response variable hemoglobin, we obtain the following statistics using software (Table 9).

Predictor | Estimate | SE | t | P
Intercept | 9.382 | 1.7875 | 5.25 | <0.001
Age | 0.153 | 0.0420 | 3.64 | 0.002

Table 9.

Model coefficients.

Here the regression coefficient is significant; that is, there is a significant effect of age on hemoglobin.

As we are aware, the amount of hemoglobin in whole blood is expressed in grams per deciliter (g/dl). The normal Hb level for males is 14 to 18 g/dl, and for females 12 to 16 g/dl. When the hemoglobin level is low, the patient has anemia.

If we are interested in knowing whether a patient suffers from anemia, then we have to use the logistic regression method. For this, we transform the hemoglobin data into presence or absence of anemia. Since the data belong to women patients, a value less than 12 is coded 1 (the woman has anemia), while a value of 12 or more is coded 0 (no anemia). This is shown in column 3 of Table 8. We now fit a logistic regression between presence/absence of anemia and age using the software Jamovi. The following statistics are obtained (Table 10).

Predictor | Estimate | SE | Z | Odds ratio | P | 95% CI (lower) | 95% CI (upper)
Intercept | 3.993 | 2.1239 | 1.88 | 54.216 | 0.060 | 0.844 | 3483.577
Age | −0.130 | 0.0629 | −2.06 | 0.878 | 0.039 | 0.776 | 0.994

Table 10.

Model coefficients and 95% confidence intervals.

Since the p value of the regression coefficient (Age) is less than 0.05, the test is significant. That is, as age increases, the chance of anemia decreases.
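The same model can be fitted in R with glm(); a minimal sketch assuming the Table 8 data are entered by hand. It should reproduce the Jamovi estimates of Table 10 up to rounding; confint.default() gives the Wald interval used there.

hb <- c(11.2, 11.3, 11.5, 16.3, 16.5, 10.1, 9.9, 17.1, 17.2, 17.9,
        10.1, 11.6, 18.3, 18.6, 18.9, 19.2, 19.6, 19.9, 16.9, 17.2)
age <- c(15, 21, 23, 25, 26, 28, 30, 32, 34, 36, 38, 40, 43, 46, 54, 56, 58, 60, 62, 69)
anemic <- as.numeric(hb < 12)  # 1 = anemia, 0 = no anemia, as in Table 8

fit <- glm(anemic ~ age, family = binomial)
summary(fit)                        # Wald Z test of the Age coefficient
exp(coef(fit)["age"])               # odds ratio, about 0.878
exp(confint.default(fit)["age", ])  # Wald 95% CI, about (0.776, 0.994)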

11. Testing of hypothesis and confidence intervals of logistic regression coefficient

We test the hypothesis that the sample comes from a population in which the logistic regression coefficient $\beta_1$ equals 0. That is, our null hypothesis is $H_0: \beta_1 = 0$ against the alternative hypothesis $H_1: \beta_1 \neq 0$.

Under the null hypothesis, for testing $H_0: \beta_1 = 0$, we define the test statistic

$Z = \dfrac{\hat\beta_1 - E(\hat\beta_1)}{SE(\hat\beta_1)} = \dfrac{\hat\beta_1 - \beta_1}{SE(\hat\beta_1)} = \dfrac{\hat\beta_1 - 0}{SE(\hat\beta_1)} = \dfrac{\hat\beta_1}{SE(\hat\beta_1)} = \dfrac{-0.130}{0.0629} = -2.06.$  (66)

We can obtain the 95% confidence interval of the odds ratio $e^{\beta_1}$ as

$e^{\hat\beta_1 \pm Z\, SE(\hat\beta_1)}$ with Z = 1.96; that is, the lower limit is $e^{\hat\beta_1 - 1.96\, SE(\hat\beta_1)}$ and the upper limit is $e^{\hat\beta_1 + 1.96\, SE(\hat\beta_1)}$.

For Example 9, the lower limit is $e^{-0.130 - 1.96 \times 0.0629} = 0.776$ and the upper limit is $e^{-0.130 + 1.96 \times 0.0629} = 0.993$.

12. Conclusions

The main objective of this chapter is to discuss simple linear regression and logistic regression analysis. We have explained how to estimate regression coefficients and their standard errors, how to test hypotheses about the regression coefficients, and how to obtain 95% confidence intervals for the simple linear regression and logistic regression models. All statistics calculated manually were verified using the RStudio package and Jamovi.

Acknowledgments

The author is thankful to the management committee members of IntechOpen, the referee, and the editor of this edited book for providing an opportunity to share this work.

References

  1. Montgomery DC, Peck EA, Vining GG. Introduction to Linear Regression Analysis. Wiley
  2. Rencher AC, Schaalje GB. Linear Models in Statistics. New Jersey: John Wiley; 2008
  3. Swaminathan S. Regression Detailed View. Published in Towards Data Science; 2018. Available from: https://scholar.google.com.vn/citations?view_op=view_citation&hl=vi&user=K8vtbzAAAAAJ&citation_for_view=K8vtbzAAAAAJ:d1gkVwhDpl0C
  4. Lane DM. Introduction to Linear Regression, Chapter 14 Regression. Available from: https://onlinestatbook.com/2/regression/intro.html
  5. Noce AA, McKeown L. A new benchmark for internet use: A logistic modeling of factors influencing internet use in Canada, 2005. Government Information Quarterly. 2008;25:462-476
  6. Seo D-C et al. Relations between physical activity and behavioral and perceptual correlates among Midwestern college students. Journal of American College Health. 2007;56:187-197
