Thursday, August 20, 2009

Simple logistic regression on qualitative dichotomic variables

In this post we will see briefly how to implement a logistic regression model if you have categorical variables, or qualitative, organized in double entry contingency tables. In this model the dependent variable (Y) and independent variable (X) are both dichotomies (or Bernoullian).

In general, the probability that Y = 1 as a function of predictors is:

$$P(Y=1|X=x)=\pi(x)=\frac{exp(\beta_0+\beta_1x_1+\cdots +\beta_kx_k)}{1+exp(\beta_0+\beta_1x_1+\cdots +\beta_kx_k)}$$

Our goal is to estimate the value of the beta parameters (regressors).

We begin to examine a model of simple logistic regression (with only one predictor).

Consider the following example. The table below shows the results of a study on gastroesophageal reflux. You want to evaluate how the presence of a stress factor can influence the onset of this disease.



First we import the values in R. We must create a table with double entry; proceed as follows:


reflux <- matrix(c(251,131,4,33), nrow=2)
colnames(reflux) <- c("reflNO", "reflYES")
rownames(reflux) <- c("stressNO", "stressYES")
table <- as.table(reflux)

table

reflNO reflYES
stressNO 251 4
stressYES 131 33


Now adjust the data for the logistic regression. We must create a data frame:


dft <- as.data.frame(table)
dft

Var1 Var2 Freq
1 stressNO reflNO 251
2 stressYES reflNO 131
3 stressNO reflYES 4
4 stressYES reflYES 33



We can now fit the model, and then perform the logistic regression in R:


fit <- glm(Var2 ~ Var1, weights = Freq, data = dft, family = binomial(logit))
summary(fit)


Call:
glm(formula = Var2 ~ Var1, family = binomial(logit), data = dft,
weights = Freq)

Deviance Residuals:
1 2 3 4
-2.817 -7.672 5.765 10.287

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.1392 0.5040 -8.213 < 2e-16 ***
Var1stressYES 2.7605 0.5403 5.109 3.23e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 250.23 on 3 degrees of freedom
Residual deviance: 205.86 on 2 degrees of freedom
AIC: 209.86

Number of Fisher Scoring iterations: 6



First we comment the code to perform the regression. The logistic regression is called imposing the family: family = binomial(logit). The code Var2 ~ Var1 means that we want to create a model that will explain the variable var2 (presence or absence of reflux) as a function of the variable var1 (presence or absence of stressful events). In practice var2 is the independent variable Y, and Var1 is the dependent variable X (the regressors). Provided the formula to be analyzed, you specify the weight of each variable, data in column Freq of the dataframe dft (so we write weights = Freq and data = dft to specify the location where the values are contained).

The values of the parameters $\beta_0$ and $\beta_1$ are respectively the values (intercept) and Var1stress1. We can then write our empirical model:

$$\pi(x)=\frac{exp(-4.139+2.760x)}{1+exp(-4.139+2.760x)}$$

The independent variable x can be zero or one. If you assume value 0 (ie in the absence of stressful events), then the probability of having reflux is:

$$\pi(x=0)=\frac{exp(\beta_0)}{1+exp(\beta_0)}=0.016=1.6\%$$

If there are stressful events (x = 1), the probability of having reflux is:

$$\pi(x=1)=\frac{exp(\beta_0+\beta_1)}{1+exp(\beta_0+\beta_1)}=0.20=20\%$$

The odds are:

$$odds(x=1)=\frac{\pi(1)}{1-\pi(1)}=exp(\beta_0+\beta_1)$$

$$odds(x=0)=\frac{\pi(0)}{1-\pi(0)}=exp(\beta_0)$$

We can finally calculate the odd ratio OR:

$$OR=\frac{odds(x=1)}{odds(x=0)}=15.807$$

A person who has experienced a stressful event has a propensity to develop gastroesophageal reflux 15.807 times larger than the person who has not undergone stressful events.

The probabilities and the odds can be readily calculated in R recalling that:

fit$coefficient[1] = $\beta_0$ (intercept)
fit$coefficient[2] = $\beta_1$

Furthermore:

summary(fit)$coefficient[1,2] = standard error of $\beta_0$
summary(fit)$coefficient[2,2] = standard error of $\beta_1$

And so we have:


pi0 <- exp(fit$coefficient[1]) / (1 + exp(fit$coefficient[1]))
pi1 <- exp(fit$coefficient[1] + fit$coefficient[2]) / (1 + exp(fit$coefficient[1]+fit$coefficient[2]))

odds0 <- pi0 / (1 - pi0)
odds1 <- pi1 / (1 - pi1)

OR <- odds1 / odds0

#the same result with:
OR <- exp(fit$coefficient[2])

#the confidence interval for OR is:
ORmin <- exp( fit$coefficient[2] - qnorm(.975) * summary(fit)$coefficient[2,2] )

ORmax <- exp( fit$coefficient[2] + qnorm(.975) * summary(fit)$coefficient[2,2] )


We can obtain the same result for the odd-ratio, using the simplify formula:

$$OR=\frac{ad}{bc}=\frac{251\cdot33}{4\cdot131}=15.807$$

that in R is:


OR <- (table[1,1]*table[2,2]) / (table[1,2]*table[2,1])


The acronym AIC stands for Akaike's information criterion. This parameter does not provide any data on the model just created. It
may be useful in comparing this model with other possibly taken into account (the model with lowest AIC is the better).

No comments:

Post a Comment