Statistic on aiR: Simple linear regression

We use regression analysis when, from a data sample, we want to derive a statistical model that predicts the values of one variable (Y, the dependent variable) from the values of another (X, the independent variable). Linear regression, the simplest and most commonly used model of the relationship between two quantitative variables, can describe a positive relationship (when X increases, Y increases too) or a negative one (when X increases, Y decreases); the direction is indicated by the sign of the slope coefficient b.

To fit the line that best describes the scatter of points, we can rely on different principles. The most common is the least squares method (Model I regression), which is the method used by the lm() function in R.
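
For reference, the least squares criterion can be written out explicitly. This is the standard textbook form for a line Y = a + bX, not something specific to R; the symbols \bar{x} and \bar{y} (the sample means) and n (the number of observations) are notation introduced here for illustration.

```latex
\[
\min_{a,\,b}\ \sum_{i=1}^{n} \bigl(y_i - a - b\,x_i\bigr)^2
\quad\Longrightarrow\quad
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b\,\bar{x}
\]
```

In words: the fitted line is the one that minimizes the sum of the squared vertical distances between the observed points and the line.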

Suppose you want to obtain a linear relationship between weight (kg) and height (cm) of 10 subjects.

Height: 175, 168, 170, 171, 169, 165, 165, 160, 180, 186
Weight: 80, 68, 72, 75, 70, 65, 62, 60, 85, 90

The first problem is to decide which variable is the dependent variable Y and which is the independent variable X. In general, the independent variable is measured without error (or with negligible error), while the dependent variable is affected by error. In our case we can assume that weight is the independent variable (X) and height is the dependent variable (Y).
So our problem is to find a linear relationship (a formula) that allows us to calculate the height of an individual given their weight. The simplest formula is that of a straight line of the form Y = a + bX. The simple regression line in R is calculated as follows:


height = c(175, 168, 170, 171, 169, 165, 165, 160, 180, 186)
weight = c(80, 68, 72, 75, 70, 65, 62, 60, 85, 90)
 
model = lm(formula = height ~ weight, x=TRUE, y=TRUE)
model

Call:
lm(formula = height ~ weight, x = TRUE, y = TRUE)

Coefficients:
(Intercept)       weight  
   115.2002       0.7662

The formula passed to lm() has the form Y ~ X: the dependent variable is declared first, followed by the independent variable (or variables).
The output of the function gives the two parameters a and b: a = 115.2002 (the intercept) and b = 0.7662 (the slope). In other words, each additional kilogram of weight corresponds to an estimated increase of about 0.77 cm in height.
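
As a check, the same coefficients can be reproduced by hand from the closed-form least squares formulas. This is only a sketch of the textbook calculation (lm() itself uses a QR decomposition internally); the names a and b are chosen here to match the notation Y = a + bX.

```r
# Data from the example above
height <- c(175, 168, 170, 171, 169, 165, 165, 160, 180, 186)
weight <- c(80, 68, 72, 75, 70, 65, 62, 60, 85, 90)

# Closed-form least squares estimates for height = a + b * weight
b <- sum((weight - mean(weight)) * (height - mean(height))) /
  sum((weight - mean(weight))^2)
a <- mean(height) - b * mean(weight)

# Compare the hand calculation with the coefficients estimated by lm()
model <- lm(height ~ weight)
round(c(a = a, b = b), 4)
round(coef(model), 4)
```

Both lines should print the same intercept and slope as the lm() output shown earlier.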

Simply calculating the line is not enough. We must assess its significance, i.e. whether the slope b differs significantly from zero. This can be done with a Student's t-test or with Fisher's F-test.
In R both can be retrieved very quickly with the function summary(). Here's how:


model <- lm(height ~ weight)
summary(model)

Call:
lm(formula = height ~ weight)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6622 -0.9683 -0.1622  0.5679  2.2979 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 115.20021    3.48450   33.06 7.64e-10 ***
weight        0.76616    0.04754   16.12 2.21e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1.405 on 8 degrees of freedom
Multiple R-squared: 0.9701,     Adjusted R-squared: 0.9664 
F-statistic: 259.7 on 1 and 8 DF,  p-value: 2.206e-07

The output again reports the estimates of the parameters a and b.
The Student's t statistic for the slope is 16.12 and the t statistic for the intercept is 33.06; the value of Fisher's F statistic is 259.7 (the same value would be obtained by performing an ANOVA on the same model: anova(model)). The p-values of the t-tests and of the F-test are less than 0.05, so the model we found is significant.
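
The test statistics can also be extracted programmatically instead of being read off the printed summary. The sketch below assumes the same data as above; t_slope and f_value are illustrative names. It also shows that, with a single predictor, the F statistic is simply the square of the slope's t statistic, which is why the two tests give the same p-value.

```r
height <- c(175, 168, 170, 171, 169, 165, 165, 160, 180, 186)
weight <- c(80, 68, 72, 75, 70, 65, 62, 60, 85, 90)
model <- lm(height ~ weight)

# Coefficient table: estimates, standard errors, t values and p-values
coefs <- coef(summary(model))
t_slope <- coefs["weight", "t value"]

# F statistic of the overall regression, as stored by summary()
f_value <- unname(summary(model)$fstatistic["value"])

# With one predictor, F equals the squared t value of the slope
c(t_squared = t_slope^2, F = f_value)

# The same F value appears in the ANOVA table
anova(model)
```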
The Multiple R-squared is the coefficient of determination. It measures the proportion of the variability of the dependent variable that is explained by the model. In this case, 97.01% of the variance in height is explained by its linear relationship with weight.
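
To make the meaning of R-squared concrete, it can be recomputed from the residuals. This is a minimal sketch using the same data; ss_res, ss_tot and r2 are names introduced here for illustration.

```r
height <- c(175, 168, 170, 171, 169, 165, 165, 160, 180, 186)
weight <- c(80, 68, 72, 75, 70, 65, 62, 60, 85, 90)
model <- lm(height ~ weight)

# Residual sum of squares: variation left unexplained by the line
ss_res <- sum(residuals(model)^2)
# Total sum of squares: variation of height around its mean
ss_tot <- sum((height - mean(height))^2)

# Coefficient of determination
r2 <- 1 - ss_res / ss_tot

# In simple linear regression this also equals the squared correlation
c(manual = r2,
  from_summary = summary(model)$r.squared,
  cor_squared = cor(height, weight)^2)
```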

We can plot the data points and the regression line as follows:


plot(weight, height)
abline(model)
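
A slightly more informative version of the same plot might label the axes and draw the residuals, i.e. the vertical distances whose squares the least squares method minimizes. This is only a sketch; the labels, colours and line types are arbitrary choices.

```r
height <- c(175, 168, 170, 171, 169, 165, 165, 160, 180, 186)
weight <- c(80, 68, 72, 75, 70, 65, 62, 60, 85, 90)
model <- lm(height ~ weight)

# Scatter plot with labelled axes
plot(weight, height, pch = 19,
     xlab = "Weight (kg)", ylab = "Height (cm)",
     main = "Height vs. weight")

# Fitted regression line
abline(model, col = "red")

# Dashed vertical segments from each observation to its fitted value
segments(weight, height, weight, fitted(model), lty = 2)
```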

2 comments:

Will Dwinnell, August 7, 2009 at 12:11 PM
"The simple calculation of the line is not enough. We must assess the significance of the line, ie if the slope b differs from zero significantly. This may be done with a Student's t.test or with a Fisher's F-test."

It's worth noting that empirical models (linear or not) can also be evaluated by executing the model on out-of-sample data (data not used in the construction of the model), and measuring the quality of the fit.

Todos Logos, August 7, 2009 at 4:02 PM
@Will Dwinnell
Thanks for the suggestion. It can also be useful to calculate the confidence interval for a predicted value. Suppose you want to predict the height when weight = 83; we can proceed in this way:
predict(model, data.frame(weight=83), interval="confidence", level=.95)
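
Note that interval="confidence" gives the interval for the mean height of subjects weighing 83 kg. For the height of a single new subject, predict() also accepts interval="prediction", which is wider because it includes the individual scatter around the line. A minimal sketch comparing the two, with new_subject, conf_int and pred_int as illustrative names:

```r
height <- c(175, 168, 170, 171, 169, 165, 165, 160, 180, 186)
weight <- c(80, 68, 72, 75, 70, 65, 62, 60, 85, 90)
model <- lm(height ~ weight)

new_subject <- data.frame(weight = 83)

# Interval for the mean height at weight = 83
conf_int <- predict(model, new_subject, interval = "confidence", level = 0.95)

# Interval for the height of one new individual at weight = 83
pred_int <- predict(model, new_subject, interval = "prediction", level = 0.95)

# Each result has the columns fit, lwr and upr
rbind(confidence = conf_int[1, ], prediction = pred_int[1, ])
```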

Thursday, August 6, 2009