
Thursday, August 6, 2009

Simple linear regression

We use regression analysis when, from a data sample, we want to derive a statistical model that predicts the values of one variable (Y, the dependent variable) from the values of another variable (X, the independent variable). Linear regression, the simplest and most frequent relationship between two quantitative variables, can be positive (when X increases, Y increases too) or negative (when X increases, Y decreases): this is indicated by the sign of the coefficient b.

To build the line that describes the distribution of the points, we can rely on different principles. The most common is the least squares method (or Model I regression), and this is the method used by the statistical software R.

Suppose you want to obtain a linear relationship between weight (kg) and height (cm) of 10 subjects.
Height: 175, 168, 170, 171, 169, 165, 165, 160, 180, 186
Weight: 80, 68, 72, 75, 70, 65, 62, 60, 85, 90


The first problem is to decide which is the dependent variable Y and which is the independent variable X. In general, the independent variable is not affected by measurement error (or only by random error), while the dependent variable is affected by error. In our case we can assume that weight is the independent variable (X), and height the dependent variable (Y).
So our problem is to find a linear relationship (formula) that allows us to calculate the height, knowing the weight of an individual. The simplest such formula is that of a straight line of the form Y = a + bX. The simple regression line in R is calculated as follows:


height = c(175, 168, 170, 171, 169, 165, 165, 160, 180, 186)
weight = c(80, 68, 72, 75, 70, 65, 62, 60, 85, 90)

model = lm(formula = height ~ weight, x=TRUE, y=TRUE)
model

Call:
lm(formula = height ~ weight, x = TRUE, y = TRUE)

Coefficients:
(Intercept) weight
115.2002 0.7662


The correct syntax of the formula declared in lm is Y ~ X: you declare the dependent variable first, followed by the independent variable (or variables).
The output of the function gives the two parameters a and b: a = 115.2002 (the intercept), b = 0.7662 (the slope).
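As a quick sketch (not part of the original output shown above), the estimated coefficients can also be extracted programmatically with coef(), and used to predict the height for a hypothetical weight, for instance 70 kg:

coef(model)   # returns the intercept a and the slope b

# predicted height for a hypothetical weight of 70 kg: a + b * 70
# with the estimates above, roughly 115.2 + 0.766 * 70 = 168.8 cm
coef(model)[1] + coef(model)[2] * 70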




The simple calculation of the line is not enough. We must assess the significance of the line, i.e., whether the slope b differs significantly from zero. This can be done with a Student's t-test or with a Fisher's F-test.
In R both can be retrieved very quickly with the function summary(). Here's how:


model <- lm(height ~ weight)
summary(model)

Call:
lm(formula = height ~ weight)

Residuals:
Min 1Q Median 3Q Max
-1.6622 -0.9683 -0.1622 0.5679 2.2979

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 115.20021 3.48450 33.06 7.64e-10 ***
weight 0.76616 0.04754 16.12 2.21e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.405 on 8 degrees of freedom
Multiple R-squared: 0.9701, Adjusted R-squared: 0.9664
F-statistic: 259.7 on 1 and 8 DF, p-value: 2.206e-07



Here too we find the values of the parameters a and b.
The Student's t-test on the slope has the value 16.12; the Student's t-test on the intercept has the value 33.06; the value of the Fisher's F-test is 259.7 (the same value would be obtained by performing an ANOVA on the same data: anova(model)). The p-values of the t-tests and of the F-test are less than 0.05, so the model we found is significant.
The Multiple R-squared is the coefficient of determination, which measures how well the model fits the data: in this case, about 97% of the variability of height is explained by the regression on weight.
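If we want to pull these quantities out of the model programmatically instead of reading them from the printout, a minimal sketch (assuming the model object created above is still in the workspace) is:

# coefficient of determination (the Multiple R-squared reported by summary)
summary(model)$r.squared

# ANOVA table of the regression: its F value matches the F-statistic above
anova(model)

# 95% confidence intervals for the intercept a and the slope b
confint(model)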

We can plot the data points and the regression line on a graph in this way:


plot(weight, height)
abline(model)
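The fitted model can also be used to predict the height of a new subject with the function predict(). For example, for a hypothetical subject weighing 77 kg (a sketch; the argument interval="confidence" adds a 95% confidence interval for the predicted mean):

# predicted height, with 95% confidence interval, for a new weight of 77 kg
new.data = data.frame(weight = 77)
predict(model, newdata = new.data, interval = "confidence")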

Monday, July 27, 2009

Paired Student's t-test

Comparison of the means of two sets of paired samples, taken from two populations with unknown variance.

A school athletics team has taken on a new instructor, and wants to test the effectiveness of the new type of training he proposes by comparing the average times of 10 runners in the 100 meters. Below are the times in seconds, before and after training, for each athlete.

Before training: 12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3
After training: 12.7, 13.6, 12.0, 15.2, 16.8, 20.0, 12.0, 15.9, 16.0, 11.1


In this case we have two sets of paired samples, since the measurements were made on the same athletes before and after the workout. To see whether there was an improvement, a deterioration, or whether the mean times have remained substantially the same (hypothesis H0), we need to perform a Student's t-test for paired samples, proceeding in this way:



a = c(12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3)
b = c(12.7, 13.6, 12.0, 15.2, 16.8, 20.0, 12.0, 15.9, 16.0, 11.1)

t.test(a,b, paired=TRUE)

Paired t-test

data: a and b
t = -0.2133, df = 9, p-value = 0.8358
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.5802549 0.4802549
sample estimates:
mean of the differences
-0.05


The p-value is greater than 0.05, so we accept the hypothesis H0 of equality of the means. In conclusion, the new training has not produced any significant improvement (or deterioration) in the team of athletes.
Similarly, we can compare the computed t with the tabulated t-value:


qt(0.975, 9)
[1] 2.262157


|t-computed| < t-tabulated, so we accept the null hypothesis H0.
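As a side note, a paired t-test is equivalent to a one-sample t-test performed on the differences; the following sketch should reproduce the same t, df and p-value obtained above:

# the paired t-test is the same as a one-sample t-test on the differences a - b
t.test(a - b, mu = 0)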




Suppose now that the manager of the team (given the results obtained) fires the coach, who has not produced any improvement, and hires another, more promising one. We report the times of the athletes after the second training:

Before training: 12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3
After the second training: 12.0, 12.2, 11.2, 13.0, 15.0, 15.8, 12.2, 13.4, 12.9, 11.0


Now we check whether there was actually an improvement, i.e., we perform a t-test for paired data, specifying in R a one-sided alternative hypothesis H1. To do this, simply add the argument alt="less" (or alt="greater") when calling the t-test:


a = c(12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3)
b = c(12.0, 12.2, 11.2, 13.0, 15.0, 15.8, 12.2, 13.4, 12.9, 11.0)

t.test(a,b, paired=TRUE, alt="less")

Paired t-test

data: a and b
t = 5.2671, df = 9, p-value = 0.9997
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf 2.170325
sample estimates:
mean of the differences
1.61


With this syntax we asked R to check whether the mean of the values contained in vector a is less than the mean of the values contained in vector b. The p-value is well above 0.05, so there is no evidence in favor of this alternative; indeed, an improvement means the opposite, namely that the times after the second training (b) are smaller than the times before (a).

If instead we write t.test(a, b, paired = TRUE, alt = "greater"), we ask R to check whether the mean of the values contained in vector a is greater than the mean of the values contained in vector b, which is exactly the alternative hypothesis H1 of an improvement. In light of the previous result, we expect the p-value to be much smaller than 0.05, and in fact it is (p = 0.0002579, see below): we reject H0 in favor of H1 and conclude that the second training produced a significant improvement in the times.


a = c(12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3)
b = c(12.0, 12.2, 11.2, 13.0, 15.0, 15.8, 12.2, 13.4, 12.9, 11.0)

t.test(a,b, paired=TRUE, alt="greater")

Paired t-test

data: a and b
t = 5.2671, df = 9, p-value = 0.0002579
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
1.049675 Inf
sample estimates:
mean of the differences
1.61

Sunday, July 26, 2009

Two sample Student's t-test #2

Comparison of the means of two independent groups, drawn from two populations with unknown variance; the sample variances are not homogeneous.

We want to compare the heights (in cm) of two groups of individuals. Here are the measurements:

A: 175, 168, 168, 190, 156, 181, 182, 175, 174, 179
B: 120, 180, 125, 188, 130, 190, 110, 185, 112, 188


As we have seen in a previous exercise, we must first check whether the variances are homogeneous (homoskedasticity) with a Fisher's F-test:



a = c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)
b = c(120, 180, 125, 188, 130, 190, 110, 185, 112, 188)

var.test(b,a)

F test to compare two variances

data: b and a
F = 14.6431, num df = 9, denom df = 9, p-value = 0.0004636
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
3.637133 58.952936
sample estimates:
ratio of variances
14.64308


We obtained a p-value less than 0.05, so the two variances are not homogeneous. Indeed, we can compare the computed value of F with the tabulated value of F for alpha = 0.05, with 9 degrees of freedom in the numerator and 9 degrees of freedom in the denominator, using the function qf(p, df.num, df.den):


qf(0.95, 9, 9)
[1] 3.178893


F-computed is greater than F-tabulated, so we can reject the null hypothesis H0 of homogeneity of variances.

To compare the two groups, we use the function t.test with non-homogeneous variances (var.equal = FALSE, which can be omitted, because the function assumes non-homogeneous variances by default) and independent samples (paired = FALSE, which can also be omitted, because by default the function works on independent samples), in this way:


t.test(a,b, var.equal=FALSE, paired=FALSE)

Welch Two Sample t-test

data: a and b
t = 1.8827, df = 10.224, p-value = 0.08848
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.95955 47.95955
sample estimates:
mean of x mean of y
174.8 152.8


As we see in the heading of the output, R performed a two-sample t-test with the degrees of freedom calculated using the Welch-Satterthwaite formula (here df = 10.224), which is used when the variances are not homogeneous. The Welch-Satterthwaite equation is also called the Dixon-Massey formula when the comparison is between two groups, as in this case.
We obtained a p-value greater than 0.05, so we conclude that the means of the two groups are not significantly different (although the p-value is fairly close to the 0.05 threshold). Indeed, the computed t is less than the tabulated t-value for 10.224 degrees of freedom, which in R we can calculate:


qt(0.975, 10.224)
[1] 2.221539


We can accept the hypothesis H0 of equality of means.


Welch-Satterthwaite formula (general form, for k groups):

$$df=\frac{\left(\displaystyle\sum_{i=1}^{k}\frac{S_i^2}{n_i}\right)^2}{\displaystyle\sum_{i=1}^{k}\frac{\left(S_i^2/n_i\right)^2}{n_i-1}}$$


Dixon-Massey formula:

$$df=\frac{\left(\frac{\displaystyle S_1^2}{\displaystyle n_1}+\frac{\displaystyle S_2^2}{\displaystyle n_2}\right)^2}{\frac{\displaystyle\left(\frac{\displaystyle S_1^2}{\displaystyle n_1}\right)^2}{\displaystyle n_1-1}+\frac{\displaystyle\left(\frac{\displaystyle S_2^2}{\displaystyle n_2}\right)^2}{\displaystyle n_2-1}}$$
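To see where df = 10.224 comes from, here is a minimal sketch that applies the Dixon-Massey formula directly to the two samples a and b defined above:

# Welch-Satterthwaite (Dixon-Massey) degrees of freedom computed by hand
s1 = var(a) / length(a)   # S1^2 / n1
s2 = var(b) / length(b)   # S2^2 / n2
(s1 + s2)^2 / (s1^2 / (length(a) - 1) + s2^2 / (length(b) - 1))
# approximately 10.224, the df reported by t.test above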

Saturday, July 25, 2009

Two sample Student's t-test #1

t-Test to compare the means of two groups, under the assumption that both samples are random, independent, and come from normally distributed populations with unknown but equal variances.

Here I will use the same data just seen in a previous post. The data are given below:

A: 175, 168, 168, 190, 156, 181, 182, 175, 174, 179
B: 185, 169, 173, 173, 188, 186, 175, 174, 179, 180


To solve this problem we must use a Student's t-test for two samples, assuming that the two samples are taken from populations that follow a Gaussian distribution (if we cannot assume that, we must solve the problem with the non-parametric Wilcoxon-Mann-Whitney test; we will see this test in a future post). Before proceeding with the t-test, it is necessary to compare the sample variances of the two groups, using a Fisher's F-test to verify the homoskedasticity (homogeneity of variances). In R you can do this as follows:


a = c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)
b = c(185, 169, 173, 173, 188, 186, 175, 174, 179, 180)

var.test(a,b)

F test to compare two variances

data: a and b
F = 2.1028, num df = 9, denom df = 9, p-value = 0.2834
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.5223017 8.4657950
sample estimates:
ratio of variances
2.102784


We obtained a p-value greater than 0.05, so we can assume that the two variances are homogeneous. Indeed, we can compare the computed value of F with the tabulated value of F for alpha = 0.05, with 9 degrees of freedom in the numerator and 9 degrees of freedom in the denominator, using the function qf(p, df.num, df.den):


qf(0.95, 9, 9)
[1] 3.178893


Note that the computed value of F is less than the tabulated value of F, which leads us to accept the null hypothesis of homogeneity of variances.
NOTE: here F is compared against a single (upper) tail of the F distribution, so with a confidence level of 95% we use p = 0.95 in qf(p, df1, df2). Conversely, the t-test is two-tailed, so in R's function qt(p, df) we insert p = 0.975 when testing a two-tailed alternative hypothesis.

Then we call the function t.test for homogeneous variances (var.equal = TRUE) and independent samples (paired = FALSE; you can omit this because the function works on independent samples by default) in this way:


t.test(a,b, var.equal=TRUE, paired=FALSE)

Two Sample t-test

data: a and b
t = -0.9474, df = 18, p-value = 0.356
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-10.93994 4.13994
sample estimates:
mean of x mean of y
174.8 178.2


We obtained a p-value greater than 0.05, so we conclude that the means of the two groups are not significantly different. Indeed, the absolute value of the computed t is less than the tabulated t-value for 18 degrees of freedom, which in R we can calculate:


qt(0.975, 18)
[1] 2.100922


This confirms that we can accept the null hypothesis H0 of equality of the means.
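For completeness, here is a minimal sketch of the same test computed by hand with the pooled variance, which should reproduce t = -0.9474 with df = 18:

# pooled two-sample t statistic computed by hand
n1 = length(a); n2 = length(b)
sp2 = ((n1 - 1) * var(a) + (n2 - 1) * var(b)) / (n1 + n2 - 2)   # pooled variance
(mean(a) - mean(b)) / sqrt(sp2 * (1/n1 + 1/n2))
# about -0.9474, with df = n1 + n2 - 2 = 18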

Friday, July 24, 2009

One sample Student's t-test

Comparison of the sample mean with a known value, when the variance of the population is not known.

Consider the exercise we have just seen before.
An intelligence test was administered to 10 subjects, and the results obtained are reported below. The average score of the population that received the same test is equal to 75. We want to check whether the sample mean differs significantly from the population mean (at a significance level of 95%), assuming that the variance of the population is not known.
65, 78, 88, 55, 48, 95, 66, 57, 79, 81


Unlike the one-sample Z-test, the Student's t-test for a single sample has a built-in function in R that we can apply immediately.
It is t.test(a, mu), which we can see applied below.


a = c(65, 78, 88, 55, 48, 95, 66, 57, 79, 81)

t.test (a, mu=75)

One Sample t-test

data: a
t = -0.783, df = 9, p-value = 0.4537
alternative hypothesis: true mean is not equal to 75
95 percent confidence interval:
60.22187 82.17813
sample estimates:
mean of x
71.2


The function t.test on one sample outputs the computed value of t; it also gives us the degrees of freedom, the confidence interval, and the sample mean (mean of x).
To make the statistical decision, we can proceed in two ways. First, we can compare the computed value of t with the tabulated Student's t with 9 degrees of freedom. If we do not have tables, we can calculate the tabulated t-value in the following way:


qt(0.975, 9)
[1] 2.262157



The function qt(p, df) returns the tabulated value of t given the significance level (we chose a significance level of 95%, which means each tail holds 2.5%, corresponding to p = 1 - 0.025 = 0.975) and the degrees of freedom. Comparing the tabulated t with the computed t, the absolute value of the computed t is smaller, which means that we accept the null hypothesis of equality of the means: our sample mean is not significantly different from the population mean.
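As a cross-check, here is a minimal sketch of the same t statistic computed by hand, which should reproduce the value t = -0.783 reported above:

# one-sample t statistic computed by hand
(mean(a) - 75) / (sd(a) / sqrt(length(a)))
# about -0.783, as reported by t.test above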

Alternatively, we can consider the p-value. With a significance level of 95%, remember this rule: if the p-value is greater than 0.05, we accept the null hypothesis H0; if the p-value is less than 0.05, we reject the null hypothesis H0 in favor of the alternative hypothesis H1.