Friday, July 31, 2009

Kruskal-Wallis one-way analysis of variance

If you need to compare multiple groups, but cannot run an ANOVA because you cannot assume that the groups follow a Gaussian (normal) distribution, you can use the Kruskal-Wallis test, the non-parametric alternative to one-way ANOVA.
This test is the analogue, for more than 2 groups, of the Wilcoxon test for 2 samples.

Suppose you want to see if the means of the following 4 sets of values are statistically similar:
Group A: 1, 5, 8, 17, 16
Group B: 2, 16, 5, 7, 4
Group C: 1, 1, 3, 7, 9
Group D: 2, 15, 2, 9, 7


To run the Kruskal-Wallis test, simply enter the data and then organize them into a list:


a = c(1, 5, 8, 17, 16)
b = c(2, 16, 5, 7, 4)
c = c(1, 1, 3, 7, 9)
d = c(2, 15, 2, 9, 7)

dati = list(g1=a, g2=b, g3=c, g4=d)


Now we can apply the kruskal.test() function:


kruskal.test(dati)

Kruskal-Wallis rank sum test

data: dati
Kruskal-Wallis chi-squared = 1.9217, df = 3, p-value = 0.5888


The value of the test statistic is 1.9217. This value already includes the correction applied when there are ties (repeated values). The p-value is greater than 0.05; moreover, the value of the test statistic is lower than the tabulated chi-squared value:


qchisq(0.950, 3)
[1] 7.814728


The conclusion is therefore that we accept the null hypothesis H0: the means of the 4 groups are statistically equal.
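To see where this number comes from, here is a minimal sketch (my addition; the variable names are mine) that recomputes the tie-corrected H statistic by hand from the pooled ranks:

x = c(a, b, c, d) # pool all 20 observations
g = rep(1:4, each = 5) # group membership of each observation
r = rank(x) # ranks of the pooled data (ties receive average ranks)
N = length(x)
H = 12 / (N * (N + 1)) * sum(tapply(r, g, sum)^2 / 5) - 3 * (N + 1) # uncorrected statistic
tie = table(x) # counts of each distinct value, needed for the tie correction
H / (1 - sum(tie^3 - tie) / (N^3 - N)) # tie-corrected H
[1] 1.921673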

Thursday, July 30, 2009

Analysis of variance: ANOVA, for multiple comparisons


The ANOVA model can be used to compare the means of several groups with each other, using a parametric method (assuming that the groups follow a Gaussian distribution).
Let's proceed with the following example:

The manager of a supermarket chain wants to check whether the electricity consumption, in kilowatts, of 4 stores is the same. He collects data at the end of each month for 6 months. The results are:
Store A: 65, 48, 66, 75, 70, 55
Store B: 64, 44, 70, 70, 68, 59
Store C: 60, 50, 65, 69, 69, 57
Store D: 62, 46, 68, 72, 67, 56


Before proceeding with the ANOVA, we must first verify the homoskedasticity (i.e., test the homogeneity of the variances). R provides two tests for this purpose: the Bartlett test and the Fligner-Killeen test.




We begin with the Bartlett test.

First we create the 4 vectors:


a = c(65, 48, 66, 75, 70, 55)
b = c(64, 44, 70, 70, 68, 59)
c = c(60, 50, 65, 69, 69, 57)
d = c(62, 46, 68, 72, 67, 56)


Now we combine the 4 vectors in a single vector:


dati = c(a, b, c, d)


Now, on this vector in which all the data are stored, we create the 4 levels:


groups = factor(rep(letters[1:4], each = 6))


We can observe the contents of the vector groups simply by typing groups and pressing [enter].
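If everything went well, it will look like this:

groups
[1] a a a a a a b b b b b b c c c c c c d d d d d d
Levels: a b c d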

At this point we start the Bartlett test:


bartlett.test(dati, groups)

Bartlett test of homogeneity of variances

data: dati and groups
Bartlett's K-squared = 0.4822, df = 3, p-value = 0.9228


The function returns the value of the test statistic (Bartlett's K-squared) and the p-value. Since the p-value is greater than 0.05, we can conclude that the variances are homogeneous. Alternatively, we can compare Bartlett's K-squared with the tabulated chi-squared value, which we compute by passing the confidence level and the degrees of freedom to the qchisq function:


qchisq(0.950, 3)
[1] 7.814728


Tabulated chi-squared > Bartlett's K-squared: we accept the null hypothesis H0 (homogeneity of variances).




Let us now check the homoskedasticity with the Fligner-Killeen test.
The syntax is quite similar, so we proceed quickly.


a = c(65, 48, 66, 75, 70, 55)
b = c(64, 44, 70, 70, 68, 59)
c = c(60, 50, 65, 69, 69, 57)
d = c(62, 46, 68, 72, 67, 56)

dati = c(a, b, c, d)

groups = factor(rep(letters[1:4], each = 6))

fligner.test(dati, groups)

Fligner-Killeen test of homogeneity of variances

data: dati and groups
Fligner-Killeen:med chi-squared = 0.1316, df = 3, p-value = 0.9878


The conclusions are similar to those of the Bartlett test.




Having verified the homoskedasticity of the 4 groups, we can proceed with the ANOVA model.

First we organize the values and fit the linear model:


fit = lm(formula = dati ~ groups)


Then we analyze the ANOVA model:


anova(fit)

Analysis of Variance Table

Response: dati
          Df  Sum Sq Mean Sq F value Pr(>F)
groups     3    8.46    2.82  0.0327 0.9918
Residuals 20 1726.50   86.33


The output of the function is a classical ANOVA table with the following entries:
Df = degrees of freedom
Sum Sq = the sum of squares (between groups on the groups row; within groups on the Residuals row)
Mean Sq = the corresponding variance estimates (Sum Sq divided by Df)
F value = the value of the Fisher test statistic, computed as (variance between groups) / (residual variance within groups)
Pr(>F) = the p-value

Since the p-value > 0.05, we accept the null hypothesis H0: the four means are statistically equal. We can also compare the computed F value with the tabulated F value, with 3 degrees of freedom in the numerator and 20 in the denominator:

qf(0.950, 3, 20)
[1] 3.098391


Tabulated F-value > computed F-value: we accept the null hypothesis.
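As a cross-check (my addition, not part of the original post), we can rebuild the table's sums of squares and the F value directly from their definitions:

group.means = tapply(dati, groups, mean) # the 4 group means
ss.between = 6 * sum((group.means - mean(dati))^2) # Sum Sq of groups, df = 3
ss.within = sum((dati - ave(dati, groups))^2) # residual Sum Sq, df = 20
round((ss.between / 3) / (ss.within / 20), 4) # the F value: ratio of the two variance estimates
[1] 0.0327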

Wednesday, July 29, 2009

Comparison of two proportions: parametric (Z-test) and non-parametric (chi-squared) methods

Consider for example the following problem.
The owner of a betting company wants to verify whether one of his customers is cheating. To do this, he compares the number of wins of this player with the number of wins of one of his employees, whom he is certain is not cheating. In one month the player makes 74 bets and wins 30; in the same period the employee makes 103 bets and wins 65. Is the customer cheating or not?

A problem of this kind can be solved in two different ways: using a parametric and a non-parametric method.

* Solution with the parametric method: Z-test.

You can use a Z-test if you can make the following two assumptions: the common probability of success is approximately 0.5, and the number of games is very high (under these assumptions, a binomial distribution is well approximated by a Gaussian distribution). Suppose this is the case. In R there is no built-in function to calculate the value of Z, so we recall the mathematical formula and write the function ourselves:

$$Z=\frac{\frac{x_1}{n_1}-\frac{x_2}{n_2}}{\sqrt{\widehat{p}(1-\widehat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}}$$


z.prop = function(x1, x2, n1, n2){
numerator = (x1/n1) - (x2/n2) # difference between the two sample proportions
p.common = (x1+x2) / (n1+n2) # pooled estimate of the common success probability
denominator = sqrt(p.common * (1-p.common) * (1/n1 + 1/n2)) # standard error of the difference
z.prop.ris = numerator / denominator
return(z.prop.ris)
}


The z.prop function calculates the value of Z, taking as input the numbers of successes (x1 and x2) and the total numbers of games (n1 and n2). We apply the function just written to the data of our problem:


z.prop(30, 65, 74, 103)
[1] -2.969695


We obtained a z value whose absolute value is greater than the tabulated z value (1.96 for a two-tailed test with alpha = 0.05). This leads us to reject the hypothesis of equal success probabilities: the success rate of the player under observation (30/74 = 0.405) differs significantly from that of the honest employee (65/103 = 0.631).
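As a quick cross-check (my addition, not in the original post), we can convert this z value into a two-sided p-value with pnorm, and verify that the square of z equals the X-squared statistic that prop.test reports in the next section:

z = z.prop(30, 65, 74, 103)
round(2 * pnorm(-abs(z)), 6) # two-sided p-value from the standard normal
[1] 0.002981
round(z^2, 4) # z squared equals the chi-squared statistic without continuity correction
[1] 8.8191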

* Solution with the non-parametric method: Chi-squared test.


Suppose now that we cannot make any assumptions about the data of the problem, so that we cannot approximate the binomial with a Gaussian. We then solve the problem with a chi-squared test applied to a 2x2 contingency table. In R this is done with the function prop.test.


prop.test(x = c(30, 65), n = c(74, 103), correct = FALSE)

2-sample test for equality of proportions without continuity correction

data: c(30, 65) out of c(74, 103)
X-squared = 8.8191, df = 1, p-value = 0.002981
alternative hypothesis: two.sided
95 percent confidence interval:
-0.37125315 -0.08007196
sample estimates:
prop 1 prop 2
0.4054054 0.6310680


The prop.test function calculates the chi-squared value, given the numbers of successes (in the vector x) and the total numbers of attempts (in the vector n). The vectors x and n can also be declared beforehand and then passed as usual: prop.test(x, n, correct = FALSE).

In the case of small samples (low values of n), you should specify correct = TRUE, which applies Yates' continuity correction to the chi-squared computation:


prop.test(x = c(30, 65), n = c(74, 103), correct=TRUE)

2-sample test for equality of proportions with continuity correction

data: c(30, 65) out of c(74, 103)
X-squared = 7.9349, df = 1, p-value = 0.004849
alternative hypothesis: two.sided
95 percent confidence interval:
-0.38286428 -0.06846083
sample estimates:
prop 1 prop 2
0.4054054 0.6310680


In both cases we obtained a p-value less than 0.05, which leads us to reject the hypothesis of equal probabilities: in conclusion, the customer is cheating. For confirmation, we can compare the computed chi-squared value with the tabulated chi-squared value, which we calculate in this way:


qchisq(0.950, 1)
[1] 3.841459


The qchisq function returns the chi-squared quantile as a function of the confidence level and the degrees of freedom. Since the computed chi-squared is greater than the tabulated value, we conclude by rejecting the hypothesis H0 (in agreement with the p-value, and with the parametric test).

Wilcoxon signed rank test

Non-parametric statistical hypothesis test, for the comparison of the means between 2 paired samples

The mayor of a city wants to see whether pollution levels are reduced by closing the streets to car traffic. Pollution is measured every 60 minutes (from 8 AM to 10 PM: 15 measurements in total) on a day with open traffic, and on a day closed to traffic. Here are the pollution values:

With traffic: 214, 159, 169, 202, 103, 119, 200, 109, 132, 142, 194, 104, 219, 119, 234
Without traffic: 159, 135, 141, 101, 102, 168, 62, 167, 174, 159, 66, 118, 181, 171, 112


The two groups are clearly paired, because there is a link between the readings: we are considering the same city (with its particular weather, ventilation, and so on), albeit on two different days. Since we cannot assume a Gaussian distribution for the recorded values, we must proceed with a non-parametric test: the Wilcoxon signed rank test.


a <- c(214, 159, 169, 202, 103, 119, 200, 109, 132, 142, 194, 104, 219, 119, 234)
b <- c(159, 135, 141, 101, 102, 168, 62, 167, 174, 159, 66, 118, 181, 171, 112)

wilcox.test(a,b, paired=TRUE)

Wilcoxon signed rank test

data: a and b
V = 80, p-value = 0.2769
alternative hypothesis: true location shift is not equal to 0



Since the p-value is greater than 0.05, we conclude that the means have remained essentially unchanged (we accept the null hypothesis H0): blocking traffic for a single day did not lead to any improvement in the city's pollution.
The value V = 80 is the sum of the ranks assigned to the differences with positive sign. We can manually calculate both this sum and the sum of the ranks assigned to the differences with negative sign, compare this pair of values with the interval tabulated in the Wilcoxon tables for paired samples, and so confirm our statistical decision. Here is how to calculate the two sums.


a <- c(214, 159, 169, 202, 103, 119, 200, 109, 132, 142, 194, 104, 219, 119, 234)
b <- c(159, 135, 141, 101, 102, 168, 62, 167, 174, 159, 66, 118, 181, 171, 112)

diff <- a - b # compute the vector of the differences
diff <- diff[ diff!=0 ] # delete all differences equal to zero
diff.rank <- rank(abs(diff)) # rank the absolute values of the differences
diff.rank.sign <- diff.rank * sign(diff) # attach to each rank the sign of the corresponding difference
ranks.pos <- sum(diff.rank.sign[diff.rank.sign > 0]) # sum of the ranks of the positive differences
ranks.neg <- -sum(diff.rank.sign[diff.rank.sign < 0]) # sum of the ranks of the negative differences
ranks.pos # this is the value V reported by the Wilcoxon signed rank test
[1] 80
ranks.neg
[1] 40


The computed pair of rank sums is (40, 80). The interval tabulated in the Wilcoxon tables for paired samples with 15 differences is (25, 95). Since the computed values fall inside the tabulated interval, we accept the null hypothesis H0 of equality of the means. As anticipated by the p-value, closing the roads to traffic did not bring any improvement in the pollution levels.
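Since there are no tied or zero differences here, wilcox.test used the exact distribution of V; as a further check (my addition), the same two-sided p-value can be recovered with psignrank, the distribution function of the signed rank statistic:

round(2 * (1 - psignrank(79, 15)), 4) # P(V >= 80), doubled for the two-sided test
[1] 0.2769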

Tuesday, July 28, 2009

Wilcoxon-Mann-Whitney rank sum test (or test U)

Comparison of the means of two independent groups of samples for which we cannot assume a Gaussian distribution; also known as the Mann-Whitney U-test.

We want to check whether the mean number of goals conceded by two football teams over the years is the same. Below are the goals conceded by each team in 6 matches:
Team A: 6, 8, 2, 4, 4, 5
Team B: 7, 10, 4, 3, 5, 6


The Wilcoxon-Mann-Whitney test (or Wilcoxon rank sum test, or Mann-Whitney U-test) is used when we must compare the means of two groups that do not follow a normal distribution: it is a non-parametric test. It is the equivalent of the t-test for independent samples.
Let's see how to solve the problem with R:


a = c(6, 8, 2, 4, 4, 5)
b = c(7, 10, 4, 3, 5, 6)

wilcox.test(a,b, correct=FALSE)

Wilcoxon rank sum test

data: a and b
W = 14, p-value = 0.5174
alternative hypothesis: true location shift is not equal to 0


The p-value is greater than 0.05, so we accept the hypothesis H0 of statistical equality of the means of the two groups.
If we run wilcox.test(b, a, correct = FALSE), the p-value is of course the same:


a = c(6, 8, 2, 4, 4, 5)
b = c(7, 10, 4, 3, 5, 6)

wilcox.test(b,a, correct=FALSE)

Wilcoxon rank sum test

data: b and a
W = 22, p-value = 0.5174
alternative hypothesis: true location shift is not equal to 0


The value of W is computed as follows:


sum.rank.a = sum(rank(c(a,b))[1:6]) # sum of the ranks assigned to group a
W = sum.rank.a - (length(a)*(length(a)+1)) / 2
W
[1] 14

sum.rank.b = sum(rank(c(a,b))[7:12]) # sum of the ranks assigned to group b
W = sum.rank.b - (length(b)*(length(b)+1)) / 2
W
[1] 22


Finally, we can compare our rank sums with the interval tabulated in the Wilcoxon tables for independent samples. The tabulated interval for two groups of 6 samples each is (26, 52), while the rank sums of our samples are:


sum(rank(c(a,b))[1:6]) #sum of ranks assigned to the group a
[1] 35
sum(rank(c(a,b))[7:12]) #sum of ranks assigned to the group b
[1] 43


Since the computed rank sums (35, 43) fall within the tabulated interval (26, 52), we conclude by accepting the hypothesis H0 of equality of the means.

Monday, July 27, 2009

Paired Student's t-test

Comparison of the means of two sets of paired samples, taken from two populations with unknown variance.

An athletics school has hired a new instructor, and wants to test the effectiveness of the proposed new type of training by comparing the average 100-meter times of 10 runners. Below are the times in seconds, before and after the training, for each athlete.

Before training: 12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3
After training: 12.7, 13.6, 12.0, 15.2, 16.8, 20.0, 12.0, 15.9, 16.0, 11.1


In this case we have two sets of paired samples, since the measurements were made on the same athletes before and after the workout. To see whether there was an improvement, a deterioration, or whether the mean times remained substantially the same (hypothesis H0), we perform a Student's t-test for paired samples, proceeding in this way:



a = c(12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3)
b = c(12.7, 13.6, 12.0, 15.2, 16.8, 20.0, 12.0, 15.9, 16.0, 11.1)

t.test(a,b, paired=TRUE)

Paired t-test

data: a and b
t = -0.2133, df = 9, p-value = 0.8358
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.5802549 0.4802549
sample estimates:
mean of the differences
-0.05


The p-value is greater than 0.05, so we accept the hypothesis H0 of equality of the means. In conclusion, the new training did not bring any significant improvement (or deterioration) to the team of athletes.
As before, we can also compute the tabulated t value:


qt(0.975, 9)
[1] 2.262157


|t-computed| < t-tabulated, so we accept the null hypothesis H0.
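A paired t-test is simply a one-sample t-test on the differences; this quick sketch (my addition) recovers the same t value by hand:

d = a - b # paired differences
round(mean(d) / (sd(d) / sqrt(length(d))), 4) # mean difference divided by its standard error
[1] -0.2133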




Suppose now that the team manager, given the results obtained, fires the coach who brought no improvement, and hires another, more promising one. Here are the athletes' times after the second training:

Before training: 12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3
After the second training: 12.0, 12.2, 11.2, 13.0, 15.0, 15.8, 12.2, 13.4, 12.9, 11.0


Now we check whether there was actually an improvement, i.e. we perform a paired t-test with a one-sided alternative hypothesis H1, specified through the alt argument of t.test. Let us first see what happens with alt = "less", which tests whether the mean of a is less than the mean of b:


a = c(12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3)
b = c(12.0, 12.2, 11.2, 13.0, 15.0, 15.8, 12.2, 13.4, 12.9, 11.0)

t.test(a,b, paired=TRUE, alt="less")

Paired t-test

data: a and b
t = 5.2671, df = 9, p-value = 0.9997
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf 2.170325
sample estimates:
mean of the differences
1.61


With this syntax we asked R to check whether the mean of the values contained in vector a is less than the mean of the values contained in vector b. The p-value is close to 1, far above 0.05: there is no evidence in that direction, i.e., the times certainly did not get worse after the new training. Note that an improvement in running times means that the times decreased, that is, that the mean of a (before) is greater than the mean of b (after); so the alternative hypothesis relevant to our question is the opposite one.

If we write t.test(a, b, paired = TRUE, alt = "greater"), we ask R to check whether the mean of the values contained in vector a is greater than the mean of the values contained in vector b. In light of the previous result we expect a p-value much smaller than 0.05, and in fact it is (p = 0.0002579): this time we reject the null hypothesis H0 in favor of H1, and conclude that the second training brought a significant improvement.


a = c(12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3)
b = c(12.0, 12.2, 11.2, 13.0, 15.0, 15.8, 12.2, 13.4, 12.9, 11.0)

t.test(a,b, paired=TRUE, alt="greater")

Paired t-test

data: a and b
t = 5.2671, df = 9, p-value = 0.0002579
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
1.049675 Inf
sample estimates:
mean of the differences
1.61

Sunday, July 26, 2009

Two sample Student's t-test #2

Comparison of the means of two independent groups, drawn from two populations with unknown variance; the sample variances are not homogeneous.

We want to compare the heights in centimeters of two groups of individuals. Here are the measurements:

A: 175, 168, 168, 190, 156, 181, 182, 175, 174, 179
B: 120, 180, 125, 188, 130, 190, 110, 185, 112, 188


As we saw in a previous exercise, we must first check whether the variances are homogeneous (homoskedasticity) with Fisher's F-test:



a = c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)
b = c(120, 180, 125, 188, 130, 190, 110, 185, 112, 188)

var.test(b,a)

F test to compare two variances

data: b and a
F = 14.6431, num df = 9, denom df = 9, p-value = 0.0004636
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
3.637133 58.952936
sample estimates:
ratio of variances
14.64308


We obtained a p-value less than 0.05, so the two variances are not homogeneous. Indeed, we can compare the computed value of F with the tabulated value of F for alpha = 0.05, with 9 degrees of freedom in the numerator and 9 in the denominator, using the function qf(p, df.num, df.den):


qf(0.95, 9, 9)
[1] 3.178893


F-computed is greater than F-tabulated, so we can reject the null hypothesis H0 of homogeneity of variances.

To compare the two groups, we call the function t.test with non-homogeneous variances (var.equal = FALSE, which can be omitted because the function assumes non-homogeneous variances by default) and independent samples (paired = FALSE, which can also be omitted because the function works on independent samples by default):


t.test(a,b, var.equal=FALSE, paired=FALSE)

Welch Two Sample t-test

data: a and b
t = 1.8827, df = 10.224, p-value = 0.08848
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.95955 47.95955
sample estimates:
mean of x mean of y
174.8 152.8


As the heading shows, R performed a two-sample t-test with the degrees of freedom computed by the Welch-Satterthwaite formula (the result is df = 10.224), which is used when the variances are not homogeneous. The Welch-Satterthwaite equation is also called the Dixon-Massey formula when, as here, only two groups are compared.
We obtained a p-value greater than 0.05, so we conclude that the means of the two groups are not significantly different (although the p-value is fairly close to the 0.05 threshold). Indeed, the value of t is less than the tabulated t value for 10.224 degrees of freedom, which in R we can calculate:


qt(0.975, 10.224)
[1] 2.221539


We can accept the hypothesis H0 of equality of means.
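To see where the fractional degrees of freedom come from, this sketch (my addition) applies the Dixon-Massey formula reported below to our two samples:

se2.a = var(a) / length(a) # squared standard error of the mean of a
se2.b = var(b) / length(b) # squared standard error of the mean of b
round((mean(a) - mean(b)) / sqrt(se2.a + se2.b), 4) # the Welch t statistic
[1] 1.8827
round((se2.a + se2.b)^2 / (se2.a^2/(length(a)-1) + se2.b^2/(length(b)-1)), 3) # the df
[1] 10.224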


Welch-Satterthwaite formula:

$$df=\frac{\sum \mathrm{deviance}(X)}{\sum df(X)}=\frac{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2}{\sum_{i=1}^{k}(n_i-1)}$$


Dixon-Massey formula:

$$df=\frac{\left(\frac{\displaystyle S_1^2}{\displaystyle n_1}+\frac{\displaystyle S_2^2}{\displaystyle n_2}\right)^2}{\frac{\displaystyle\left(\frac{\displaystyle S_1^2}{\displaystyle n_1}\right)^2}{\displaystyle n_1-1}+\frac{\displaystyle\left(\frac{\displaystyle S_2^2}{\displaystyle n_2}\right)^2}{\displaystyle n_2-1}}$$

Saturday, July 25, 2009

Two sample Student's t-test #1

t-Test to compare the means of two groups, under the assumption that both samples are random and independent and come from normally distributed populations with unknown but equal variances

Here I will use the same data as in the previous post. The data are given below:

A: 175, 168, 168, 190, 156, 181, 182, 175, 174, 179
B: 185, 169, 173, 173, 188, 186, 175, 174, 179, 180


To solve this problem we must use a two-sample Student's t-test, assuming that the two samples are taken from populations that follow a Gaussian distribution (if we cannot assume that, we must solve this problem with the non-parametric test called the Wilcoxon-Mann-Whitney test, which we will see in a future post). Before proceeding with the t-test, it is necessary to compare the sample variances of the two groups, using Fisher's F-test to verify the homoskedasticity (homogeneity of variances). In R you can do this as follows:


a = c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)
b = c(185, 169, 173, 173, 188, 186, 175, 174, 179, 180)

var.test(a,b)

F test to compare two variances

data: a and b
F = 2.1028, num df = 9, denom df = 9, p-value = 0.2834
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.5223017 8.4657950
sample estimates:
ratio of variances
2.102784


We obtained a p-value greater than 0.05, so we can assume that the two variances are homogeneous. Indeed, we can compare the computed value of F with the tabulated value of F for alpha = 0.05, with 9 degrees of freedom in the numerator and 9 in the denominator, using the function qf(p, df.num, df.den):


qf(0.95, 9, 9)
[1] 3.178893


Note that the computed value of F is less than the tabulated value of F, which leads us to accept the null hypothesis of homogeneity of variances.
NOTE: here we look only at the upper tail of the F distribution, so with a significance level of 5% we use p = 0.95. Conversely, the t-test is two-tailed, and in R's function qt(p, df) we use p = 0.975 when testing a two-tailed alternative hypothesis.

Then we call the function t.test with homogeneous variances (var.equal = TRUE) and independent samples (paired = FALSE: this can be omitted, because the function works on independent samples by default):


t.test(a,b, var.equal=TRUE, paired=FALSE)

Two Sample t-test

data: a and b
t = -0.9474, df = 18, p-value = 0.356
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-10.93994 4.13994
sample estimates:
mean of x mean of y
174.8 178.2


We obtained a p-value greater than 0.05, so we can conclude that the means of the two groups are not significantly different. Indeed, the absolute value of the computed t is less than the tabulated t value for 18 degrees of freedom, which in R we can calculate:


qt(0.975, 18)
[1] 2.100922


This confirms that we can accept the null hypothesis H0 of equality of the means.
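For completeness, a small sketch (my addition) that recomputes the t statistic from the pooled variance, which is the formula used when var.equal = TRUE:

sp2 = ((length(a)-1)*var(a) + (length(b)-1)*var(b)) / (length(a) + length(b) - 2) # pooled variance
round((mean(a) - mean(b)) / sqrt(sp2 * (1/length(a) + 1/length(b))), 4) # pooled t statistic
[1] -0.9474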

Friday, July 24, 2009

One sample Student's t-test

Comparison of the sample mean with a known value, when the variance of the population is not known.

Consider the exercise we have just seen before.
An intelligence test was administered to 10 subjects, with the results shown below. The mean score of the population that took the same test is 75. We want to check whether the sample mean is significantly similar (at a 95% confidence level) to the population mean, assuming that the population variance is not known.
65, 78, 88, 55, 48, 95, 66, 57, 79, 81


Unlike the one-sample Z-test, the one-sample Student's t-test has a ready-made function in R that we can apply immediately.
It is t.test(a, mu), applied below.


a = c(65, 78, 88, 55, 48, 95, 66, 57, 79, 81)

t.test (a, mu=75)

One Sample t-test

data: a
t = -0.783, df = 9, p-value = 0.4537
alternative hypothesis: true mean is not equal to 75
95 percent confidence interval:
60.22187 82.17813
sample estimates:
mean of x
71.2


The one-sample t.test function outputs the computed value of t, the degrees of freedom, the confidence interval, and the sample mean (mean of x).
To make the statistical decision we can proceed in two ways. First, we can compare the computed value of t with the tabulated Student's t with 9 degrees of freedom. If we do not have tables, we can compute the tabulated t value in the following way:


qt(0.975, 9)
[1] 2.262157



The function qt(p, df) returns the t quantile for the given probability and degrees of freedom (we chose a 95% confidence level, which means each tail holds 2.5%, corresponding to p = 1 - 0.025 = 0.975). Comparing the two values, the absolute value of the computed t is smaller than the tabulated t, which means that we accept the null hypothesis of equality of the means: our sample mean is significantly similar to the population mean.

Alternatively, we can consider the p-value. With a 95% confidence level, remember this rule: if the p-value is greater than 0.05 we accept the null hypothesis H0; if the p-value is less than 0.05 we reject H0 in favor of the alternative hypothesis H1.
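If you want to see the computation behind the one-sample t.test, this quick sketch (my addition) rebuilds the t statistic from its definition, t = (sample mean - mu) / (sd / sqrt(n)), using the vector a defined above:

round((mean(a) - 75) / (sd(a) / sqrt(length(a))), 4) # matches the t value above
[1] -0.783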

Thursday, July 23, 2009

Two sample Z-test

Comparison of the means of two independent groups of samples, taken from two populations with known variance.

We are asked to compare the average heights of two groups. The first group (A) consists of individuals of Italian nationality (the variance of the Italian population is 5); the second group (B) consists of individuals of German nationality (the variance of the German population is 8.5). The data are given below:

A: 175, 168, 168, 190, 156, 181, 182, 175, 174, 179
B: 185, 169, 173, 173, 188, 186, 175, 174, 179, 180


Since we know the population variances, we must proceed with a two-sample Z-test. In this case too there is no ready-made function in R, but we can easily create one ourselves.

$$Z=\frac{(\overline{x}_1-\overline{x}_2)-(\mu_1-\mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}}$$


z.test2sam = function(a, b, var.a, var.b){
n.a = length(a) # size of the first sample
n.b = length(b) # size of the second sample
zeta = (mean(a) - mean(b)) / (sqrt(var.a/n.a + var.b/n.b)) # the Z statistic
return(zeta)
}


The function z.test2sam returns the value of zeta, taking as input two vectors (a and b), the variance of the first population (var.a), and the variance of the second population (var.b).
Using this function we obtain:


a = c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)
b = c(185, 169, 173, 173, 188, 186, 175, 174, 179, 180)

z.test2sam(a, b, 5, 8.5)
[1] -2.926254


The absolute value of zeta is greater than the tabulated critical value for alpha = 0.05 (z-tabulated = 1.96 for a two-tailed test): we therefore reject the null hypothesis in favor of the alternative hypothesis, and conclude that the two means are significantly different.
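Instead of remembering 1.96 from the tables, we can ask R for the critical value directly; a one-line check (my addition):

qnorm(0.975) # two-tailed critical z value for alpha = 0.05
[1] 1.959964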

Wednesday, July 22, 2009

One sample Z-test

Comparison of the sample mean with a known population mean and standard deviation.

Suppose that 10 volunteers have taken an intelligence test; here are the results. The mean score of the entire population on the same test is 75. We want to check whether there is a statistically significant difference (with a significance level of 95%) between the sample mean and the population mean, assuming that the population variance is known and equal to 18.
65, 78, 88, 55, 48, 95, 66, 57, 79, 81


To solve this problem we need a one-sample Z-test. R has no built-in function for it, so we create our own.
Recalling the formula for the value of z, we write this function:

$$Z=\frac{\overline{x}-\mu_0}{\frac{\sigma}{\sqrt{n}}}$$


z.test = function(a, mu, var){
zeta = (mean(a) - mu) / (sqrt(var / length(a))) # (sample mean - mu) divided by the standard error
return(zeta)
}


We have thus built the function z.test: it takes as input a vector of values (a), the population mean to compare against (mu), and the population variance (var); it returns the value of zeta. Now we apply the function to our problem.


a = c(65, 78, 88, 55, 48, 95, 66, 57, 79, 81)

z.test(a, 75, 18)
[1] -2.832353


The value of zeta is -2.83, whose absolute value is greater than the critical value Zcv = 1.96 for alpha = 0.05 (two-tailed test). We therefore conclude that the mean of our sample is significantly different from the mean of the population.
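As with the other tests, we can also convert this z value into a two-sided p-value; a quick sketch (my addition), reusing the z.test function defined above:

z = z.test(a, 75, 18)
round(2 * pnorm(-abs(z)), 4) # two-sided p-value
[1] 0.0046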

Tuesday, July 21, 2009

Geometric and harmonic means in R

Compute the geometric mean and the harmonic mean of this sequence in R.

10, 2, 19, 24, 6, 23, 47, 24, 54, 77


These functions are not present in the standard R distribution, although they are readily available in some add-on packages.
However, it is easy to compute these values simply by recalling the mathematical formulas and applying them in R.

$$H = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}} = \frac{n}{\sum_{i=1}^n \frac{1}{x_i}}, \qquad x_i > 0 \text{ for all } i.$$

In R language:


a = c(10, 2, 19, 24, 6, 23, 47, 24, 54, 77)

1/mean(1/a) #compute the harmonic mean
[1] 10.01109


$$\overline{a}_{geom}=\bigg(\prod_{i=1}^n a_i \bigg)^{1/n} = \sqrt[n]{a_1 \cdot a_2 \cdots a_n}$$

In R language


a = c(10, 2, 19, 24, 6, 23, 47, 24, 54, 77)

n = length(a) #now n is equal to the number of elements in a

prod(a)^(1/n) #compute the geometric mean
[1] 18.92809
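With long vectors, prod(a) can overflow; an equivalent and numerically safer way (my addition) computes the geometric mean as the exponential of the mean of the logarithms:

exp(mean(log(a))) # geometric mean via logarithms
[1] 18.92809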

Monday, July 20, 2009

Probability exercise: negative binomial distribution

What is the probability of getting the 4th tail before the 3rd head when flipping a coin?

This exercise follows a negative binomial distribution; the mathematical formula for solving it is:

$$f(x)=P(X=x)=\begin{pmatrix} x+y-1\\ y-1 \end{pmatrix} \cdot p^y \cdot (1-p)^x$$

To solve a problem like this we do not need to fix the total number of trials in advance: it is enough to know the success probability of each individual flip. In R we can use the function dnbinom(x, y, p), which calculates the probability that exactly x failures occur before the y-th success in a sequence of Bernoulli trials with individual success probability p. Here we take heads as the success, so y = 3 heads and x = 4 tails:


dnbinom(4,3,0.5)
[1] 0.1171875


The probability of obtaining the fourth tail before the third head (that is, after exactly two heads) is 11.72%.
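The same number can be obtained directly from the formula above, with y = 3 heads (successes) and x = 4 tails (failures); a one-line check (my addition):

choose(4 + 3 - 1, 3 - 1) * 0.5^3 * 0.5^4 # binomial coefficient times the two probabilities
[1] 0.1171875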

It may seem a strange result, but to convince ourselves of the accuracy of this R function, let us consider another problem: what is the probability of getting the 1st, the 2nd, ..., the 25th head before the second tail?
We can draw a bar plot in R that shows the required probabilities, in this manner:


barplot(dnbinom(1:25,2,0.5), col="grey", names.arg=1:25)


Observe that the probability of getting exactly one head before the second tail is 0.25: the two admissible sequences are THT and HTT, each with probability 0.5^3 = 0.125. As the number of heads increases, the probability falls more and more.

Sunday, July 19, 2009

A probability exercise on the Bernoulli distribution

What is the probability, flipping a coin 8 times, of obtaining a sequence containing exactly 5 tails, such as HHTTTHTT? (H = head; T = tail)

The theory teaches us that to solve this question, we can simply use the following formula:

$$f(x)=P(X=x)=B(n,p)=\begin{pmatrix}n\\ x \end{pmatrix} \cdot p^x \cdot q^{n-x}=\frac{n!}{x!(n-x)!} \cdot p^x \cdot q^{n-x}$$

To solve a problem like this, we can use the R function dbinom(x, n, p). Coin flipping follows a binomial distribution, in which every event can be H or T. Take T as the success, so x is the number of successes (in this case x = 5), while n is the number of independent flips (in this case n = 8). The probability of success is p = 0.5. Putting these data into R, we get the answer:


dbinom(5, 8, 0.5)
[1] 0.21875


The probability of obtaining exactly 5 tails in 8 flips, in any order, is 21.875%. (The probability of the one specific sequence HHTTTHTT, by contrast, is 0.5^8, about 0.39%.)
What probability would we have obtained if we had chosen H as the success (i.e., x = 3)?
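Since p = 0.5 the binomial distribution is symmetric, so the answer is the same; we can check it (my addition):

dbinom(3, 8, 0.5) # 3 heads out of 8 flips, i.e. again 5 tails
[1] 0.21875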

Let us practice with some functions of R

Given the following data set, compute the arithmetic mean, median, variance, and standard deviation; find the largest and the smallest value, the sum of all values, the square of the sum of all values, and the sum of the squares of all values; assign ranks to the values and sum them; then sum the ranks of the first 6 values only.

10, 2, 19, 24, 6, 23, 47, 24, 54, 77


This exercise is extremely simple to perform in R, and is deliberately designed to build familiarity with some statistical functions that will serve us in the future. I will not repeat the mathematical formulas, which you can find in any mathematics book, or on Wikipedia.


a = c(10, 2, 19, 24, 6, 23, 47, 24, 54, 77)

mean(a) #calculate the arithmetic mean
[1] 28.6

median(a) #calculate the median
[1] 23.5

var(a) #calculate the variance
[1] 561.8222

sd(a) #calculate the standard deviation
[1] 23.70279

max(a) #shows the largest value of the sequence
[1] 77

min(a) #shows the smallest value of the sequence
[1] 2

sum(a) #sum all the values
[1] 286

sum(a)*sum(a) #calculates the square of the sum of all values
[1] 81796

sum(a*a) #calculates the sum of the squares of all values
[1] 13236

sum(rank(a)) #sum of ranks assigned to the variables contained in a
[1] 55

sum(rank(a)[1:6]) #sum of ranks assigned to the first 6 values of a
[1] 21.5

Exercise Index

Chapter 0: Basic Exercises with R and probability

Chapter 1: Statistical inference: statistical hypothesis testing

Chapter 2: Study of correlation and regression