## Thursday, July 30, 2009

### Analysis of variance: ANOVA, for multiple comparisons

Analysis of variance: ANOVA, for multiple comparisons

The ANOVA model can be used to compare the mean of several groups with each other, using a parametric method (assuming that the groups follow a Gaussian distribution).
Proceed with the following example:

The manager of a supermarket chain wants to see if the consumption in kilowatts of 4 stores between them are equal. He collects data at the end of each month for 6 months. The results are:
Store A: 65, 48, 66, 75, 70, 55
Store B: 64, 44, 70, 70, 68, 59
Store C: 60, 50, 65, 69, 69, 57
Store D: 62, 46, 68, 72, 67, 56

To proceed with the verification ANOVA, we must first verify the homoskedasticity (ie test for homogeneity of variances). The software R provides two tests: the Bartlett test, and the Fligner-Killeen test.

We begin with the Bartlett test.

First we create the 4 vectors:

`a = c(65, 48, 66, 75, 70, 55)b = c(64, 44, 70, 70, 68, 59)c = c(60, 50, 65, 69, 69, 57)d = c(62, 46, 68, 72, 67, 56)`

Now we combine the 4 vectors in a single vector:

`dati = c(a, b, c, d)`

Now, on this vector in which are stored all the data, we create the 4 levels:

`groups = factor(rep(letters[1:4], each = 6))`

We can observe the contents of the vector `groups` simply by typing `groups + [enter]`.

At this point we start the Bartlett test:

`bartlett.test(dati, groups)        Bartlett test of homogeneity of variancesdata:  dati and groups Bartlett's K-squared = 0.4822, df = 3, p-value = 0.9228`

The function gave us the value of the statistical tests (K squared), and the p-value. Can be argued that the variances are homogeneous since p-value > 0.05. Alternatively, we can compare the Bartlett's K-squared with the value of chi-square-tables; we compute that value, assigning the value of alpha and degrees of freedom at the `qchisq` function:

`qchisq(0.950, 3) 7.814728`

Chi-squared > Bartlett's K-squared: we accept the null hypothesis H0 (variances homogeneity)

We try now to check the homoskedasticity, with the Fligner-Killeen test.
The syntax is quite similar, and then proceed quickly.

`a = c(65, 48, 66, 75, 70, 55)b = c(64, 44, 70, 70, 68, 59)c = c(60, 50, 65, 69, 69, 57)d = c(62, 46, 68, 72, 67, 56)dati = c(a, b, c, d)groups = factor(rep(letters[1:4], each = 6))fligner.test(dati, groups)        Fligner-Killeen test of homogeneity of variancesdata:  dati and groups Fligner-Killeen:med chi-squared = 0.1316, df = 3, p-value = 0.9878`

The conclusions are similar to those for the test of Bartlett.

Having verified the homoskedasticity of the 4 groups, we can proceed with the ANOVA model.

First organize the values, fitting the model:

`fit = lm(formula = dati ~ groups)`

Then we analyze the ANOVA model:

`anova (fit)Analysis of Variance TableResponse: dati          Df  Sum Sq Mean Sq F value Pr(>F)groups     3    8.46    2.82  0.0327 0.9918Residuals 20 1726.50   86.33  `

The output of the function is a classical ANOVA table with the following data:
`Df` = degree of freedom
`Sum Sq` = deviance (within groups, and residual)
`Mean Sq` = variance (within groups, and residual)
`F value` = the value of the Fisher statistic test, so computed (variance within groups) / (variance residual)
`Pr(>F)` = p-value

Since p-value > 0.05, we accept the null hypothesis H0: the four means are statistically equal. We can also compare the computed F-value with the tabulated F-value:

`qf(0.950, 20, 3) 8.66019`

Tabulated F-value > computed F-value: we accept the null hyptohesis.

1. Sir, I wanted to compute Fisher's least significant difference (LSD) method
I used the following code to read the table

library(RODBC)
> library(agricolae)
> channel <- odbcConnectExcel("G:/oneway.xls")
> mydata <- sqlFetch(channel,"Sheet1")
> library(reshape)
> mydata.m <- melt(mydata, id=c("Treatment"))
> fit <- aov(value~Treatment, data=mydata.m)
> summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
Treatment 9 23494.8 2610.54 1589.5 < 2.2e-16 ***
Residuals 23 37.8 1.64
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

then I used the following code as given in the following sites

http://cran.r-project.org/web/packages/agricolae/index.html
http://tarwi.lamolina.edu.pe/~fmendiburu/

library(agricolae)
LSD.test()
comparison <- LSD.test(Treatment,Residuals,df,MSerror,group=F)

but I could not proceed any further,

How can I read ANOVA data from an Excel table and find out the LSD and graph it as done in Agricolae. This information will help me analyse field experimental data, thanks, Samuel, Bangalore, India

2. Hi Samuel,
why you want to read the ANOVA table from Excel, while you have all data in R, to perform the LSD.test?

With your data, you can proceed as follow:

df<-df.residual(fit)
MSerror<-deviance(fit)/df
library(agricolae)
comparison <- LSD.test(value, Treatment, df, MSerror, group=FALSE)
comparison
bar.err(comparison,std=TRUE,ylim=c(0,45),density=4,border="blue")

So you can perform your post-hoc test.

3. Sorry, before the comparison, you have to attach your data:

attach(mydata.m)

4. Sir, it works so nicely, thanks, the reason I wanted to use is this, I wanted to do my data entry in MS excel as I wanted multiple columns, each column as a replication.
Treatment Rep1 Rep2 Rep3
Tridemorph (0.1%) 33.89 32.45 35.67

Sir I have another request, in the graph I would be happy if each bar has the treatment name as its label, now I am seeing only 3 labels,
thanking you very much, yours sincerely,
Dr.D.K.Samuel, Plant Virologist, Ind Inst of Horticultural Research, Bangalore - 89, India

5. Hi Samuel,
A question: are you performing a comparison between Rep1, Rep2, Rep3; or are you performing a comparison between Treatments, considering only one Rep?
Because you cannot perform an ANOVA considering replication AND treatment simultaneously: it is a "repeated measures ANOVA", which in R has another syntax.

However the bar.err() function plots the name of the factor you are considering in default.

(maybe, if you want, send at my e-mail your Excel file with all data, so I can have a better idea of your problem)

6. Hi Sir ,

I have the same test dataset organized in two different formats. I tried to do single factor ANOVA on both the formats which resulted two different results for P value. Can you please help me to know where the problem is? I am afraid I did something wrong in declaring the factor variable.

Dataset 1

110 17 11 10
13 12 7 12
14 18 9 8
13 13 13 7

Dataset 2
10 1
13 1
14 1
13 1
17 2
12 2
18 2
13 2
11 3
7 3
9 3
13 3
10 4
12 4
8 4
7 4

I followed this procedure for Dataset 1 in R following the blog

> test1 <- read.table("c://anova/dataset1.txt")
> names(test1) = c ("low" , "mod","high","veryhigh")
> attach(test1)
> merge <- c(low , mod , high , veryhigh)
> groups <- factor(rep(letters[1:4], each = 4))
> fit <- lm(formula = merge ~ groups)
> anova(fit)

Analysis of Variance Table

Response: merge
Df Sum Sq Mean Sq F value Pr(>F)
groups 3 2119.2 706.4 1.199 0.3518
Residuals 12 7069.8 589.1

I followed this procedure for Dataset 2 in R follwing (http://www.agr.kuleuven.ac.be/vakken/statisticsbyR/ANOVAbyRr/multiplecomp.htm)

> test2 <- read.table ("c://anova/dataset2.txt")
> names(test2) = c("data", "Brand")
> attach(test2)
> Brand <- factor(Brand)
> fit <- lm(formula = data ~ Brand)
> anova(fit)

Analysis of Variance Table

Response: data
Df Sum Sq Mean Sq F value Pr(>F)
Brand 3 81.688 27.229 4.6846 0.02176 *
Residuals 12 69.750 5.812
---
This will help me to do ANOVA in the orginal dataset for my analysis.

Many Thanks ,

Rodney , Australia
Rodney

7. Hi Rodney,
the two methods for declaring groups are totally equivalent, and you write the right code.
There is only an error:

Group 1 of dataset 1 is:
110, 13, 14, 13
Group 1 of dataset 2 is:
10, 13, 14, 13

Of course the two ANOVA tables are different (different variables). If you correct the first variable, you will obtain the same ANOVA table.

:)

8. Hi Sir ,

It was really a silly mistake. Thank you very much for pointing it out.

Best Regards,

Rodney

9. thank you

10. I have a doubt: I´ve run an anova of 2 factors with only one observation per cell. In this case one can´t test the homogeneity of variances due to lack of observations in each group.However, if you run a bartlett test, you get a result. What does it do in this case? I can´t understand

11. Hi Sir
Would it be possible to run the Flingner-Killeen test on the data from the four stores if one of the managers had forgotten to report their consumption of kilowatts one month? what would be the r-code for the groups be then?
Best regards

12. Thank you for this nice tutorial. Another good ressources on ANOVA can be found at http://www.sthda.com/english/wiki/one-way-anova-test-in-r