Thursday, July 30, 2009

Analysis of variance: ANOVA, for multiple comparisons

Analysis of variance: ANOVA, for multiple comparisons

The ANOVA model can be used to compare the mean of several groups with each other, using a parametric method (assuming that the groups follow a Gaussian distribution).
Proceed with the following example:

The manager of a supermarket chain wants to see if the consumption in kilowatts of 4 stores between them are equal. He collects data at the end of each month for 6 months. The results are:
Store A: 65, 48, 66, 75, 70, 55
Store B: 64, 44, 70, 70, 68, 59
Store C: 60, 50, 65, 69, 69, 57
Store D: 62, 46, 68, 72, 67, 56


To proceed with the verification ANOVA, we must first verify the homoskedasticity (ie test for homogeneity of variances). The software R provides two tests: the Bartlett test, and the Fligner-Killeen test.




We begin with the Bartlett test.

First we create the 4 vectors:


a = c(65, 48, 66, 75, 70, 55)
b = c(64, 44, 70, 70, 68, 59)
c = c(60, 50, 65, 69, 69, 57)
d = c(62, 46, 68, 72, 67, 56)


Now we combine the 4 vectors in a single vector:


dati = c(a, b, c, d)


Now, on this vector in which are stored all the data, we create the 4 levels:


groups = factor(rep(letters[1:4], each = 6))


We can observe the contents of the vector groups simply by typing groups + [enter].

At this point we start the Bartlett test:


bartlett.test(dati, groups)

Bartlett test of homogeneity of variances

data: dati and groups
Bartlett's K-squared = 0.4822, df = 3, p-value = 0.9228


The function gave us the value of the statistical tests (K squared), and the p-value. Can be argued that the variances are homogeneous since p-value > 0.05. Alternatively, we can compare the Bartlett's K-squared with the value of chi-square-tables; we compute that value, assigning the value of alpha and degrees of freedom at the qchisq function:


qchisq(0.950, 3)
[1] 7.814728


Chi-squared > Bartlett's K-squared: we accept the null hypothesis H0 (variances homogeneity)




We try now to check the homoskedasticity, with the Fligner-Killeen test.
The syntax is quite similar, and then proceed quickly.


a = c(65, 48, 66, 75, 70, 55)
b = c(64, 44, 70, 70, 68, 59)
c = c(60, 50, 65, 69, 69, 57)
d = c(62, 46, 68, 72, 67, 56)

dati = c(a, b, c, d)

groups = factor(rep(letters[1:4], each = 6))

fligner.test(dati, groups)

Fligner-Killeen test of homogeneity of variances

data: dati and groups
Fligner-Killeen:med chi-squared = 0.1316, df = 3, p-value = 0.9878


The conclusions are similar to those for the test of Bartlett.




Having verified the homoskedasticity of the 4 groups, we can proceed with the ANOVA model.

First organize the values, fitting the model:


fit = lm(formula = dati ~ groups)


Then we analyze the ANOVA model:


anova (fit)

Analysis of Variance Table

Response: dati
Df Sum Sq Mean Sq F value Pr(>F)
groups 3 8.46 2.82 0.0327 0.9918
Residuals 20 1726.50 86.33


The output of the function is a classical ANOVA table with the following data:
Df = degree of freedom
Sum Sq = deviance (within groups, and residual)
Mean Sq = variance (within groups, and residual)
F value = the value of the Fisher statistic test, so computed (variance within groups) / (variance residual)
Pr(>F) = p-value

Since p-value > 0.05, we accept the null hypothesis H0: the four means are statistically equal. We can also compare the computed F-value with the tabulated F-value:

qf(0.950, 20, 3)
[1] 8.66019


Tabulated F-value > computed F-value: we accept the null hyptohesis.

12 comments:

  1. Sir, I wanted to compute Fisher's least significant difference (LSD) method
    I used the following code to read the table

    library(RODBC)
    > library(agricolae)
    > channel <- odbcConnectExcel("G:/oneway.xls")
    > mydata <- sqlFetch(channel,"Sheet1")
    > library(reshape)
    > mydata.m <- melt(mydata, id=c("Treatment"))
    > fit <- aov(value~Treatment, data=mydata.m)
    > summary(fit)
    Df Sum Sq Mean Sq F value Pr(>F)
    Treatment 9 23494.8 2610.54 1589.5 < 2.2e-16 ***
    Residuals 23 37.8 1.64
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    then I used the following code as given in the following sites

    http://cran.r-project.org/web/packages/agricolae/index.html
    http://tarwi.lamolina.edu.pe/~fmendiburu/

    library(agricolae)
    LSD.test()
    comparison <- LSD.test(Treatment,Residuals,df,MSerror,group=F)

    but I could not proceed any further,


    How can I read ANOVA data from an Excel table and find out the LSD and graph it as done in Agricolae. This information will help me analyse field experimental data, thanks, Samuel, Bangalore, India

    ReplyDelete
  2. Hi Samuel,
    why you want to read the ANOVA table from Excel, while you have all data in R, to perform the LSD.test?

    With your data, you can proceed as follow:

    df<-df.residual(fit)
    MSerror<-deviance(fit)/df
    library(agricolae)
    comparison <- LSD.test(value, Treatment, df, MSerror, group=FALSE)
    comparison
    bar.err(comparison,std=TRUE,ylim=c(0,45),density=4,border="blue")


    So you can perform your post-hoc test.

    ReplyDelete
  3. Sorry, before the comparison, you have to attach your data:

    attach(mydata.m)

    I hope this help you.

    ReplyDelete
  4. Sir, it works so nicely, thanks, the reason I wanted to use is this, I wanted to do my data entry in MS excel as I wanted multiple columns, each column as a replication.
    Treatment Rep1 Rep2 Rep3
    Tridemorph (0.1%) 33.89 32.45 35.67

    Sir I have another request, in the graph I would be happy if each bar has the treatment name as its label, now I am seeing only 3 labels,
    thanking you very much, yours sincerely,
    Dr.D.K.Samuel, Plant Virologist, Ind Inst of Horticultural Research, Bangalore - 89, India

    ReplyDelete
  5. Hi Samuel,
    A question: are you performing a comparison between Rep1, Rep2, Rep3; or are you performing a comparison between Treatments, considering only one Rep?
    Because you cannot perform an ANOVA considering replication AND treatment simultaneously: it is a "repeated measures ANOVA", which in R has another syntax.

    However the bar.err() function plots the name of the factor you are considering in default.

    (maybe, if you want, send at my e-mail your Excel file with all data, so I can have a better idea of your problem)

    ReplyDelete
  6. Hi Sir ,

    I have the same test dataset organized in two different formats. I tried to do single factor ANOVA on both the formats which resulted two different results for P value. Can you please help me to know where the problem is? I am afraid I did something wrong in declaring the factor variable.

    Dataset 1

    110 17 11 10
    13 12 7 12
    14 18 9 8
    13 13 13 7

    Dataset 2
    10 1
    13 1
    14 1
    13 1
    17 2
    12 2
    18 2
    13 2
    11 3
    7 3
    9 3
    13 3
    10 4
    12 4
    8 4
    7 4

    I followed this procedure for Dataset 1 in R following the blog

    > test1 <- read.table("c://anova/dataset1.txt")
    > names(test1) = c ("low" , "mod","high","veryhigh")
    > attach(test1)
    > merge <- c(low , mod , high , veryhigh)
    > groups <- factor(rep(letters[1:4], each = 4))
    > fit <- lm(formula = merge ~ groups)
    > anova(fit)

    Analysis of Variance Table

    Response: merge
    Df Sum Sq Mean Sq F value Pr(>F)
    groups 3 2119.2 706.4 1.199 0.3518
    Residuals 12 7069.8 589.1

    I followed this procedure for Dataset 2 in R follwing (http://www.agr.kuleuven.ac.be/vakken/statisticsbyR/ANOVAbyRr/multiplecomp.htm)

    > test2 <- read.table ("c://anova/dataset2.txt")
    > names(test2) = c("data", "Brand")
    > attach(test2)
    > Brand <- factor(Brand)
    > fit <- lm(formula = data ~ Brand)
    > anova(fit)

    Analysis of Variance Table

    Response: data
    Df Sum Sq Mean Sq F value Pr(>F)
    Brand 3 81.688 27.229 4.6846 0.02176 *
    Residuals 12 69.750 5.812
    ---
    This will help me to do ANOVA in the orginal dataset for my analysis.

    Many Thanks ,

    Rodney , Australia
    Rodney

    ReplyDelete
  7. Hi Rodney,
    the two methods for declaring groups are totally equivalent, and you write the right code.
    There is only an error:

    Group 1 of dataset 1 is:
    110, 13, 14, 13
    Group 1 of dataset 2 is:
    10, 13, 14, 13

    Of course the two ANOVA tables are different (different variables). If you correct the first variable, you will obtain the same ANOVA table.

    :)

    ReplyDelete
  8. Hi Sir ,

    It was really a silly mistake. Thank you very much for pointing it out.

    Best Regards,

    Rodney

    ReplyDelete
  9. I have a doubt: I´ve run an anova of 2 factors with only one observation per cell. In this case one can´t test the homogeneity of variances due to lack of observations in each group.However, if you run a bartlett test, you get a result. What does it do in this case? I can´t understand

    ReplyDelete
  10. Hi Sir
    Would it be possible to run the Flingner-Killeen test on the data from the four stores if one of the managers had forgotten to report their consumption of kilowatts one month? what would be the r-code for the groups be then?
    Best regards

    ReplyDelete
  11. Thank you for this nice tutorial. Another good ressources on ANOVA can be found at http://www.sthda.com/english/wiki/one-way-anova-test-in-r

    ReplyDelete