Friday, August 7, 2009

Two-way analysis of variance: two-way ANOVA in R

The one-way analysis of variance is a useful technique to verify if the means of more groups are equals. But this analysis may not be very useful for more complex problems. For example, it may be necessary to take into account two factors of variability to determine if the averages between the groups depend on the group classification ( "zone") or the second variable that is to consider ("block"). In this case should be used the two-way analysis of variance (two-way ANOVA).

We begin immediately with an example so as to facilitate the understanding of this statistical method. The data collected are organized into double entry tables.

The director of a company has collected revenue (thousand dollars) for 5 years and under per month. You want to see if the revenue depends on the year and/or month, or if they are independent of these two factors.

Conceptually, the problem may be solved by an horizontal ANOVA and a vertical ANOVA, in order to verify if the average revenues per year are the same, and if they are equal to the average revenue computed by month. This would require many calculations, and so we prefer to use the two-way ANOVA, which provides the result instantly.
This is the table of revenue classified by year and month:

$$\begin{tabular}{|c||ccccc||r|}\hline Months & Year 1 & Year 2 & Year 3 & Year 4 & Year 5\\\hline January&15&18&22&23&24\\ February&22&25&15&15&14\\ March&18&22&15&19&21\\ April&23&15&14&17&18\\ May&23&15&26&18&14\\ June&12&15&11&10&8\\ July&26&12&23&15&18\\ August&19&17&15&20&10\\ September&15&14&18&19&20\\ October&14&18&10&12&23\\ November&14&22&19&17&11\\ December&21&23&11&18&14\\ \hline \end{tabular}$$

As with the one-way ANOVA, even here the aim is to structure a Fisher's F-test to assess the significance of the variable "month" and of the variable "year", determine if the revenues are dependent on one (or both) the criteria for classification.
How to perform the two-way ANOVA in R? First creates an array containing all the values tabulated, transcribed by rows:


revenue = c(15,18,22,23,24, 22,25,15,15,14, 18,22,15,19,21,
23,15,14,17,18, 23,15,26,18,14, 12,15,11,10,8, 26,12,23,15,18,
19,17,15,20,10, 15,14,18,19,20, 14,18,10,12,23, 14,22,19,17,11,
21,23,11,18,14)


According to the months, you create a factor of 12 levels (the number of rows) with 5 repetitions (the number columns) in this manner:


months = gl(12,5)


According to the years you create a factor with 5 levels (the number of column) and 1 recurrence, declaring the total number of observations (the length of the vector revenue):


years = gl(5, 1, length(entrate))


Now you can fit the linear model and produce the ANOVA table:


fit = aov(revenue ~ months + years)

anova(fit)

Analysis of Variance Table

Response: revenue
Df Sum Sq Mean Sq F value Pr(>F)
months 11 308.45 28.04 1.4998 0.1660
years 4 44.17 11.04 0.5906 0.6712
Residuals 44 822.63 18.70


Now interpret the results.
The significance of the difference between months is: F = 1.4998. This value is lower than the value tabulated and indeed p-value > 0.05. So we accept the null hypothesis: the means of revenue evaluated according to the months are equal, then the variable "months" has no effect on revenue.

The significance of the difference between years is: F = 0.5906. This value is lower than the value tabulated and indeed p-value > 0.05. So we accept the null hypothesis: the means of revenue evaluated according to the years are equal, then the variable "years" has no effect on revenue.

2 comments:

  1. Hello,

    I'm trying to calculate the proportion of the total variance explained by each factor in a two-way ANOVA (where in my case each factor has a significant effect and I'm not considering an interaction term).

    From what I've read, this is calculated using the following formula (formulated according to the variables in your example above):

    e.g. for 'months',
    estimated variance=(mean squares for months - residual mean squares)/(no. levels of 'years' x no. replicates in each combination of 'years' and 'months')

    Then you take this value as a proportion of the total variance:

    'proportion of total variance explained by 'months' = est. variance for 'months' /(est. variance for 'months' + est. variance for 'years' + residual variance)

    where the residual variance is equal to the residual mean squares.

    Assuming this is correct (I could be confused!) I can calculate it by hand easily enough. But what I want to know is:

    is there a command in R to calculate directly the estimated variance component of each factor in the model? Or to include this value in the anova table? Or is this method considered too old-fashioned?

    thank you!

    ReplyDelete
  2. Hi Antechinus,
    probably you're right when you say that including that data in an ANOVA table is "old-fashioned" :)

    As far as I know, there aren't functions in R that print the value you're looking for. However, as you know, there are A LOT of packages in R, so probably some of these contains a more exhaustive function, but at this moment I don't remember anything like your request, sorry :)

    ReplyDelete