Tuesday, August 25, 2009

Web-site trend analysis with data from Google Analytics

This post is a summary of my two previous posts on the trend analysis with the Cox-Stuart test and on simple linear regression. The goal that we propose is to assess the trend in the number of visits received from a site over a long time. I use Google Analytics, because this tool allows us to save the various reports in Excel CSV format. Let's see, step by step, how to save the reportage, and then how to import data from Excel to R, and finally how to estimate if the number of daily visitors follows an increasing or decreasing trend.

Let's start by creating an ad hoc report in Google Analytics. Once you have logged in, select the date range that we want to analyze. Then click onVisits.

At this point we can save the report, clicking on Export and then clicking on CSV for Excel.

Save the CSV file, and open it with Excel. Here's how it seems:

Now import the data into R. Import data from Excel to R is very simple. Simply select the column (or columns) of our interest (in our case the column Visits) and copy in the clipboard with CTRL + C (remember to select the cell Visits, because it will be useful):

Then open R and type the following command:

myvisit <- read.delim("clipboard")myvisit   Visits1      332      413      344      455      466      377      318      379      3410     3411     4812     3913     33...

It is a one column dataframe; the name of the column is Visits (so it is importat to select the header from Excel).

Now we can proceed with the analysis of trends in the two proposed ways: through a Cox-Stuart test e through the analysis of the simple linear regression.

The function to perform the Cox-Stuart test is available here. First we must convert the dataframe in a format that can be read by the function cox.stuart.test, like this:

visits <- c(myvisit$Visits) I have created in this way, a vector (visits) that contains all data that were ordered in the column Visits of the dataframe myvisit. Now we provide a test of Cox-Stuart: cox.stuart.test(visits) Cox-Stuart test for trend analysisdata: Increasing trend, p-value = 0.0012 The output is very clear: We have detected an increasing trend of visits, highly significant (since p-value < 0.5). If we are not satisfied or sure of this result, we can take into account the slope of the regression line. Firstly may want to show the results. The vector contains the hits daily visits to the site. Now we create a sorted array of the days in question, the same length of the carrier hits: days <- c(1 : length(visits)) Create a plot: plot(days, visits, type="b") Choosing type="b" I see dots and lines, as shown in figure: From this plot is not easy to observe a possible trend of the progress of visits. We can still do a regression analysis. Evaluating the sign of the slope of the line, we can estimate whether the trend is increasing or decreasing: fit <- lm(visits ~ days)summary(fit)Call:lm(formula = visits ~ days)Residuals: Min 1Q Median 3Q Max -22.714 -6.197 -1.313 5.648 31.153 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 31.79694 2.27151 13.998 < 2e-16 ***days 0.19815 0.04242 4.671 1.04e-05 ***---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 10.81 on 90 degrees of freedomMultiple R-squared: 0.1951, Adjusted R-squared: 0.1862 F-statistic: 21.82 on 1 and 90 DF, p-value: 1.043e-05  The slope coefficient has a value of: b = 0.06251. It therefore has a positive sign, then one may think of an increasing trend. The value of the statistical t-test on the slope, and its relative p-value, indicate either that it is significant. We can therefore say that there is an increasing trend. Finally, we can see the regression line directly on the plot previously obtained in this way: plot(days, visits, type="b")abline(fit, col="red", lwd=3) The command abline allows us to add a line defined by the equation given, directly on the chart shown; the parameter "col" specifies the color and the "lwd" parameter specifies the thickness of the line. Observe now the graph: It's obvious that there is an increasing trend, as said by the Cox-Stuart test. Thursday, August 20, 2009 Simple logistic regression on qualitative dichotomic variables In this post we will see briefly how to implement a logistic regression model if you have categorical variables, or qualitative, organized in double entry contingency tables. In this model the dependent variable (Y) and independent variable (X) are both dichotomies (or Bernoullian). In general, the probability that Y = 1 as a function of predictors is: $$P(Y=1|X=x)=\pi(x)=\frac{exp(\beta_0+\beta_1x_1+\cdots +\beta_kx_k)}{1+exp(\beta_0+\beta_1x_1+\cdots +\beta_kx_k)}$$ Our goal is to estimate the value of the beta parameters (regressors). We begin to examine a model of simple logistic regression (with only one predictor). Consider the following example. The table below shows the results of a study on gastroesophageal reflux. You want to evaluate how the presence of a stress factor can influence the onset of this disease. First we import the values in R. We must create a table with double entry; proceed as follows: reflux <- matrix(c(251,131,4,33), nrow=2)colnames(reflux) <- c("reflNO", "reflYES")rownames(reflux) <- c("stressNO", "stressYES")table <- as.table(reflux)table reflNO reflYESstressNO 251 4stressYES 131 33 Now adjust the data for the logistic regression. We must create a data frame: dft <- as.data.frame(table)dft Var1 Var2 Freq1 stressNO reflNO 2512 stressYES reflNO 1313 stressNO reflYES 44 stressYES reflYES 33 We can now fit the model, and then perform the logistic regression in R: fit <- glm(Var2 ~ Var1, weights = Freq, data = dft, family = binomial(logit))summary(fit)Call:glm(formula = Var2 ~ Var1, family = binomial(logit), data = dft, weights = Freq)Deviance Residuals: 1 2 3 4 -2.817 -7.672 5.765 10.287 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -4.1392 0.5040 -8.213 < 2e-16 ***Var1stressYES 2.7605 0.5403 5.109 3.23e-07 ***---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 250.23 on 3 degrees of freedomResidual deviance: 205.86 on 2 degrees of freedomAIC: 209.86Number of Fisher Scoring iterations: 6 First we comment the code to perform the regression. The logistic regression is called imposing the family: family = binomial(logit). The code Var2 ~ Var1 means that we want to create a model that will explain the variable var2 (presence or absence of reflux) as a function of the variable var1 (presence or absence of stressful events). In practice var2 is the independent variable Y, and Var1 is the dependent variable X (the regressors). Provided the formula to be analyzed, you specify the weight of each variable, data in column Freq of the dataframe dft (so we write weights = Freq and data = dft to specify the location where the values are contained). The values of the parameters$\beta_0$and$\beta_1$are respectively the values (intercept) and Var1stress1. We can then write our empirical model: $$\pi(x)=\frac{exp(-4.139+2.760x)}{1+exp(-4.139+2.760x)}$$ The independent variable x can be zero or one. If you assume value 0 (ie in the absence of stressful events), then the probability of having reflux is: $$\pi(x=0)=\frac{exp(\beta_0)}{1+exp(\beta_0)}=0.016=1.6\%$$ If there are stressful events (x = 1), the probability of having reflux is: $$\pi(x=1)=\frac{exp(\beta_0+\beta_1)}{1+exp(\beta_0+\beta_1)}=0.20=20\%$$ The odds are: $$odds(x=1)=\frac{\pi(1)}{1-\pi(1)}=exp(\beta_0+\beta_1)$$ $$odds(x=0)=\frac{\pi(0)}{1-\pi(0)}=exp(\beta_0)$$ We can finally calculate the odd ratio OR: $$OR=\frac{odds(x=1)}{odds(x=0)}=15.807$$ A person who has experienced a stressful event has a propensity to develop gastroesophageal reflux 15.807 times larger than the person who has not undergone stressful events. The probabilities and the odds can be readily calculated in R recalling that: fit$coefficient[1] = $\beta_0$ (intercept)
fit$coefficient[2] =$\beta_1$Furthermore: summary(fit)$coefficient[1,2] = standard error of $\beta_0$
summary(fit)$coefficient[2,2] = standard error of$\beta_1$And so we have: pi0 <- exp(fit$coefficient[1]) / (1 + exp(fit$coefficient[1]))pi1 <- exp(fit$coefficient[1] + fit$coefficient[2]) / (1 + exp(fit$coefficient[1]+fit$coefficient[2]))odds0 <- pi0 / (1 - pi0)odds1 <- pi1 / (1 - pi1)OR <- odds1 / odds0#the same result with:OR <- exp(fit$coefficient[2])#the confidence interval for OR is:ORmin <- exp( fit$coefficient[2] - qnorm(.975) * summary(fit)$coefficient[2,2] )ORmax <- exp( fit$coefficient[2] + qnorm(.975) * summary(fit)$coefficient[2,2] )

We can obtain the same result for the odd-ratio, using the simplify formula:

$$OR=\frac{ad}{bc}=\frac{251\cdot33}{4\cdot131}=15.807$$

that in R is:

OR <- (table[1,1]*table[2,2]) / (table[1,2]*table[2,1])

The acronym AIC stands for Akaike's information criterion. This parameter does not provide any data on the model just created. It
may be useful in comparing this model with other possibly taken into account (the model with lowest AIC is the better).

Saturday, August 8, 2009

Trend Analysis with the Cox-Stuart test in R

The Cox-Stuart test is defined as a little powerful test (power equal to 0.78), but very robust for the trend analysis. It is therefore applicable to a wide variety of situations, to get an idea of the evolution of values obtained. The proposed method is based on the binomial distribution. In R there is no function to perform a test of Cox-Stuart, so now we see the logical steps that are the basis of test and finally we can write the function ourself.

You want to assess whether there is an increasing or decreasing trend of the number of daily customers of a restaurant. We have the number of customers in 15 days:
Customers: 5, 9, 12, 18, 17, 16, 19, 20, 4, 3, 18, 16, 17, 15, 14

To perform the test of Cox-Stuart, the number of observations must be even. In our case we have 15 observations. Delete, therefore, the observation at position (N+1)/2 (here the observation with value = 20):

customers = c(5, 9, 12, 18, 17, 16, 19, 20, 4, 3, 18, 16, 17, 15, 14)length(customers)[1] 15cust_even = customers[ -(length(customers)+1)/2 ]length(cust_even)[1] 14

Now we have 14 observations, and we can then proceed. Divide the observations into two vectors, the first containing the first half of the measures, and the second the second half:

fHalf = cust_even[1:7]sHalf = cust_even[8:14]fHalf[1]  5  9 12 18 17 16 19sHalf[1]  4  3 18 16 17 15 14

Now subtract, value by value, the content of the two vectors:

difference = fHalf - sHalfdifference[1]  1  6 -6  2  0  1  5

Now consider only the signs of the contents of the vector difference

signs = sign(difference)signs[1]  1  1 -1  1  0  1  1

A difference has value 0 and therefore also in the vector with the signs there is a value equal to 0. This must be eliminated:

signs = signs[ signs != 0 ]signs[1]  1  1 -1  1  1  1

We obtained six differences, and then six signs. Now we have to count the number of positive-signs and the number of negative-signs:

pos = signs[signs > 0]neg = signs[signs < 0]length(pos)[1] 5length(neg)[1] 1

Now we choose the number of signs that is smaller. In this case we choose the number of negative signs (1). We compute the probability to obtain x = 1 successes on N = 6 experiments, each of which yields success with probability p = 0.5 (binomial distribution):

pbinom(1, 6, 0.5)[1] 0.109375

The value so calculated is higher than 0.05 (we choose a significance level of 95%). Therefore there is no significant trend (which would have been in decline since the number of negative signs is minor).
If the value was less than 0.05, we accepted the hypothesis of a significant trend.

Now try to fit a regression model, and observe the p-value of the slope: the coefficient b is not significant.

customers = c(5, 9, 12, 18, 17, 16, 19, 20, 4, 3, 18, 16, 17, 15, 14)days <- c(1:length(customers))model <- lm(customers ~ days)summary(model)Call:lm(formula = customers ~ days)Residuals:    Min      1Q  Median      3Q     Max -11.090  -2.173   1.352   3.967   6.467 Coefficients:            Estimate Std. Error t value Pr(>|t|)   (Intercept)  11.3048     3.1104   3.634  0.00303 **days          0.2786     0.3421   0.814  0.43014   ---Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 5.724 on 13 degrees of freedom

Here is the code to perform a Cos-Stuart test, written by me.

cox.stuart.test =function (x){  method = "Cox-Stuart test for trend analysis"  leng = length(x)  apross = round(leng) %% 2  if (apross == 1) {    delete = (length(x)+1)/2    x = x[ -delete ]   }  half = length(x)/2  x1 = x[1:half]  x2 = x[(half+1):(length(x))]  difference = x1-x2  signs = sign(difference)  signcorr = signs[signs != 0]  pos = signs[signs>0]  neg = signs[signs<0]  if (length(pos) < length(neg)) {    prop = pbinom(length(pos), length(signcorr), 0.5)    names(prop) = "Increasing trend, p-value"    rval <- list(method = method, statistic = prop)    class(rval) = "htest"    return(rval)  }  else {    prop = pbinom(length(neg), length(signcorr), 0.5)    names(prop) = "Decreasing trend, p-value"    rval <- list(method = method, statistic = prop)    class(rval) = "htest"    return(rval)  }}

We can now use the function just created:

customers = c(5, 9, 12, 18, 17, 16, 19, 20, 4, 3, 18, 16, 17, 15, 14)cox.stuart.test(customers)        Cox-Stuart test for trend analysisdata:  Decreasing trend, p-value = 0.1094

Friday, August 7, 2009

Two-way analysis of variance: two-way ANOVA in R

The one-way analysis of variance is a useful technique to verify if the means of more groups are equals. But this analysis may not be very useful for more complex problems. For example, it may be necessary to take into account two factors of variability to determine if the averages between the groups depend on the group classification ( "zone") or the second variable that is to consider ("block"). In this case should be used the two-way analysis of variance (two-way ANOVA).

We begin immediately with an example so as to facilitate the understanding of this statistical method. The data collected are organized into double entry tables.

The director of a company has collected revenue (thousand dollars) for 5 years and under per month. You want to see if the revenue depends on the year and/or month, or if they are independent of these two factors.

Conceptually, the problem may be solved by an horizontal ANOVA and a vertical ANOVA, in order to verify if the average revenues per year are the same, and if they are equal to the average revenue computed by month. This would require many calculations, and so we prefer to use the two-way ANOVA, which provides the result instantly.
This is the table of revenue classified by year and month:

$$\begin{tabular}{|c||ccccc||r|}\hline Months & Year 1 & Year 2 & Year 3 & Year 4 & Year 5\\\hline January&15&18&22&23&24\\ February&22&25&15&15&14\\ March&18&22&15&19&21\\ April&23&15&14&17&18\\ May&23&15&26&18&14\\ June&12&15&11&10&8\\ July&26&12&23&15&18\\ August&19&17&15&20&10\\ September&15&14&18&19&20\\ October&14&18&10&12&23\\ November&14&22&19&17&11\\ December&21&23&11&18&14\\ \hline \end{tabular}$$

As with the one-way ANOVA, even here the aim is to structure a Fisher's F-test to assess the significance of the variable "month" and of the variable "year", determine if the revenues are dependent on one (or both) the criteria for classification.
How to perform the two-way ANOVA in R? First creates an array containing all the values tabulated, transcribed by rows:

revenue = c(15,18,22,23,24, 22,25,15,15,14, 18,22,15,19,21,          23,15,14,17,18, 23,15,26,18,14, 12,15,11,10,8, 26,12,23,15,18,          19,17,15,20,10, 15,14,18,19,20, 14,18,10,12,23, 14,22,19,17,11,          21,23,11,18,14)

According to the months, you create a factor of 12 levels (the number of rows) with 5 repetitions (the number columns) in this manner:

months = gl(12,5)

According to the years you create a factor with 5 levels (the number of column) and 1 recurrence, declaring the total number of observations (the length of the vector revenue):

years = gl(5, 1, length(entrate))

Now you can fit the linear model and produce the ANOVA table:

fit = aov(revenue ~ months + years)anova(fit)Analysis of Variance TableResponse: revenue          Df Sum Sq Mean Sq F value Pr(>F)months    11 308.45   28.04  1.4998 0.1660years      4  44.17   11.04  0.5906 0.6712Residuals 44 822.63   18.70

Now interpret the results.
The significance of the difference between months is: F = 1.4998. This value is lower than the value tabulated and indeed p-value > 0.05. So we accept the null hypothesis: the means of revenue evaluated according to the months are equal, then the variable "months" has no effect on revenue.

The significance of the difference between years is: F = 0.5906. This value is lower than the value tabulated and indeed p-value > 0.05. So we accept the null hypothesis: the means of revenue evaluated according to the years are equal, then the variable "years" has no effect on revenue.

Thursday, August 6, 2009

Simple linear regression

We use the regression analysis when, from the data sample, we want to derive a statistical model that predicts the values of a variable (Y, dependent) from the values of another variable (X, independent). The linear regression, which is the simplest and most frequent relationship between two quantitative variables, can be positive (when X increase, Y increase too) or negative (when X increase, Y decrease): this is indicated by the sign of the coefficient b.

To build the line that describes the distribution of points, we might refer to different principles. The most common is the least squares method (or Model I), and this is the method used by the statistical software R.

Suppose you want to obtain a linear relationship between weight (kg) and height (cm) of 10 subjects.
Height: 175, 168, 170, 171, 169, 165, 165, 160, 180, 186
Weight: 80, 68, 72, 75, 70, 65, 62, 60, 85, 90

The first problem is to decide what is the dependent variable Y and waht is the independent variable X. In general, the independent variable is not affected by an error during the measurement (or affected by random error), while the dependent variable is affected by error. In our case we can assume that the variable weight is the independent variable (X), and the dependent variable height (Y).
So our problem is to find a linear relationship (formula) that allows us to calculate the height, known as the weight of an individual. The simplest formula is that of a broad line of type Y = a + bX. The simple regression line in R is calculated as follows:

height = c(175, 168, 170, 171, 169, 165, 165, 160, 180, 186)weight = c(80, 68, 72, 75, 70, 65, 62, 60, 85, 90) model = lm(formula = height ~ weight, x=TRUE, y=TRUE)modelCall:lm(formula = height ~ weight, x = TRUE, y = TRUE)Coefficients:(Intercept)       weight     115.2002       0.7662

The correct syntax of the formula stated in lm is: Y ~ X, then you declare first the dependent variable, and after the independent variable (or variables).
The output of the function is represented by two parameters a and b: a=115.2002 (intercept), b=0.7662 (the slope).

The simple calculation of the line is not enough. We must assess the significance of the line, ie if the slopeb differs from zero significantly. This may be done with a Student's t.test or with a Fisher's F-test.
In R both can be retrieved very quickly, with the function summary(). Here's how:

model <- lm(height ~ weight)summary(model)Call:lm(formula = height ~ weight)Residuals:    Min      1Q  Median      3Q     Max -1.6622 -0.9683 -0.1622  0.5679  2.2979 Coefficients:             Estimate Std. Error t value Pr(>|t|)    (Intercept) 115.20021    3.48450   33.06 7.64e-10 ***weight        0.76616    0.04754   16.12 2.21e-07 ***---Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.405 on 8 degrees of freedomMultiple R-squared: 0.9701,     Adjusted R-squared: 0.9664 F-statistic: 259.7 on 1 and 8 DF,  p-value: 2.206e-07

Here too there are the values of the parameters a and b.
The Student's t-test on the slope in this case has the value 16.12; the Student's t-test on the intercept has value 16.12; the value of the Fisher's F test is 259.7 (is the same value would be achieved by performing an ANOVA on the same data: anova(model)). The p-values of the t-tests and the F-test are less then 0.05, so the model we found is significant.
The Multiple R-squared is the coefficient of determination. It provides a measure of how well future outcomes are likely to be predicted by the model. In this case, the 97.01% of the data are well predicted (with 95% of significance) by our model.

We can plot on a graph the data points and the regression line, in this way:

plot(weight, height)abline(model)

Wednesday, August 5, 2009

Contingency table and the study of the correlation between qualitative variables: Pearson's Chi-squared test

If you have qualitative variable, it is possible to verify the correlation by studying a contingency table R by C, using the Pearson's Chi-squared test.

A casino wants to study the correlation between the modes of play and the number of winners by age group, to see if the number of winners depends on the type of game that you chose to do, in light of experience. It has the following data (number of winners / 100 player for game and age-group):

$$\begin{tabular}{c|ccc}&Age\\\hline Game&20-30&31-40&41-50\\ \hline Roulette&44&56&55\\ Black-jack& 66& 88& 23\\Poker& 15& 29& 45 \end{tabular}$$

In R, we must first build a matrix with the data collected:

table <- matrix(c(44,56,55, 66,88,23, 15,29,45), nrow=3, byrow=TRUE)

Now we can compute the chi-squared correlation coefficient:

chisq.test(table)        Pearson's Chi-squared testdata:  table X-squared = 46.0767, df = 4, p-value = 2.374e-09

I reject the null hypothesis H0 in favor of the alternative hypothesis (p-value < 0.05): there is a strong correlation between the age of the player and his probability to win.

Tuesday, August 4, 2009

Non-parametric methods for the study of the correlation: Spearman's rank correlation coefficient and Kendall tau rank correlation coefficient

We saw in the previous post, how to study the correlation between variables that follow a Gaussian distribution with the Pearson product-moment correlation coefficient. If it is not possible to assume that the values follow gaussian distributions, we have two non-parametric methods: the Spearman's rho test and Kendall's tau test.

For example, you want to study the productivity of various types of machinery and the satisfaction of operators in their use (as with a number from 1 to 10). These are the values:
Productivity: 5, 7, 9, 9, 8, 6, 4, 8, 7, 7
Satisfaction: 6, 7, 4, 4, 8, 7, 3, 9, 5, 8

Begin to use first the Spearman's rank correlation coefficient:

a <- c(5, 7, 9, 9, 8, 6, 4, 8, 7, 7)b <- c(6, 7, 4, 4, 8, 7, 3, 9, 5, 8)cor.test(a, b, method="spearman")        Spearman's rank correlation rhodata:  a and b S = 145.9805, p-value = 0.7512alternative hypothesis: true rho is not equal to 0 sample estimates:      rho 0.1152698

The statistical test gives us as a result rho = 0.115, which indicates a low correlation (not parametric) between the two sets of values.
The p-value > 0.05 allows us to accept the value of rho calculated, being statistically significant.

Now we check the same data with the Kendall tau rank correlation coefficient:

a <- c(5, 7, 9, 9, 8, 6, 4, 8, 7, 7)b <- c(6, 7, 4, 4, 8, 7, 3, 9, 5, 8) cor.test(a, b, method="kendall")        Kendall's rank correlation taudata:  a and b z = 0.5555, p-value = 0.5786alternative hypothesis: true tau is not equal to 0 sample estimates:     tau 0.146385

Even with the Kendall test, the correlation is very low (tau = 0.146), and significant (p-value > 0.05).

Monday, August 3, 2009

Parametric method for the study of the correlation: the Pearson r-test

Suppose you want to study whether there is a correlation between 2 sets of data. To do this we compute the Pearson product-moment correlation coefficient, which is a measure of the correlation (linear dependence) between two variables X and Y; then we compute the value of a t-test to study the significance of the Pearson coefficient R. We can use this test when the data follow a Gaussian distribution.

A new test to measure IQ is subjected to 10 volunteers. You want to see if there is a correlation between the new experimental test and the classical test, in order to replace the old test with the new test. These the values:
Old test: 15, 21, 25, 26, 30, 30, 22, 29, 19, 16
New test: 55, 56, 89, 67, 84, 89, 99, 62, 83, 88

The software R has a single function, easily recalled, which gives us directly the value of the Pearson coefficient and the t-statistical test for checking the significance of the coefficient:

a = c(15, 21, 25, 26, 30, 30, 22, 29, 19, 16)b = c(55, 56, 89, 67, 84, 89, 99, 62, 83, 88)cor.test(a, b)        Pearson's product-moment correlationdata:  a and b t = 0.4772, df = 8, p-value = 0.646alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: -0.5174766  0.7205107 sample estimates:     cor 0.166349

The value of the coefficient of Pearson is 0.166: it is a very low value, which indicates a poor correlation between the variables.
Furthermore, the p-value is greater than 0.05; so we reject the null hypothesis: then the Pearson coefficient is significant.
So we can say that there is no correlation between the results of both tests.