Showing posts with label Simple linear regression. Show all posts

Tuesday, August 25, 2009

Web-site trend analysis with data from Google Analytics

This post builds on my two previous posts, on trend analysis with the Cox-Stuart test and on simple linear regression. The goal is to assess the trend in the number of visits received by a site over a long period. I use Google Analytics because it lets us save its reports in Excel CSV format. Let's see, step by step, how to save the report, how to import the data from Excel into R, and finally how to estimate whether the number of daily visits follows an increasing or decreasing trend.

Let's start by creating an ad hoc report in Google Analytics. Once you have logged in, select the date range you want to analyze, then click on Visits.



At this point we can save the report by clicking on Export and then on CSV for Excel.



Save the CSV file and open it with Excel. Here's how it looks:



Now import the data into R. Importing data from Excel into R is very simple: select the column (or columns) of interest (in our case the column Visits) and copy it to the clipboard with CTRL + C. Remember to include the header cell Visits in the selection, because it will become the column name:

Then open R and type the following command:


myvisit <- read.delim("clipboard")

myvisit

Visits
1 33
2 41
3 34
4 45
5 46
6 37
7 31
8 37
9 34
10 34
11 48
12 39
13 33
...



It is a one-column dataframe; the name of the column is Visits (which is why it is important to include the header when copying from Excel).
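As an alternative to the clipboard, the exported file can also be read directly. The following is a minimal sketch, assuming the export reduces to a single header line followed by one value per row; a real Google Analytics export may contain extra header lines, in which case the skip argument of read.csv would be needed. The text here is a hypothetical miniature of the file, built from the first five values shown above.

```r
# Hypothetical miniature of an exported report: one header line, one value per row
csv_text <- "Visits\n33\n41\n34\n45\n46"

# read.csv() normally takes a file path; the text= argument is used here only
# so that the example runs without a file on disk
myvisit <- read.csv(text = csv_text)

# Extract the column as a plain numeric vector
visits <- myvisit$Visits
str(myvisit)
```

With a real file, the call would take the path of the saved CSV instead of the text argument.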

Now we can analyze the trend in the two ways proposed: with a Cox-Stuart test and with a simple linear regression.

The function to perform the Cox-Stuart test is available here. First we must convert the dataframe into a format that the function cox.stuart.test can read, like this:


visits <- c(myvisit$Visits)


In this way I have created a vector (visits) that contains all the data in the column Visits of the dataframe myvisit, in their original order. Now we run the Cox-Stuart test:


cox.stuart.test(visits)

Cox-Stuart test for trend analysis

data:
Increasing trend, p-value = 0.0012


The output is very clear: we have detected an increasing trend in visits, and it is highly significant (since p-value < 0.05).
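For readers who want to see what such a test does, here is a minimal sketch of the Cox-Stuart idea. It is not the cox.stuart.test function used above: the helper name cox_stuart_sketch is hypothetical, it reports a two-sided p-value rather than naming the direction of the trend, and the input series is synthetic.

```r
# Hypothetical helper, not the cox.stuart.test function used in the post
cox_stuart_sketch <- function(x) {
  n <- length(x)
  half <- n %/% 2
  # Pair each value in the first half with the matching value in the second
  # half; with an odd length the middle observation is left out
  first  <- x[1:half]
  second <- x[(n - half + 1):n]
  d <- second - first
  d <- d[d != 0]   # ties carry no information about direction
  n_pos <- sum(d > 0)
  # Under "no trend", positive and negative differences are equally likely,
  # so the number of positive signs is tested against a Binomial(n, 0.5)
  test <- binom.test(n_pos, length(d), p = 0.5)
  list(n_pairs = length(d), n_positive = n_pos, p_value = test$p.value)
}

# Synthetic, strictly increasing series, for illustration only
res <- cox_stuart_sketch(1:20)
res$n_positive
res$p_value
```

A large share of positive differences points to an increasing trend, a large share of negative ones to a decreasing trend.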




If we are not satisfied or sure of this result, we can take into account the slope of the regression line. First, we may want to plot the data. The vector visits contains the daily visits to the site. Now we create a sequence of day numbers with the same length as the vector visits:


days <- c(1 : length(visits))


Create a plot:


plot(days, visits, type="b")


Choosing type="b" draws both points and lines, as shown in the figure:



From this plot it is not easy to see a possible trend in the visits. We can still run a regression analysis: from the sign of the slope of the line, we can estimate whether the trend is increasing or decreasing:


fit <- lm(visits ~ days)
summary(fit)

Call:
lm(formula = visits ~ days)

Residuals:
Min 1Q Median 3Q Max
-22.714 -6.197 -1.313 5.648 31.153

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 31.79694 2.27151 13.998 < 2e-16 ***
days 0.19815 0.04242 4.671 1.04e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.81 on 90 degrees of freedom
Multiple R-squared: 0.1951, Adjusted R-squared: 0.1862
F-statistic: 21.82 on 1 and 90 DF, p-value: 1.043e-05


The slope coefficient has the value b = 0.19815. Its sign is positive, which suggests an increasing trend. The t statistic for the slope (4.671) and its p-value (1.04e-05) also indicate that the slope is significantly different from zero. We can therefore say that there is an increasing trend.
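The slope and its p-value can also be extracted from the fitted model instead of being read off the printed summary. The sketch below uses synthetic data (the real series of visits is not reproduced here), so the numbers it produces are for illustration only.

```r
# Synthetic data for illustration only (not the site's real visits)
days   <- 1:92
visits <- 30 + 0.2 * days + 5 * sin(days)

fit <- lm(visits ~ days)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|)
coefs   <- coef(summary(fit))
slope   <- coefs["days", "Estimate"]
p_slope <- coefs["days", "Pr(>|t|)"]

# 95% confidence interval for the slope
ci <- confint(fit, "days", level = 0.95)

# TRUE here would suggest a significant increasing trend
slope > 0 && p_slope < 0.05
```

A confidence interval that lies entirely above zero tells the same story as the t-test on the slope.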

Finally, we can see the regression line directly on the plot previously obtained in this way:


plot(days, visits, type="b")
abline(fit, col="red", lwd=3)


The command abline adds a straight line, here the one defined by the fitted model, directly to the current plot; the parameter "col" specifies the color and the parameter "lwd" specifies the thickness of the line. Now look at the graph:



The increasing trend is now evident, in agreement with the Cox-Stuart test.

Thursday, August 6, 2009

Simple linear regression

We use regression analysis when, from a data sample, we want to derive a statistical model that predicts the values of one variable (Y, dependent) from the values of another variable (X, independent). Linear regression, which describes the simplest and most frequent relationship between two quantitative variables, can be positive (when X increases, Y increases too) or negative (when X increases, Y decreases): this is indicated by the sign of the coefficient b.

To build the line that describes the distribution of points, we might refer to different principles. The most common is the least squares method (or Model I), and this is the method used by the statistical software R.

Suppose you want to obtain a linear relationship between weight (kg) and height (cm) of 10 subjects.
Height: 175, 168, 170, 171, 169, 165, 165, 160, 180, 186
Weight: 80, 68, 72, 75, 70, 65, 62, 60, 85, 90


The first problem is to decide which is the dependent variable Y and which is the independent variable X. In general, the independent variable is not affected by error during the measurement (or is affected only by random error), while the dependent variable is affected by error. In our case we can assume that weight is the independent variable (X) and height is the dependent variable (Y).
So our problem is to find a linear relationship (formula) that allows us to calculate the height of an individual, given the weight. The simplest formula is that of a straight line of the form Y = a + bX. The simple regression line in R is calculated as follows:


height = c(175, 168, 170, 171, 169, 165, 165, 160, 180, 186)
weight = c(80, 68, 72, 75, 70, 65, 62, 60, 85, 90)

model = lm(formula = height ~ weight, x=TRUE, y=TRUE)
model

Call:
lm(formula = height ~ weight, x = TRUE, y = TRUE)

Coefficients:
(Intercept) weight
115.2002 0.7662


The correct syntax of the formula passed to lm is Y ~ X: you declare the dependent variable first, and then the independent variable (or variables).
The output of the function is represented by two parameters a and b: a=115.2002 (intercept), b=0.7662 (the slope).
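As a check, the same two parameters can be computed by hand from the usual least-squares formulas and then used for a prediction. This is a small sketch on the data above; the 78 kg subject is a hypothetical example.

```r
height <- c(175, 168, 170, 171, 169, 165, 165, 160, 180, 186)
weight <- c(80, 68, 72, 75, 70, 65, 62, 60, 85, 90)

# Least-squares estimates: slope = Sxy / Sxx, intercept = mean(Y) - b * mean(X)
b <- sum((weight - mean(weight)) * (height - mean(height))) /
  sum((weight - mean(weight))^2)
a <- mean(height) - b * mean(weight)

# The same estimates as returned by lm()
model <- lm(height ~ weight)
coef(model)

# Predicted height for a hypothetical subject weighing 78 kg
predict(model, newdata = data.frame(weight = 78))
```

The prediction is simply a + b * 78, evaluated with the fitted coefficients.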




Calculating the line is not enough. We must also assess its significance, i.e. whether the slope b differs significantly from zero. This can be done with a Student's t-test or with a Fisher's F-test.
In R both can be obtained very quickly with the function summary(). Here's how:


model <- lm(height ~ weight)
summary(model)

Call:
lm(formula = height ~ weight)

Residuals:
Min 1Q Median 3Q Max
-1.6622 -0.9683 -0.1622 0.5679 2.2979

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 115.20021 3.48450 33.06 7.64e-10 ***
weight 0.76616 0.04754 16.12 2.21e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.405 on 8 degrees of freedom
Multiple R-squared: 0.9701, Adjusted R-squared: 0.9664
F-statistic: 259.7 on 1 and 8 DF, p-value: 2.206e-07



Here too we find the values of the parameters a and b.
The Student's t-test on the slope has the value 16.12; the Student's t-test on the intercept has the value 33.06; the value of Fisher's F-test is 259.7 (the same value would be obtained by performing an ANOVA on the same data: anova(model)). The p-values of the t-tests and of the F-test are less than 0.05, so the model we found is significant.
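The link between these statistics can be checked directly: with a single predictor, the F statistic equals the square of the t statistic of the slope, and the ANOVA table reports the same F value. A short sketch on the same data:

```r
height <- c(175, 168, 170, 171, 169, 165, 165, 160, 180, 186)
weight <- c(80, 68, 72, 75, 70, 65, 62, 60, 85, 90)

model <- lm(height ~ weight)
s <- summary(model)

# t statistic of the slope and overall F statistic
t_slope <- coef(s)["weight", "t value"]
f_value <- unname(s$fstatistic["value"])

# With one predictor, F equals the squared t statistic of the slope
all.equal(f_value, t_slope^2)

# The ANOVA table reports the same F value in its first row
anova(model)[["F value"]][1]
```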
The Multiple R-squared is the coefficient of determination. It measures the proportion of the variability in the dependent variable that the model explains, and so gives an indication of how well the line fits the data. In this case, the model explains 97.01% of the variance in height.
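The coefficient of determination can be recomputed from its definition, one minus the ratio between the residual sum of squares and the total sum of squares. The sketch below does this on the same data and compares the result with the value stored in the summary.

```r
height <- c(175, 168, 170, 171, 169, 165, 165, 160, 180, 186)
weight <- c(80, 68, 72, 75, 70, 65, 62, 60, 85, 90)

model <- lm(height ~ weight)

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(residuals(model)^2)
ss_tot <- sum((height - mean(height))^2)
r2 <- 1 - ss_res / ss_tot

# Same value as reported by summary()
all.equal(r2, summary(model)$r.squared)

# In simple linear regression it also equals the squared Pearson correlation
all.equal(r2, cor(weight, height)^2)
```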

We can plot on a graph the data points and the regression line, in this way:


plot(weight, height)
abline(model)