Saturday, September 5, 2009

Polynomial regression techniques

Suppose we want to create a polynomial that can approximate better the following dataset on the population of a certain Italian city over 10 years. The table summarizes the data:

$$\begin{tabular}{|1|1|}\hline Year & Population\\ \hline 1959&4835\\ 1960&4970\\ 1961&5085\\ 1962&5160\\ 1963&5310\\ 1964&5260\\ 1965&5235\\ 1966&5255\\ 1967&5235\\ 1968&5210\\ 1969&5175\\ \hline \end{tabular}$$

First we import the data into R:


Year <- c(1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969)
Population <- c(4835, 4970, 5085, 5160, 5310, 5260, 5235, 5255, 5235, 5210, 5175)


Now we create the dataframe named sample1:


sample1 <- data.frame(Year, Population)
sample1

Year Population
1 1959 4835
2 1960 4970
3 1961 5085
4 1962 5160
5 1963 5310
6 1964 5260
7 1965 5235
8 1966 5255
9 1967 5235
10 1968 5210
11 1969 5175


At this point may be useful to chart these values, to observe the trend and take an idea of the final polynomial function. For convenience we modify the column Year, creating a neighborhood of zero, thus:


sample1$Year <- sample1$Year - 1964
sample1

Year Population
1 -5 4835
2 -4 4970
3 -3 5085
4 -2 5160
5 -1 5310
6 0 5260
7 1 5235
8 2 5255
9 3 5235
10 4 5210
11 5 5175


Put the values on a chart


plot(sample1$Year, sample1$Population, type="b")




At this point we can start with the search for a polynomial model that adequately approximates our data. First, we specify that we want a polynomial function of X, ie a raw polynomial , is different from the orthogonal polynomial. This is an important addition because the controls and the results will change in the two cases R. So we want a function of X like:

$$f(x)=\beta_0+\beta_1x+\beta_2x^2+\beta_3x^3+ ... +\beta_nx^n$$

At what degree of the polynomial stop? Depends on the degree of precision that we seek. The greater the degree of the polynomial, the greater the accuracy of the model, but the greater the difficulty in calculating; we must also verify the significance of coefficients that are found. But let's get straight to the point.

In R for fitting a polynomial regression model (not orthogonal), there are two methods, among them identical. Suppose we seek the values of beta coefficients for a polynomial of degree 1, then 2nd degree, and 3rd degree:


fit1 <- lm(sample1$Population ~ sample1$Year)
fit2 <- lm(sample1$Population ~ sample1$Year + I(sample1$Year^2))
fit3 <- lm(sample1$Population ~ sample1$Year + I(sample1$Year^2) + I(sample1$Year^3))


Or we can write more quickly, for polynomials of degree 2 and 3:


fit2b <- lm(sample1$Population ~ poly(sample1$Year, 2, raw=TRUE))
fit3b <- lm(sample1$Population ~ poly(sample1$Year, 3, raw=TRUE))


The function poly is useful if you want to get a polynomial of high degree, because it avoids explicitly write the formula. If we specify raw=TRUE, the two methods provide the same output, but if we do not specify raw=TRUE (or rgb(153, 0, 0);">raw=F), the function poly give us the values of the beta parameters of an orthogonal polynomials, which is different from the general formula I wrote above, although the models are both effective.

Let's look at the output.


summary(fit2)
## or summary(fit2b)

Call:
lm(formula = sample1$Population ~ sample1$Year + I(sample1$Year^2))

Residuals:
Min 1Q Median 3Q Max
-46.888 -18.834 -3.159 2.040 86.748

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5263.159 17.655 298.110 < 2e-16 ***
sample1$Year 29.318 3.696 7.933 4.64e-05 ***
I(sample1$Year^2) -10.589 1.323 -8.002 4.36e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 38.76 on 8 degrees of freedom
Multiple R-squared: 0.9407, Adjusted R-squared: 0.9259
F-statistic: 63.48 on 2 and 8 DF, p-value: 1.235e-05


The output of summary(fit2b) is the same. We obtained the values of beta0 (5263,159), beta1 (29,318) and beta2 (-10,589), which appear to be significant AII 3. The equation of polynomial of degree 2 of our model is:

$$f(x)=5263.1597+29.318x-10.589x^2$$

If we want a polynomial of 3rd degree, we have:


summary(fit3)
## of summary(fit3b)

Call:
lm(formula = sample1$Population ~ sample1$Year + I(sample1$Year^2) +
I(sample1$Year^3))

Residuals:
Min 1Q Median 3Q Max
-32.774 -14.802 -1.253 3.199 72.634

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5263.1585 15.0667 349.324 4.16e-16 ***
sample1$Year 14.3638 8.1282 1.767 0.1205
I(sample1$Year^2) -10.5886 1.1293 -9.376 3.27e-05 ***
I(sample1$Year^3) 0.8401 0.4209 1.996 0.0861 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 33.08 on 7 degrees of freedom
Multiple R-squared: 0.9622, Adjusted R-squared: 0.946
F-statistic: 59.44 on 3 and 7 DF, p-value: 2.403e-05


The equation is:

$$f(x)=5263.1585+14.3638x-10.5886x^2+0.8401x^3$$

In the latter case, however, the coefficients beta1 and beta3 are not significant, then the best model is the polynomial of 2nd degree. Furthermore look at the Multiple R-squared: in the 2nd degree model it is 94.07%, while in the 3rd degree model it is 96.22%. It seems that there has been an increase in accuracy of the model, but it is a significant increase? We can compare the two model with an ANOVA table:


anova(fit2, fit3)

Analysis of Variance Table

Model 1: sample1$Population ~ sample1$Year + I(sample1$Year^2)
Model 2: sample1$Population ~ sample1$Year + I(sample1$Year^2) + I(sample1$Year^3)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 8 12019.8
2 7 7659.5 1 4360.3 3.9848 0.0861 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Since the p-value is greater than 0.05, we accept the null hypothesis: there wasn't a significant improvement of the model.

The biggest problem now is to represent graphically the result. In fact, R does not exist (as far as I know) a function for plotting polynomials found. We must therefore proceed with graphic artifacts still valid, but somewhat laborious.

First, we plotted the values, with the command seen before. This time only display the lines and not points, for convenience graphics:


plot(sample1$Year, sample1$Population, type="l", lwd=3)


Now add to this chart the progress of the 2nd degree polynomial, in this way:


points(sample1$Year, predict(fit2), type="l", col="red", lwd=2)


The function predict() compute the Y values given the X values. The the coordinates are linked with lines. Is not plotted the continuous, but the discrete. With a few values, this method is highly debilitating.



Let's add the graph of the polynomial of 3rd degree:


points(sample1&Year, predict(fit3), type="l", col="blue", lwd=2)




As you can see the two models have very similar trends.

If we would instead obtain the graph of continuous functions obtained, we proceed in this manner. First you create the polynomial equation we previously found:


pol2 <- function(x) fit2$coefficient[3]*x^2 + fit2$coefficient[2]*x + fit2$coefficient[1]


Remember that:
- coefficient[1] = beta0
- coefficient[2] = beta1
- coefficient[3] = beta2
and so on.

At this point we plotted the coordinates of sample1 and then the created curve with curve(x):


plot(sample1$Year, sample1$Population, type="p", lwd=3)
pol2 <- function(x) fit2$coefficient[3]*x^2 + fit2$coefficient[2]*x + fit2$coefficient[1]
curve(pol2, col="red", lwd=2)


The point, however, disappear, but we can replace them with the command points:


points(sample1$Year, sample1$Population, type="p", lwd=3)


A note: you must follow the order of commands as I have described, otherwise the function curve creates a wrong graph. So summarizing the commands to get the continuous function, and the experimental points on the same graph are the following:


plot(sample1$Year, sample1$Population, type="p", lwd=3)
pol2 <- function(x) fit2$coefficient[3]*x^2 + fit2$coefficient[2]*x + fit2$coefficient[1]
curve(pol2, col="red", lwd=2)
points(sample1$Year, sample1$Population, type="p", lwd=3)


The graph we get is the following:



Now draw the graph of the polynomial of 3rd degree:


plot(sample1$Year, sample1$Population, type="p", lwd=3)
pol3 <- function(x) fit3$coefficient[4]*x^3 + fit3$coefficient[3]*x^2 + fit3$coefficient[2]*x + fit3$coefficient[1]
curve(pol3, col="red", lwd=2)
points(sample1$Year, sample1$Population, type="p", lwd=3)