## Saturday, September 5, 2009

### Polynomial regression techniques

Suppose we want to create a polynomial that can approximate better the following dataset on the population of a certain Italian city over 10 years. The table summarizes the data:

$$\begin{tabular}{|1|1|}\hline Year & Population\\ \hline 1959&4835\\ 1960&4970\\ 1961&5085\\ 1962&5160\\ 1963&5310\\ 1964&5260\\ 1965&5235\\ 1966&5255\\ 1967&5235\\ 1968&5210\\ 1969&5175\\ \hline \end{tabular}$$

First we import the data into R:

Year <- c(1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969)Population <- c(4835, 4970, 5085, 5160, 5310, 5260, 5235, 5255, 5235, 5210, 5175)

Now we create the dataframe named sample1:

sample1 <- data.frame(Year, Population)sample1   Year Population1  1959       48352  1960       49703  1961       50854  1962       51605  1963       53106  1964       52607  1965       52358  1966       52559  1967       523510 1968       521011 1969       5175

At this point may be useful to chart these values, to observe the trend and take an idea of the final polynomial function. For convenience we modify the column Year, creating a neighborhood of zero, thus:

sample1$Year <- sample1$Year - 1964sample1   Year Population1    -5       48352    -4       49703    -3       50854    -2       51605    -1       53106     0       52607     1       52358     2       52559     3       523510    4       521011    5       5175

Put the values on a chart

plot(sample1$Year, sample1$Population, type="b")

At this point we can start with the search for a polynomial model that adequately approximates our data. First, we specify that we want a polynomial function of X, ie a raw polynomial , is different from the orthogonal polynomial. This is an important addition because the controls and the results will change in the two cases R. So we want a function of X like:

$$f(x)=\beta_0+\beta_1x+\beta_2x^2+\beta_3x^3+ ... +\beta_nx^n$$

At what degree of the polynomial stop? Depends on the degree of precision that we seek. The greater the degree of the polynomial, the greater the accuracy of the model, but the greater the difficulty in calculating; we must also verify the significance of coefficients that are found. But let's get straight to the point.

In R for fitting a polynomial regression model (not orthogonal), there are two methods, among them identical. Suppose we seek the values of beta coefficients for a polynomial of degree 1, then 2nd degree, and 3rd degree:

fit1 <- lm(sample1$Population ~ sample1$Year)fit2 <- lm(sample1$Population ~ sample1$Year + I(sample1$Year^2))fit3 <- lm(sample1$Population ~ sample1$Year + I(sample1$Year^2) + I(sample1$Year^3)) Or we can write more quickly, for polynomials of degree 2 and 3: fit2b <- lm(sample1$Population ~ poly(sample1$Year, 2, raw=TRUE))fit3b <- lm(sample1$Population ~ poly(sample1$Year, 3, raw=TRUE)) The function poly is useful if you want to get a polynomial of high degree, because it avoids explicitly write the formula. If we specify raw=TRUE, the two methods provide the same output, but if we do not specify raw=TRUE (or rgb(153, 0, 0);">raw=F), the function poly give us the values of the beta parameters of an orthogonal polynomials, which is different from the general formula I wrote above, although the models are both effective. Let's look at the output. summary(fit2)## or summary(fit2b)Call:lm(formula = sample1$Population ~ sample1$Year + I(sample1$Year^2))Residuals:    Min      1Q  Median      3Q     Max -46.888 -18.834  -3.159   2.040  86.748 Coefficients:                  Estimate Std. Error t value Pr(>|t|)    (Intercept)       5263.159     17.655 298.110  < 2e-16 ***sample1$Year 29.318 3.696 7.933 4.64e-05 ***I(sample1$Year^2)  -10.589      1.323  -8.002 4.36e-05 ***---Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 38.76 on 8 degrees of freedomMultiple R-squared: 0.9407,     Adjusted R-squared: 0.9259 F-statistic: 63.48 on 2 and 8 DF,  p-value: 1.235e-05

The output of summary(fit2b) is the same. We obtained the values of beta0 (5263,159), beta1 (29,318) and beta2 (-10,589), which appear to be significant AII 3. The equation of polynomial of degree 2 of our model is:

$$f(x)=5263.1597+29.318x-10.589x^2$$

If we want a polynomial of 3rd degree, we have:

summary(fit3)## of summary(fit3b)Call:lm(formula = sample1$Population ~ sample1$Year + I(sample1$Year^2) + I(sample1$Year^3))Residuals:    Min      1Q  Median      3Q     Max -32.774 -14.802  -1.253   3.199  72.634 Coefficients:                   Estimate Std. Error t value Pr(>|t|)    (Intercept)       5263.1585    15.0667 349.324 4.16e-16 ***sample1$Year 14.3638 8.1282 1.767 0.1205 I(sample1$Year^2)  -10.5886     1.1293  -9.376 3.27e-05 ***I(sample1$Year^3) 0.8401 0.4209 1.996 0.0861 . ---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 33.08 on 7 degrees of freedomMultiple R-squared: 0.9622, Adjusted R-squared: 0.946 F-statistic: 59.44 on 3 and 7 DF, p-value: 2.403e-05  The equation is: $$f(x)=5263.1585+14.3638x-10.5886x^2+0.8401x^3$$ In the latter case, however, the coefficients beta1 and beta3 are not significant, then the best model is the polynomial of 2nd degree. Furthermore look at the Multiple R-squared: in the 2nd degree model it is 94.07%, while in the 3rd degree model it is 96.22%. It seems that there has been an increase in accuracy of the model, but it is a significant increase? We can compare the two model with an ANOVA table: anova(fit2, fit3)Analysis of Variance TableModel 1: sample1$Population ~ sample1$Year + I(sample1$Year^2)Model 2: sample1$Population ~ sample1$Year + I(sample1$Year^2) + I(sample1$Year^3)  Res.Df     RSS Df Sum of Sq      F Pr(>F)  1      8 12019.8                             2      7  7659.5  1    4360.3 3.9848 0.0861 .---Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Since the p-value is greater than 0.05, we accept the null hypothesis: there wasn't a significant improvement of the model.

The biggest problem now is to represent graphically the result. In fact, R does not exist (as far as I know) a function for plotting polynomials found. We must therefore proceed with graphic artifacts still valid, but somewhat laborious.

First, we plotted the values, with the command seen before. This time only display the lines and not points, for convenience graphics:

plot(sample1$Year, sample1$Population, type="l", lwd=3)

Now add to this chart the progress of the 2nd degree polynomial, in this way:

points(sample1$Year, predict(fit2), type="l", col="red", lwd=2) The function predict() compute the Y values given the X values. The the coordinates are linked with lines. Is not plotted the continuous, but the discrete. With a few values, this method is highly debilitating. Let's add the graph of the polynomial of 3rd degree: points(sample1&Year, predict(fit3), type="l", col="blue", lwd=2) As you can see the two models have very similar trends. If we would instead obtain the graph of continuous functions obtained, we proceed in this manner. First you create the polynomial equation we previously found: pol2 <- function(x) fit2$coefficient[3]*x^2 + fit2$coefficient[2]*x + fit2$coefficient[1]

Remember that:
- coefficient[1] = beta0
- coefficient[2] = beta1
- coefficient[3] = beta2
and so on.

At this point we plotted the coordinates of sample1 and then the created curve with curve(x):

plot(sample1$Year, sample1$Population, type="p", lwd=3)pol2 <- function(x) fit2$coefficient[3]*x^2 + fit2$coefficient[2]*x + fit2$coefficient[1]curve(pol2, col="red", lwd=2) The point, however, disappear, but we can replace them with the command points: points(sample1$Year, sample1$Population, type="p", lwd=3) A note: you must follow the order of commands as I have described, otherwise the function curve creates a wrong graph. So summarizing the commands to get the continuous function, and the experimental points on the same graph are the following: plot(sample1$Year, sample1$Population, type="p", lwd=3)pol2 <- function(x) fit2$coefficient[3]*x^2 + fit2$coefficient[2]*x + fit2$coefficient[1]curve(pol2, col="red", lwd=2)points(sample1$Year, sample1$Population, type="p", lwd=3)

The graph we get is the following:

Now draw the graph of the polynomial of 3rd degree:

plot(sample1$Year, sample1$Population, type="p", lwd=3)pol3 <- function(x) fit3$coefficient[4]*x^3 + fit3$coefficient[3]*x^2 + fit3$coefficient[2]*x + fit3$coefficient[1]curve(pol3, col="red", lwd=2)points(sample1$Year, sample1$Population, type="p", lwd=3)