Statistic on aiR

Adjustment for Multiple Comparison Tests with R: Resources on the web

2015-01-20T19:51:00.000+01:00

1. Bonferroni correction

p.adjust(p, method = "bonferroni")

Read: http://en.wikipedia.org/wiki/

2. Sidak (Dunn-Sidak) correction

Read: http://en.wikipedia.org/wiki/
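
Base R's p.adjust() does not offer a "sidak" method, so the single-step Sidak adjustment has to be written by hand. The following is a minimal sketch; the function name sidak_adjust is hypothetical, and it assumes the usual single-step formula 1 - (1 - p)^m for m tests.

```r
# Single-step Sidak adjustment (hypothetical helper, not part of base R).
# Each raw p-value p becomes 1 - (1 - p)^m, where m is the number of tests.
sidak_adjust <- function(p) {
  m <- length(p)
  pmin(1, 1 - (1 - p)^m)
}

p <- c(0.01, 0.02, 0.03)   # illustrative p-values
sidak_adjust(p)
p.adjust(p, method = "bonferroni")   # for comparison
```

The Sidak-adjusted values are never larger than the Bonferroni ones, since 1 - (1 - p)^m <= m * p.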

3. Holm-Bonferroni correction

p.adjust(p, method = "holm")

Read: http://en.wikipedia.org/wiki/

4. Hochberg correction

p.adjust(p, method = "hochberg")

Read: http://stats.stackexchange.com/questions/
Read: http://onbiostatistics.blogspot.it/

5. Hommel correction

p.adjust(p, method = "hommel")

Read: http://stats.stackexchange.com/questions

6. Benjamini-Hochberg correction

p.adjust(p, method = "BH")
or equivalently
p.adjust(p, method = "fdr")

Read: http://nebc.nerc.ac.uk/courses/
Read: http://en.wikipedia.org/wiki/

7. Benjamini–Yekutieli (Benjamini–Hochberg–Yekutieli) correction

p.adjust(p, method = "BY")

Read: http://en.wikipedia.org/wiki/
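
To see how the corrections listed above differ in practice, they can be applied side by side to the same vector. This is only a sketch: the p-values below are made up for illustration.

```r
# Hypothetical raw p-values from six tests
p <- c(0.001, 0.008, 0.012, 0.03, 0.04, 0.2)

methods <- c("bonferroni", "holm", "hochberg", "hommel", "BH", "BY")

# One column of adjusted p-values per method
adjusted <- sapply(methods, function(m) p.adjust(p, method = m))
round(cbind(raw = p, adjusted), 4)
```

Holm is never more conservative than Bonferroni, and Benjamini-Hochberg (which controls the false discovery rate rather than the family-wise error rate) is never more conservative than Hochberg.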

ggPlot2: Histogram with jittered stripchart

2014-02-05T20:50:00.000+01:00

Here is an example of a histogram with a vertically jittered stripchart along the x-axis of the plot.

Alternatively, using the geom_rug function:

Of course, this simplistic method needs some tuning: both the vertical position of the stripchart or rug (y = -2 here) and the amount of jitter applied to the points have to be adjusted to the data.
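
The two variants can be sketched roughly as follows. This is not the original code of the post: the data are simulated, and the bin count, colours and jitter height are arbitrary choices that would need adjusting.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(x = rnorm(200))   # simulated data, for illustration only

# Histogram with a stripchart drawn below the bars, jittered vertically around y = -2
p_jitter <- ggplot(df, aes(x = x)) +
  geom_histogram(bins = 30, fill = "grey70", colour = "white") +
  geom_point(aes(y = -2),
             position = position_jitter(width = 0, height = 1),
             alpha = 0.4)

# Alternative: a rug along the bottom axis
p_rug <- ggplot(df, aes(x = x)) +
  geom_histogram(bins = 30, fill = "grey70", colour = "white") +
  geom_rug(sides = "b")

print(p_jitter)
print(p_rug)
```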

Boxplot with mean and standard deviation in ggPlot2 (plus Jitter)

2014-02-02T13:14:00.000+01:00

When you create a boxplot in R, it automatically computes the median, the first and third quartiles ("hinges") and the 95% confidence interval of the median ("notches").

But we would like to replace these defaults so that the boxplot shows the mean, the mean + standard deviation, the mean - standard deviation, the minimum and the maximum.

Here is an example solved with the ggplot2 package. The individual values are also shown as points, jittered horizontally.
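
One possible way to do this, sketched under assumptions, is to pass a custom summary function to stat_summary() and draw it with the boxplot geom. The helper name mean_sd_box and the simulated data are hypothetical.

```r
library(ggplot2)

# Return the five values the boxplot geom expects, but based on mean and SD:
# whiskers = min/max, box = mean -/+ 1 SD, middle line = mean.
# Note: with very skewed data, mean - SD can fall below the minimum.
mean_sd_box <- function(x) {
  m <- mean(x)
  s <- sd(x)
  data.frame(ymin = min(x), lower = m - s, middle = m,
             upper = m + s, ymax = max(x))
}

set.seed(1)
df <- data.frame(
  group = rep(c("a", "b", "c"), each = 30),
  value = rnorm(90, mean = rep(c(10, 12, 15), each = 30))
)

p <- ggplot(df, aes(x = group, y = value)) +
  stat_summary(fun.data = mean_sd_box, geom = "boxplot") +
  geom_jitter(width = 0.15, height = 0, alpha = 0.5)   # horizontal jitter only

print(p)
```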

Implementation of the CDC Growth Charts in R

2011-09-17T22:44:00.009+02:00

I implemented in R a function to re-create the CDC Growth Chart, according to the data provided by the CDC.

In order to use this function, you need to download the .rar file available at this megaupload link.

Mirror: mediafire link.

Then unrar the file and put the Growth folder in your main (working) directory, as selected in R. You are now able to use the two functions I'm going to illustrate.

growthFun.R

The function growthFun lets you draw 8 different growth charts, each in a male and a female version (sixteen in total).
The only parameters you need to input are:
sex = c("m", "f")
type = c("wac36", "lac36", "wlc", "hac", "wsc", "wac20", "lac20", "bac")
The codes accepted by the type parameter are explained in the first part of the function code.
Optionally, you can modify the pat variable if you want to put the Growth folder in another place (not in the main directory of R).

I recommend using the pdf() graphics device for the best resolution.

Here is an example of the output you can obtain with the following code:

pdf("hac_example.pdf", paper="a4", width=0, height=0)
growthFun("m", "hac")
dev.off()

MygrowthFun.R

The function MygrowthFun lets you personalize the output of the previous function with a specific patient's data.
The parameters you can modify are:

 sex=c("m", "f")
type=c("wac36", "lac36", "wlc", "hac", "wsc", "wac20", "lac20", "bac", "bmi.adv")
path="./Growth/"
name = NULL
surname = NULL
birth_date = NULL
mydataAA = NULL

The three parameters sex, type and path are the same as in the growthFun function. The three parameters name, surname and birth_date refer to the patient's data; you can supply them as character() values.
mydataAA is an optional parameter holding the values measured on your patient during follow-up. Generally you need to supply this data as a data.frame().
The type parameter has one additional choice: bmi.adv lets you obtain three charts (wac20, lac20, bac - see the explanation of the codes) if your mydataAA data frame contains stature and weight data collected during follow-up.

Details.
Let's see the format of mydataAA, according to the type of chart you want to graph.

type = wac36
mydataAA: 
first column = months of measurement, from 0 to 36
second column = weight (in kg)

type = lac36
mydataAA: 
first column = months of measurement, from 0 to 36
second column = length (in cm)

type = hac
mydataAA: 
first column = months of measurement, from 0 to 36
second column = head circumference (in cm)

type = wac20
mydataAA: 
first column = months of measurement, from 24 to 240 (from 2 to 20 years)
second column = weight (in kg)

type = lac20
mydataAA: 
first column = months of measurement, from 24 to 240 (from 2 to 20 years)
second column = stature (in cm)

type = bmi.adv
mydataAA: 
first column (months) = months of measurement, from 24 to 240 (from 2 to 20 years)
second column (stature) = stature (in cm)
third column (weight)= weight (in kg)

For this last type the order of the columns does not matter, but their names do.
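
As an illustration of the bmi.adv layout, here is a small sketch of a data frame whose columns are identified by name rather than by position. It assumes that the bac chart is the BMI-for-age chart and that BMI is derived as weight in kg divided by the square of stature in metres; how MygrowthFun computes it internally is not shown in this post.

```r
# Columns deliberately given in a different order: only the names matter
follow_up <- data.frame(
  weight  = c(17, 32, 51),      # kg
  months  = c(25, 60, 120),     # age in months
  stature = c(98, 109, 144)     # cm
)

# Check that the three required column names are present
stopifnot(all(c("months", "stature", "weight") %in% names(follow_up)))

# BMI = weight (kg) / stature (m)^2  (assumed formula for the bac chart)
follow_up$bmi <- follow_up$weight / (follow_up$stature / 100)^2
follow_up[order(follow_up$months), c("months", "stature", "weight", "bmi")]
```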

Examples.
Let's see some examples. Suppose that you are following the growth of a newborn (her name is Alyssa Gigave, born on 07/08/2009), and you collect the following data:

Months  Length
0       50
2       55
3       56
5       61
8       71
9       72
12      75
15      75
18      81
21      89
26      91
27      94
30      95
35      98

So you can create your personalized graph in this way:

alyssa_data <- data.frame(
  months=c(0, 2, 3, 5, 8, 9, 12, 15, 18, 21, 26, 27, 30, 35),
  length=c(50, 55, 56, 61, 71, 72, 75, 75, 81, 89, 91, 94, 95, 98))

pdf("alyssa_growth_chart.pdf", paper="a4", width=0, height=0)
MygrowthFun(sex="f", type="lac36", name="Alyssa", surname="Gigave", birth_date="july 08, 2009", mydataAA=alyssa_data)
dev.off()

The output is the following pdf file:

Now suppose that you're a pediatrician and that you follow a boy (Tommy Cigalino, born on 07/08/1980). At each visit you record his weight and stature, together with his age in months. So you have the following data:

  months stature weight
     25      98     17
     31     100     21
     34     102     27
     35     104     29
     58     106     30
     60     109     32
     70     111     33
     85     118     34
     88     119     36
     89     120     39
     91     121     42
    102     126     45
    107     128     47
    108     135     49
    120     144     51
    134     145     52
    154     148     54
    166     152     55
    169     157     62
    170     158     63
    178     163     64
    179     167     68
    181     168     71
    219     169     74
    234     176     76

So you can create three graphs (wac20, lac20, bac), using the bmi.adv type:

tommy_data <- data.frame(
  months = c( 25, 31, 34, 35, 58, 60,
              70, 85, 88, 89, 91, 102,
              107, 108, 120, 134, 154,
              166, 169, 170, 178, 179,
              181, 219, 234),
  stature = c( 98, 100, 102, 104, 106,
              109, 111, 118, 119, 120,
              121, 126, 128, 135, 144,
              145, 148, 152, 157, 158,
              163, 167, 168, 169, 176),
  weight = c( 17, 21, 27, 29, 30, 32,
              33, 34, 36, 39, 42, 45,
              47, 49, 51, 52, 54, 55,
              62, 63, 64, 68, 71, 74,
              76))

pdf("tommy_growth_chart.pdf", paper="a4", width=0, height=0)
MygrowthFun(sex="m", type="bmi.adv", name="Tommy", surname="Cigalino", birth_date="july 08, 1980", mydataAA=tommy_data)
dev.off()

Tommaso MARTINO, 17/09/2011

REFERENCES

http://www.cdc.gov/growthcharts/cdc_charts.htm

http://www.cdc.gov/growthcharts/clinical_charts.htm

http://www.cdc.gov/growthcharts/percentile_data_files.htm

Kuczmarski RJ, Ogden CL, Guo, SS, et al. CDC growth charts for the United States: Methods and Development. Vital Health Stat; 11 (246) National Center for Health Statistics. 2002.

R is a cool sound editor!

2011-09-07T16:43:00.007+02:00

The capabilities of R are definitely endless! After my previous posts about some easy image editing in R (they are here, and here), now it is time to explore whether R is capable of sound editing!

Just for fun, here I created a function that receives a phone number (or another sequence of digits) and returns the melody you would hear if you pressed that sequence on your home phone... =D

It requires the sound library, and here's the code.
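
The idea behind such a function can be sketched in base R alone. This is not the original PlayTel code: it is a minimal illustration that assumes the standard DTMF keypad frequencies, and the names dtmf_tone and dtmf_melody, the tone length, the gap and the sampling rate are all hypothetical choices. It returns a plain numeric vector of samples rather than a sound-package object.

```r
# Standard DTMF keypad: each key is the sum of one low and one high sine tone
dtmf_keys <- matrix(c("1", "2", "3", "A",
                      "4", "5", "6", "B",
                      "7", "8", "9", "C",
                      "*", "0", "#", "D"), nrow = 4, byrow = TRUE)
dtmf_low  <- c(697, 770, 852, 941)      # Hz, one per keypad row
dtmf_high <- c(1209, 1336, 1477, 1633)  # Hz, one per keypad column

# Samples (in [-1, 1]) for a single key press
dtmf_tone <- function(key, duration = 0.2, rate = 8000) {
  pos <- which(dtmf_keys == toupper(key), arr.ind = TRUE)
  if (nrow(pos) != 1) stop("unknown key: ", key)
  tt <- (seq_len(round(duration * rate)) - 1) / rate
  0.5 * (sin(2 * pi * dtmf_low[pos[1, "row"]] * tt) +
         sin(2 * pi * dtmf_high[pos[1, "col"]] * tt))
}

# Concatenate the tones of a whole sequence, with a short silence after each
dtmf_melody <- function(number, duration = 0.2, gap = 0.05, rate = 8000) {
  keys <- strsplit(number, "")[[1]]
  silence <- numeric(round(gap * rate))
  unlist(lapply(keys, function(k) c(dtmf_tone(k, duration, rate), silence)))
}

melody <- dtmf_melody("556c885a4623#")
length(melody) / 8000   # total duration in seconds
```

The resulting vector can then be handed to whichever audio package is used for playback or for writing a wave file.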

Now you can simply create your phone melody =)

s2 <- PlayTel("556c885a4623#")

You can listen to it with the command:

play(s2)

(NOTE: on Windows 7 I was unable to find a wave player that works in batch mode, such as mplay32.exe, so this command doesn't work there. It works on Windows XP.)

You can save the output using the command:

saveSample(s2, "tel.wav")

(This command works on Windows 7)

Here is an example of the output:

Have fun!! =)

R is a cool image editor #2: Dithering algorithms

2011-08-29T11:11:00.002+02:00

Here I implemented in R some dithering algorithms:
- Floyd-Steinberg dithering
- Bill Atkinson dithering
- Jarvis-Judice-Ninke dithering
- Sierra 2-4a dithering
- Stucki dithering
- Burkes dithering
- Sierra2 dithering
- Sierra3 dithering

For each algorithm, I wrote a 2-dimensional convolution function (a matrix passing over a matrix); it is slow because I didn't implement any speed-up tricks. It could easily be implemented in C and then called from R for a faster solution.
Then, a function to transform a grey image into a grey-dithered image is provided, with an example. The rimage library was used for loading and displaying images (see the other post, R is a cool image editor).
These functions can easily be re-coded for an RGB image.
Only the first code is commented, because they're all very similar.
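
To give an idea of what one of these filters does, here is a minimal Floyd-Steinberg sketch that works on a plain numeric matrix of grey levels in [0, 1]. It is not the grey2FSdith function used below: the name fs_dither is hypothetical, it does not depend on rimage, and it uses the standard error weights 7/16, 3/16, 5/16 and 1/16.

```r
fs_dither <- function(img) {
  h <- nrow(img)
  w <- ncol(img)
  for (i in seq_len(h)) {          # scan rows top to bottom
    for (j in seq_len(w)) {        # and pixels left to right
      old   <- img[i, j]
      quant <- if (old >= 0.5) 1 else 0   # nearest of the two output levels
      img[i, j] <- quant
      err <- old - quant
      # push the quantization error onto the not-yet-visited neighbours
      if (j < w)          img[i, j + 1]     <- img[i, j + 1]     + err * 7 / 16
      if (i < h && j > 1) img[i + 1, j - 1] <- img[i + 1, j - 1] + err * 3 / 16
      if (i < h)          img[i + 1, j]     <- img[i + 1, j]     + err * 5 / 16
      if (i < h && j < w) img[i + 1, j + 1] <- img[i + 1, j + 1] + err * 1 / 16
    }
  }
  img
}

# A horizontal grey gradient as a test image
grad <- matrix(rep(seq(0, 1, length.out = 32), each = 16), nrow = 16)
dith <- fs_dither(grad)
image(t(dith), col = c("black", "white"), axes = FALSE)
```

The other filters in the list differ mainly in the set of neighbours and the weights used to spread the error.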


library(rimage)

y <- read.jpeg("valve.jpg")

plot(y)

Floyd-Steinberg dithering

plot(normalize(grey2FSdith(rgb2grey(y))))

Bill Atkinson dithering

plot(normalize(grey2ATKdith(rgb2grey(y))))

Jarvis-Judice-Ninke dithering

plot(normalize(grey2JJNdith(rgb2grey(y))))

Sierra 2-4a dithering filter

plot(normalize(grey2S24adith(rgb2grey(y))))

Stucki dithering

plot(normalize(grey2Stucki(rgb2grey(y))))

Burkes dithering

plot(normalize(grey2Burkes(rgb2grey(y))))

Sierra2 dithering

plot(normalize(grey2Sierra2(rgb2grey(y))))

Sierra3 dithering

plot(normalize(grey2Sierra3(rgb2grey(y))))

Benford's law, or the First-digit law

2011-08-25T23:30:00.007+02:00

Benford's law, also called the first-digit law, states that in lists of numbers from many (but not all) real-life sources of data, the leading digit is distributed in a specific, non-uniform way. According to this law, the first digit is 1 about 30% of the time, and larger digits occur as the leading digit with lower and lower frequency, to the point where 9 as a first digit occurs less than 5% of the time.
Wikipedia, retrieved 08/25/2011
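
The percentages quoted above follow from the usual statement of the law, in which the probability of leading digit d is log10(1 + 1/d). A short sketch that tabulates these expected frequencies:

```r
digits <- 1:9

# Expected frequency of each leading digit under Benford's law
benford_prob <- log10(1 + 1 / digits)
names(benford_prob) <- digits

round(benford_prob, 3)
sum(benford_prob)   # the nine probabilities add up to 1
```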

R simulation:

library(MASS)

benford <- function(m, n){
  list <- c()

  # compute all m^i, for i = 1, 2, ..., n
  for(i in 1:n){
    list[i] <- m^i
  }

  # a function to extract the first digit from a number
  bben <- function(k){
    as.numeric(head(strsplit(as.character(k),'')[[1]],n=1))
  }

  # extract the first digit from all numbers computed
  first.digit <- sapply(list, bben)

  # plot frequency of first digits
  truehist(first.digit, nbins=10, main=m)
}

par(mfrow=c(2,2))
benford(2,1000)
benford(3,640) # if n is greater, it returns "inf" (on my pc)
benford(4,500)
benford(5,440)

How to plot points, regression line and residuals

2011-06-16T09:52:00.002+02:00


x <- c(173, 169, 176, 166, 161, 164, 160, 158, 180, 187)
y <- c(80, 68, 72, 75, 70, 65, 62, 60, 85, 92)

# plot scatterplot and the regression line
mod1 <- lm(y ~ x)
plot(x, y, xlim=c(min(x)-5, max(x)+5), ylim=c(min(y)-10, max(y)+10))
abline(mod1, lwd=2)



# calculate residuals and predicted values
res <- signif(residuals(mod1), 5)
pre <- predict(mod1)

# plot distances between points and the regression line
segments(x, y, x, pre, col="red")

# add labels (res values) to points
library(calibrate)
textxy(x, y, res, cx=0.7)

R is a cool image editor!

2010-11-07T11:40:00.003+01:00

Here I present some functions I wrote to recreate some of the most common image effects available in any image editor.
They require the rimage library.
To load the image, use:

y <- read.jpeg("path")

To display the image, use:

plot(y)

Original image

Sepia tone

rgb2sepia <- function(img){
 iRed <- img[,,1]*255
 iGreen <- img[,,2]*255
 iBlue <- img[,,3]*255
 
 oRed <- iRed * .393 + iGreen * .769 + iBlue * .189
 oGreen <- iRed * .349 + iGreen * .686 + iBlue * .168
 oBlue <- iRed * .272 + iGreen * .534 + iBlue * .131
 
 qw <- array( c(oRed/255 , oGreen/255 , oBlue/255), dim=c(dim(iRed)[1],dim(iRed)[2],3) )
 
 imagematrix(qw, type="rgb")
}
 
plot(rgb2sepia(y))

Negative

rgb2neg <- function(img){
 iRed <- img[,,1]
 iGreen <- img[,,2]
 iBlue <- img[,,3]
 
 oRed <- (1 - iRed)
 oGreen <- (1 - iGreen)
 oBlue <- (1 - iBlue)
 
 qw <- array( c(oRed, oGreen, oBlue), dim=c(dim(iRed)[1],dim(iRed)[2],3) )
 
 imagematrix(qw, type="rgb")
}
 
plot(rgb2neg(y))

Pixelation

pixmatr <- function(a, n){
 aa <- seq(1,dim(a)[1],n)
 ll <- seq(1,dim(a)[2],n)
 
 for(i in 1:(length(aa)-1) ){
  for(j in 1:(length(ll)-1) ){
   sub1 <- a[aa[i]:(aa[i+1]-1),ll[j]:(ll[j+1]-1)]
   k <- mean(sub1)
   sub1m <- matrix( rep(k, n*n), n, n)
   a[aa[i]:(aa[i+1]-1),ll[j]:(ll[j+1]-1)] <- sub1m
   }
  }
 
 for(j in 1:(length(ll)-1) ){
  sub1 <- a[max(aa):dim(a)[1],ll[j]:(ll[j+1]-1)]
  k <- mean(sub1)
  sub1m <- matrix( rep(k, nrow(sub1)*ncol(sub1)), nrow(sub1), ncol(sub1))
  a[max(aa):dim(a)[1],ll[j]:(ll[j+1]-1)] <- sub1m
 }
 
 for(i in 1:(length(aa)-1) ){
  sub1 <- a[aa[i]:(aa[i+1]-1),max(ll):dim(a)[2]]
  k <- mean(sub1)
  sub1m <- matrix( rep(k, nrow(sub1)*ncol(sub1)), nrow(sub1), ncol(sub1))
  a[aa[i]:(aa[i+1]-1),max(ll):dim(a)[2]] <- sub1m
 }
 
 sub1 <- a[max(aa):dim(a)[1], max(ll):dim(a)[2]]
 k <- mean(sub1)
 sub1m <- matrix( rep(k, nrow(sub1)*ncol(sub1)), nrow(sub1), ncol(sub1))
 a[max(aa):dim(a)[1], max(ll):dim(a)[2]] <- sub1m
 
a
}
 
rgb2pix <- function(img,n){
 iRed <- img[,,1]*255
 iGreen <- img[,,2]*255
 iBlue <- img[,,3]*255
 
 oRed <- pixmatr(iRed,n)
 oGreen <- pixmatr(iGreen,n)
 oBlue <- pixmatr(iBlue,n)
 
 qw <- array( c(oRed/255 , oGreen/255 , oBlue/255), dim=c(dim(iRed)[1],dim(iRed)[2],3) )
 
 imagematrix(qw, type="rgb")
}
 
plot(rgb2pix(y, 6))
plot(rgb2pix(y, 10))

Remove red

rgb2blu <- function(img){
 iRed <- img[,,1]
 iGreen <- img[,,2]
 iBlue <- img[,,3]
 
 oRed <- matrix(0, dim(iRed)[1], dim(iRed)[2])
 oGreen <- iGreen
 oBlue <- iBlue
 
 qw <- array( c(oRed, oGreen, oBlue), dim=c(dim(iRed)[1],dim(iRed)[2],3) )
 
 imagematrix(qw, type="rgb")
}
 
plot(rgb2blu(y))

Remove green

rgb2vio <- function(img){
 iRed <- img[,,1]
 iGreen <- img[,,2]
 iBlue <- img[,,3]
 
 oRed <- iRed
 oGreen <- matrix(0, dim(iRed)[1], dim(iRed)[2])
 oBlue <- iBlue
 
 qw <- array( c(oRed, oGreen, oBlue), dim=c(dim(iRed)[1],dim(iRed)[2],3) )
 
 imagematrix(qw, type="rgb")
}
 
plot(rgb2vio(y))

Remove blue

rgb2yel <- function(img){
 iRed <- img[,,1]
 iGreen <- img[,,2]
 iBlue <- img[,,3]
 
 oRed <- iRed
 oGreen <- iGreen
 oBlue <- matrix(0, dim(iRed)[1], dim(iRed)[2])
 
 qw <- array( c(oRed, oGreen, oBlue), dim=c(dim(iRed)[1],dim(iRed)[2],3) )
 
 imagematrix(qw, type="rgb")
}
 
plot(rgb2yel(y))

Adjust brightness

rgb2bri <- function(img, n){
 iRed <- img[,,1]
 iGreen <- img[,,2]
 iBlue <- img[,,3]
 
 oRed <- iRed + (iRed * n)
 oGreen <- iGreen + (iGreen * n)
 oBlue <- iBlue + (iBlue * n)
 
 qw <- array( c(oRed, oGreen, oBlue), dim=c(dim(iRed)[1],dim(iRed)[2],3) )
 
 imagematrix(qw, type="rgb")
}
 
plot(rgb2bri(y, +0.5))
plot(rgb2bri(y, -0.5))

Truncate colors into bands (posterize)

rgb2ban <- function(img, n){
 iRed <- img[,,1]*255
 iGreen <- img[,,2]*255
 iBlue <- img[,,3]*255
 
 band_size <- trunc(255/n)
 
 oRed <- band_size * trunc(iRed / band_size)
 oGreen <- band_size * trunc(iGreen / band_size)
 oBlue <- band_size * trunc(iBlue / band_size)
 
 qw <- array( c(oRed/255, oGreen/255, oBlue/255), dim=c(dim(iRed)[1],dim(iRed)[2],3) )
 
 imagematrix(qw, type="rgb")
}
 
plot(rgb2ban(y, 5))
plot(rgb2ban(y, 10))

Solarize

rgb2sol <- function(img){
 iRed <- img[,,1]*255
 iGreen <- img[,,2]*255
 iBlue <- img[,,3]*255
 
 for(i in 1:dim(iRed)[1]){
  for(j in 1:dim(iRed)[2]){
   if(iRed[i,j]<128) iRed[i,j] <- 255-2*iRed[i,j]
   else iRed[i,j] <- 2*(iRed[i,j]-128)
  }
 }
 
 for(i in 1:dim(iGreen)[1]){
  for(j in 1:dim(iGreen)[2]){
   if(iGreen[i,j]<128) iGreen[i,j] <- 255-2*iGreen[i,j]
   else iGreen[i,j] <- 2*(iGreen[i,j]-128)
  }
 }
 
 for(i in 1:dim(iBlue)[1]){
  for(j in 1:dim(iBlue)[2]){
   if(iBlue[i,j]<128) iBlue[i,j] <- 255-2*iBlue[i,j]
   else iBlue[i,j] <- 2*(iBlue[i,j]-128)
  }
 }
 
 qw <- array( c(iRed/255, iGreen/255, iBlue/255), dim=c(dim(iRed)[1],dim(iRed)[2],3) )
 
 imagematrix(qw, type="rgb")
}
 
plot(rgb2sol(y))
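
The pixel-by-pixel loops in rgb2sol can also be written in vectorized form, which is usually much faster in R. A minimal sketch for a single channel, assuming values on the 0-255 scale as above (the name solarize_channel is hypothetical):

```r
# Apply the same solarize rule to a whole channel at once
solarize_channel <- function(ch) {
  ifelse(ch < 128, 255 - 2 * ch, 2 * (ch - 128))
}

m <- matrix(c(0, 100, 128, 255), 2, 2)   # tiny test channel
out <- solarize_channel(m)
out
```

ifelse() keeps the matrix dimensions of its test argument, so the result can be used in place of each looped channel.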

Fast matrix inversion

2010-10-19T20:13:00.002+02:00

Very similar to what has been done to create a function to perform fast multiplication of large matrices using the Strassen algorithm (see previous post), now we write the functions to quickly calculate the inverse of a matrix.

To avoid rewriting pages and pages of comments and formulas, as I did for matrix multiplication, this time I'll show you directly the code of the function (the reasoning behind it is quite similar). Please, copy and paste all the code in an external editor to see it properly.

Function strassenInv(A)


strassenInv <- function(A){

 div4 <- function(A, r){
  A <- list(A)
  A11 <- A[[1]][1:(r/2),1:(r/2)]
  A12 <- A[[1]][1:(r/2),(r/2+1):r]
  A21 <- A[[1]][(r/2+1):r,1:(r/2)]
  A22 <- A[[1]][(r/2+1):r,(r/2+1):r]
  A <- list(X11=A11, X12=A12, X21=A21, X22=A22)
  return(A)
 }

        if (nrow(A) != ncol(A)) 
          { stop("only square matrices can be inverted") }

 is.wholenumber <-
     function(x, tol = .Machine$double.eps^0.5)  abs(x - round(x)) < tol

 if ( (is.wholenumber(log(nrow(A), 2)) != TRUE) || (is.wholenumber(log(ncol(A), 2)) != TRUE) )
   { stop("only square matrices of dimension 2^k * 2^k can be inverted with Strassen method") }

 A <- div4(A, dim(A)[1])

 R1 <- solve(A$X11)
 R2 <- A$X21 %*% R1
 R3 <- R1 %*% A$X12
 R4 <- A$X21 %*% R3
 R5 <- R4 - A$X22
 R6 <- solve(R5)
 C12 <- R3 %*% R6
 C21 <- R6 %*% R2
 R7 <- R3 %*% C21
 C11 <- R1 - R7
 C22 <- -R6
 
 C <- rbind(cbind(C11,C12), cbind(C21,C22))

 return(C)
}

Function strassenInv2(A)


strassenInv2 <- function(A){

 div4 <- function(A, r){
  A <- list(A)
  A11 <- A[[1]][1:(r/2),1:(r/2)]
  A12 <- A[[1]][1:(r/2),(r/2+1):r]
  A21 <- A[[1]][(r/2+1):r,1:(r/2)]
  A22 <- A[[1]][(r/2+1):r,(r/2+1):r]
  A <- list(X11=A11, X12=A12, X21=A21, X22=A22)
  return(A)
 }

 strassen <- function(A, B){
  A <- div4(A, dim(A)[1])
  B <- div4(B, dim(B)[1])
  M1 <- (A$X11+A$X22) %*% (B$X11+B$X22)
  M2 <- (A$X21+A$X22) %*% B$X11
  M3 <- A$X11 %*% (B$X12-B$X22)
  M4 <- A$X22 %*% (B$X21-B$X11)
  M5 <- (A$X11+A$X12) %*% B$X22
  M6 <- (A$X21-A$X11) %*% (B$X11+B$X12)
  M7 <- (A$X12-A$X22) %*% (B$X21+B$X22)

  C11 <- M1+M4-M5+M7
  C12 <- M3+M5
  C21 <- M2+M4
  C22 <- M1-M2+M3+M6
 
  C <- rbind(cbind(C11,C12), cbind(C21,C22))
  return(C)
 }

        if (nrow(A) != ncol(A)) 
          { stop("only square matrices can be inverted") }

 is.wholenumber <-
     function(x, tol = .Machine$double.eps^0.5)  abs(x - round(x)) < tol

 if ( (is.wholenumber(log(nrow(A), 2)) != TRUE) || (is.wholenumber(log(ncol(A), 2)) != TRUE) )
   { stop("only square matrices of dimension 2^k * 2^k can be inverted with Strassen method") }

 A <- div4(A, dim(A)[1])

 R1 <- strassenInv(A$X11)
 R2 <- strassen(A$X21 , R1)
 R3 <- strassen(R1 , A$X12)
 R4 <- strassen(A$X21 , R3)
 R5 <- R4 - A$X22
 R6 <- strassenInv(R5)
 C12 <- strassen(R3 , R6)
 C21 <- strassen(R6 , R2)
 R7 <- strassen(R3 , C21)
 C11 <- R1 - R7
 C22 <- -R6
 
 C <- rbind(cbind(C11,C12), cbind(C21,C22))

 return(C)
}

Function strassenInv3(A)


strassenInv3 <- function(A){

 div4 <- function(A, r){
  A <- list(A)
  A11 <- A[[1]][1:(r/2),1:(r/2)]
  A12 <- A[[1]][1:(r/2),(r/2+1):r]
  A21 <- A[[1]][(r/2+1):r,1:(r/2)]
  A22 <- A[[1]][(r/2+1):r,(r/2+1):r]
  A <- list(X11=A11, X12=A12, X21=A21, X22=A22)
  return(A)
 }

 strassen <- function(A, B){
  A <- div4(A, dim(A)[1])
  B <- div4(B, dim(B)[1])
  M1 <- (A$X11+A$X22) %*% (B$X11+B$X22)
  M2 <- (A$X21+A$X22) %*% B$X11
  M3 <- A$X11 %*% (B$X12-B$X22)
  M4 <- A$X22 %*% (B$X21-B$X11)
  M5 <- (A$X11+A$X12) %*% B$X22
  M6 <- (A$X21-A$X11) %*% (B$X11+B$X12)
  M7 <- (A$X12-A$X22) %*% (B$X21+B$X22)

  C11 <- M1+M4-M5+M7
  C12 <- M3+M5
  C21 <- M2+M4
  C22 <- M1-M2+M3+M6
 
  C <- rbind(cbind(C11,C12), cbind(C21,C22))
  return(C)
 }

 strassen2 <- function(A, B){
  A <- div4(A, dim(A)[1])
  B <- div4(B, dim(B)[1])
  M1 <- strassen((A$X11+A$X22) , (B$X11+B$X22))
  M2 <- strassen((A$X21+A$X22) , B$X11)
  M3 <- strassen(A$X11 , (B$X12-B$X22))
  M4 <- strassen(A$X22 , (B$X21-B$X11))
  M5 <- strassen((A$X11+A$X12) , B$X22)
  M6 <- strassen((A$X21-A$X11) , (B$X11+B$X12))
  M7 <- strassen((A$X12-A$X22) , (B$X21+B$X22))

  C11 <- M1+M4-M5+M7
  C12 <- M3+M5
  C21 <- M2+M4
  C22 <- M1-M2+M3+M6

  C <- rbind(cbind(C11,C12), cbind(C21,C22))
  return(C)
 }

        if (nrow(A) != ncol(A)) 
          { stop("only square matrices can be inverted") }

 is.wholenumber <-
     function(x, tol = .Machine$double.eps^0.5)  abs(x - round(x)) < tol

 if ( (is.wholenumber(log(nrow(A), 2)) != TRUE) || (is.wholenumber(log(ncol(A), 2)) != TRUE) )
   { stop("only square matrices of dimension 2^k * 2^k can be inverted with Strassen method") }

 A <- div4(A, dim(A)[1])

 R1 <- strassenInv2(A$X11)
 R2 <- strassen2(A$X21 , R1)
 R3 <- strassen2(R1 , A$X12)
 R4 <- strassen2(A$X21 , R3)
 R5 <- R4 - A$X22
 R6 <- strassenInv2(R5)
 C12 <- strassen2(R3 , R6)
 C21 <- strassen2(R6 , R2)
 R7 <- strassen2(R3 , C21)
 C11 <- R1 - R7
 C22 <- -R6
 
 C <- rbind(cbind(C11,C12), cbind(C21,C22))

 return(C)
}

We now run some tests. First we check that the functions invert the matrix correctly, comparing them with the results of the standard R function solve():


A <- matrix(trunc(rnorm(512*512)*100), 512,512)

all( round(solve(A),8) == round(strassenInv(A),8) )
[1] TRUE

all( round(solve(A),8) == round(strassenInv2(A),8) )
[1] TRUE

all( round(solve(A),6) == round(strassenInv3(A),6) )
[1] TRUE

The functions perform the operations correctly, but there is an approximation problem: the first two functions agree with solve() to the eighth decimal place, while the third agrees only to the sixth. This is probably not an error in the calculation itself but an effect of floating-point arithmetic: numbers are stored in a finite binary format, and the rounding errors accumulate over the extra recursive multiplications and subtractions.
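
Because of this rounding, comparing rounded values with == is fragile; measuring the largest absolute difference, or using all.equal(), is a more robust check. The sketch below re-implements one level of the block inversion in compact form purely for illustration (the name block_inverse and the small test matrix are hypothetical) and compares it with solve().

```r
# One level of block inversion via the Schur complement (same steps as strassenInv)
block_inverse <- function(A) {
  h  <- nrow(A) / 2
  i1 <- 1:h
  i2 <- (h + 1):nrow(A)
  R1  <- solve(A[i1, i1])
  R2  <- A[i2, i1] %*% R1
  R3  <- R1 %*% A[i1, i2]
  R6  <- solve(A[i2, i1] %*% R3 - A[i2, i2])
  C12 <- R3 %*% R6
  C21 <- R6 %*% R2
  rbind(cbind(R1 - R3 %*% C21, C12), cbind(C21, -R6))
}

set.seed(42)
# A large diagonal keeps the matrix and its blocks comfortably invertible
A <- matrix(rnorm(64), 8, 8) + diag(8) * 10

max(abs(block_inverse(A) - solve(A)))        # largest absolute difference
isTRUE(all.equal(block_inverse(A), solve(A)))
max(abs(A %*% block_inverse(A) - diag(8)))   # residual of A %*% A^-1 = I
```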

Now we analyze the computation time. See in the table the result, obtained by running the following code:

Time computation


A <- matrix(trunc(rnorm(512*512)*100), 512,512)
system.time(solve(A))
system.time(strassenInv(A))
system.time(strassenInv2(A))
system.time(strassenInv3(A))

A <- matrix(trunc(rnorm(1024*1024)*100), 1024,1024)
system.time(solve(A))
system.time(strassenInv(A))
system.time(strassenInv2(A))
system.time(strassenInv3(A))

A <- matrix(trunc(rnorm(2048*2048)*100), 2048,2048)
system.time(solve(A))
system.time(strassenInv(A))
system.time(strassenInv2(A))
system.time(strassenInv3(A))

A <- matrix(trunc(rnorm(4096*4096)*100), 4096,4096)
system.time(solve(A))
system.time(strassenInv(A))
system.time(strassenInv2(A))
system.time(strassenInv3(A))

The results are clear: using a modification of Strassen's algorithm for matrix inversion gives a real time saving.

Please remember these two recommendations, already made:
- The code can be improved, and if anyone wants to help me, I will be happy to update it
- If you find these functions useful for any work, a citation is always welcome (contact me at my e-mail for details)

Fast matrix multiplication in R: Strassen's algorithm

2010-10-18T15:31:00.005+02:00

I tried to implement Strassen's algorithm for multiplying big matrices in R.

Here I present a pdf with some elements of theory, some examples and a possible solution in R.
I'm not a programmer, so the function is not optimized, but it works.

I want to thank G. Grothendieck, who suggested on Stack Overflow a very nice way to create a bigger square matrix starting from a small one.

This is just a first version of the function; it needs more work. If someone wants to collaborate, I'll be very happy.
Finally, if you find my code useful for your work, I'd love to be cited (ask me via e-mail how to cite me: todoslogos -at- gmail . com).

Convert decimal to IEEE-754 in R

2010-10-06T12:47:00.003+02:00

For some theory on the standard IEEE-754, you can read the Wikipedia page. Here I will post only the code of the function to make the conversion in R.

First we write some functions to convert decimal numbers to binary numbers:


decInt_to_8bit <- function(x, precs) {
q <- c()
r <- c()
xx <- c()
for(i in 1:precs){
xx[1] <- x
q[i] <- xx[i] %/% 2
r[i] <- xx[i] %% 2
xx[i+1] <- q[i]
}
rr <- rev(r)
return(rr)
}

devDec_to_8bit <- function(x, precs) {
nas <- c()
nbs <- c()
xxs <- c()
for(i in 1:precs)
{
xxs[1] <- x*2
nas[i] <- (xxs[i]) - floor(xxs[i])
nbs[i] <- trunc(xxs[i], 1)
xxs[i+1] <- nas[i]*2
}
return(nbs)
}

For example, in 8-bit:


decInt_to_8bit(11, 8)
[1] 0 0 0 0 1 0 1 1


devDec_to_8bit(0.625, 8)
[1] 1 0 1 0 0 0 0 0


devDec_to_8bit(0.3, 8)
[1] 0 1 0 0 1 1 0 0
devDec_to_8bit(0.3, 16)
[1] 0 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0

We can delete the leading or trailing extra zeros from the vectors using these functions:


remove.zero.aft <- function(a) {
n <- length(a)
for(i in n:1){
if (a[n]==0) a <- a[-n]
else return(a)
n <- n-1
}
}

remove.zero.bef <- function(a) {
n <- length(a)
for(i in 1:n){
if (a[1]==0) a <- a[-1]
else return(a)
}
}

So we have:


remove.zero.bef(decInt_to_8bit(11, 8))
[1] 1 0 1 1

remove.zero.aft(devDec_to_8bit(0.625, 8))
[1] 1 0 1

Binding these functions, we have:


dec.to.nbit <- function(x,n) {
aa <- abs(trunc(x, 1))
bb <- abs(x) - abs(trunc(x))

q <- c()
r <- c()
xx <- c()
for(i in 1:n){
xx[1] <- aa
q[i] <- xx[i] %/% 2
r[i] <- xx[i] %% 2
xx[i+1] <- q[i]
}
rr <- rev(r)

nas <- c()
nbs <- c()
xxs <- c()
for(i in 1:n)
{
xxs[1] <- bb*2
nas[i] <- (xxs[i]) - floor(xxs[i])
nbs[i] <- trunc(xxs[i], 1)
xxs[i+1] <- nas[i]*2
}

bef <- paste(remove.zero.bef(rr), collapse="")
aft <- paste(remove.zero.aft(nbs), collapse="")
bef.aft <- c(bef, aft)
strings <- paste(bef.aft, collapse=".")
return(strings)
}

Example:


dec.to.nbit(11.625,8)
[1] "1011.101"

Now we can write the code for the decimal to IEEE-754 single float conversion in R:


dec.to.ieee754 <- function(x) {
aa <- abs(trunc(x, 1))
bb <- abs(x) - abs(trunc(x))

rr <- decInt_to_8bit(aa, 32)

ppc <- 24 - length(remove.zero.bef(rr))

nbs <- devDec_to_8bit(bb, ppc)

bef <- remove.zero.bef(rr)
aft <- remove.zero.aft(nbs)

exp <- length(bef) - 1
mantissa <- c(bef[-1], aft)

exp.bin <- decInt_to_8bit(exp + 127, 16)
exp.bin <- remove.zero.bef(exp.bin)

first <- c()
if (sign(x)==1) first=c(0)
if (sign(x)==-1) first=c(1)

ieee754 <- c(first, exp.bin, mantissa, rep(0, 23-length(mantissa)))
ieee754 <- paste(ieee754, collapse="")

return(ieee754)
}

The numbers 11.625 and 11.33 in IEEE-754 are:


dec.to.ieee754(11.625)
[1] "01000001001110100000000000000000"

dec.to.ieee754(11.33)
[1] "01000001001101010100011110101110"

You can verify the output with this Online Binary-Decimal Converter
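
The result can also be cross-checked inside R itself. The sketch below relies on base R's writeBin(), which can store a number as a 4-byte single-precision float, and on rawToBits(); the helper name float_bits is hypothetical. Note that writeBin rounds to the nearest single-precision value, so for numbers that are not exactly representable the last bit may differ from a conversion that truncates.

```r
# Bit pattern of x stored as an IEEE-754 single-precision float
float_bits <- function(x) {
  # 4 bytes, most significant byte first
  bytes <- writeBin(as.numeric(x), raw(), size = 4, endian = "big")
  # rawToBits() returns the least significant bit first, hence rev()
  per_byte <- vapply(seq_along(bytes), function(i) {
    paste(rev(as.integer(rawToBits(bytes[i]))), collapse = "")
  }, character(1))
  paste(per_byte, collapse = "")
}

float_bits(11.625)
# [1] "01000001001110100000000000000000"
```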

Bhapkar V test

2010-04-28T18:22:00.004+02:00

This is the code to perform the Bhapkar V test. I wrote it quickly, in 2 hours, so the code is quite crude and could be done better. I will correct it as soon as possible.

WARNING: it works *ONLY* with 3 groups, for now!


bhapkar.test.3g <- function(data1=list){

sample <- c()
for(i in 1:length(data1)){
sample <- c(sample, rep(i, length(data1[[i]])))
}

obs <- c()
for(i in 1:length(data1)){
obs <- c(obs, data1[[i]])
}
rank <- rank(obs)

cplets <- list()
vec <- c()
for(i in 1:length(data1[[1]])){
vec <- c(vec, (length(data1[[2]][data1[[2]]>data1[[1]][i]]) * length(data1[[3]][data1[[3]]>data1[[1]][i]])))
}
cplets[[1]] <- vec

vec <- c()
for(i in 1:length(data1[[2]])){
vec <- c(vec, (length(data1[[1]][data1[[1]]>data1[[2]][i]]) * length(data1[[3]][data1[[3]]>data1[[2]][i]])))
}
cplets[[2]] <- vec

vec <- c()
for(i in 1:length(data1[[3]])){
vec <- c(vec, (length(data1[[2]][data1[[2]]>data1[[3]][i]]) * length(data1[[1]][data1[[1]]>data1[[3]][i]])))
}
cplets[[3]] <- vec

cplets1 <- c(cplets[[1]], cplets[[2]], cplets[[3]])
mydata <- data.frame(obs=obs, sample=sample, rank=rank, cplets=cplets1)

v1 <- sum(cplets[[1]])
v2 <- sum(cplets[[2]])
v3 <- sum(cplets[[3]])

vtot <- v1+v2+v3
u1 <- v1/vtot
u2 <- v2/vtot
u3 <- v3/vtot
u <- c(u1,u2,u3)

lengths <- c(length(data1[[1]]), length(data1[[2]]), length(data1[[3]]))
N <- sum(lengths)
P <- c(lengths / N)
ngroup <- length(data1)

V <- N * (2*length(data1)-1)* (sum(P*((u-1/ngroup)^2)) - (sum(P*((u-1/ngroup))))^2)

# p-value: upper tail of the chi-square distribution (large V is evidence against H0)
prop <- pchisq(V, df=length(data1)-1, lower.tail=FALSE)
names(V) = "V = "
method = "Bhapkar V-test"
rval <- list(method = method, statistic = V, p.value = prop)
class(rval) = "htest"
return(rval)

}

An example:


a <- c(42, 46, 48.5, 49, 68, 51)
b <- c(70.5, 54, 60,72)
c <- c(66, 54, 43, 105, 94)

mydata <- list(a,b,c)

bhapkar.test.3g(mydata)


        Bhapkar V-test

data:  
V = 6.7713, p-value = 0.03386

REFERENCES:
Statistical analysis of nonnormal data
By J. V. Deshpande, A. P. Gore, A. Shanubhogue
pag. 61

Latin squares design in R

2010-01-06T13:21:00.003+01:00

The Latin square design is used where the researcher desires to control the variation in an experiment that is related to rows and columns in the field.
Remember that:
* Treatments are assigned at random within rows and columns, with each treatment once per row and once per column.
* There are equal numbers of rows, columns, and treatments.
* Useful where the experimenter desires to control variation in two different directions

The formulas used for this kind of three-way ANOVA are:

Source of variation   Degrees of freedom (a)   Sum of squares (SSQ)   Mean square (MS)      F
Rows (R)              r-1                      SSQ_R                  SSQ_R/(r-1)           MS_R/MS_E
Columns (C)           r-1                      SSQ_C                  SSQ_C/(r-1)           MS_C/MS_E
Treatments (Tr)       r-1                      SSQ_Tr                 SSQ_Tr/(r-1)          MS_Tr/MS_E
Error (E)             (r-1)(r-2)               SSQ_E                  SSQ_E/((r-1)(r-2))
Total (Tot)           r²-1                     SSQ_Tot
(a) where r = number of treatments = number of rows = number of columns.

Suppose you want to analyse the productivity of 5 kinds of fertilizer, 5 kinds of tillage, and 5 kinds of seed. The data are organized in a Latin square design, as follows:


             treatA  treatB  treatC  treatD  treatE
fertilizer1  "A42"   "C47"   "B55"   "D51"   "E44"         
fertilizer2  "E45"   "B54"   "C52"   "A44"   "D50"         
fertilizer3  "C41"   "A46"   "D57"   "E47"   "B48"         
fertilizer4  "B56"   "D52"   "E49"   "C50"   "A43"         
fertilizer5  "D47"   "E49"   "A45"   "B54"   "C46"

The three factors are: fertilizer (fertilizer1:5), tillage (treatA:E), seed (A:E). The numbers are the productivity in cwt / year.

Now create a dataframe in R with these data:


fertil <- rep(c("fertil1", "fertil2", "fertil3", "fertil4", "fertil5"), times=5)  # fertil1..fertil5 within each tillage column
treat <- c(rep("treatA",5), rep("treatB",5), rep("treatC",5), rep("treatD",5), rep("treatE",5))
seed <- c("A","E","C","B","D", "C","B","A","D","E", "B","C","D","E","A", "D","A","E","C","B", "E","D","B","A","C")
freq <- c(42,45,41,56,47, 47,54,46,52,49, 55,52,57,49,45, 51,44,47,50,54, 44,50,48,43,46)
 
mydata <- data.frame(treat, fertil, seed, freq)

mydata

    treat  fertil seed freq
1  treatA fertil1    A   42
2  treatA fertil2    E   45
3  treatA fertil3    C   41
4  treatA fertil4    B   56
5  treatA fertil5    D   47
6  treatB fertil1    C   47
7  treatB fertil2    B   54
8  treatB fertil3    A   46
9  treatB fertil4    D   52
10 treatB fertil5    E   49
11 treatC fertil1    B   55
12 treatC fertil2    C   52
13 treatC fertil3    D   57
14 treatC fertil4    E   49
15 treatC fertil5    A   45
16 treatD fertil1    D   51
17 treatD fertil2    A   44
18 treatD fertil3    E   47
19 treatD fertil4    C   50
20 treatD fertil5    B   54
21 treatE fertil1    E   44
22 treatE fertil2    D   50
23 treatE fertil3    B   48
24 treatE fertil4    A   43
25 treatE fertil5    C   46

We can re-create the original table, using the matrix function:


matrix(mydata$seed, 5,5)

     [,1] [,2] [,3] [,4] [,5]
[1,] "A"  "C"  "B"  "D"  "E" 
[2,] "E"  "B"  "C"  "A"  "D" 
[3,] "C"  "A"  "D"  "E"  "B" 
[4,] "B"  "D"  "E"  "C"  "A" 
[5,] "D"  "E"  "A"  "B"  "C" 

matrix(mydata$freq, 5,5)

     [,1] [,2] [,3] [,4] [,5]
[1,]   42   47   55   51   44
[2,]   45   54   52   44   50
[3,]   41   46   57   47   48
[4,]   56   52   49   50   43
[5,]   47   49   45   54   46
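Before fitting the model it may be worth checking that the layout really is a Latin square, i.e. that each seed occurs exactly once for every fertilizer and once for every tillage. The following is a minimal sketch that rebuilds the three design vectors on its own; the names by_fertil, by_treat and is_latin are chosen here only for illustration.

```r
# Rebuild the design vectors (same order as in mydata)
fertil <- rep(paste0("fertil", 1:5), times = 5)
treat  <- rep(paste0("treat", LETTERS[1:5]), each = 5)
seed   <- c("A","E","C","B","D", "C","B","A","D","E", "B","C","D","E","A",
            "D","A","E","C","B", "E","D","B","A","C")

# Cross-tabulate seed against the two blocking factors
by_fertil <- table(fertil, seed)
by_treat  <- table(treat, seed)

# In a Latin square every cell of both tables is exactly 1
is_latin <- all(by_fertil == 1) && all(by_treat == 1)
is_latin
```

If is_latin is FALSE, the additive model freq ~ fertil+treat+seed is no longer a Latin square analysis and the table of degrees of freedom above does not apply.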

Before proceeding with the analysis of variance of this Latin square design, it is useful to draw boxplots, to get an idea of what to expect:


par(mfrow=c(2,2))
plot(freq ~ fertil+treat+seed, mydata)

Note that the differences between groups are small for the fertilizer, moderate for the tillage, and large for the seed.
Now we confirm these graphical observations with the ANOVA table:


myfit <- lm(freq ~ fertil+treat+seed, mydata)
anova(myfit)

Analysis of Variance Table

Response: freq
          Df  Sum Sq Mean Sq F value   Pr(>F)    
fertil     4  17.760   4.440  0.7967 0.549839    
treat      4 109.360  27.340  4.9055 0.014105 *  
seed       4 286.160  71.540 12.8361 0.000271 ***
Residuals 12  66.880   5.573                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Well, the boxplots were useful. Look at the significance of the F-tests:
- The difference between groups for the fertilizer is not significant (p-value > 0.1);
- The difference between groups for the tillage is significant (p-value < 0.05);
- The difference between groups for the seed is highly significant (p-value < 0.001).
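The entries of the ANOVA table can be reproduced by hand from the formulas in the Latin square table (SSQ_R, SSQ_C, SSQ_Tr, SSQ_E). The sketch below assumes the same 25 observations; the helper ssq() and the other variable names are introduced here only for illustration.

```r
fertil <- factor(rep(paste0("fertil", 1:5), times = 5))
treat  <- factor(rep(paste0("treat", LETTERS[1:5]), each = 5))
seed   <- factor(c("A","E","C","B","D", "C","B","A","D","E", "B","C","D","E","A",
                   "D","A","E","C","B", "E","D","B","A","C"))
freq   <- c(42,45,41,56,47, 47,54,46,52,49, 55,52,57,49,45,
            51,44,47,50,54, 44,50,48,43,46)

r     <- 5            # number of rows = columns = treatments
grand <- mean(freq)   # grand mean

# Sum of squares of a factor: r * sum of squared deviations of the group means
ssq <- function(f) r * sum((tapply(freq, f, mean) - grand)^2)

ssq_fertil <- ssq(fertil)
ssq_treat  <- ssq(treat)
ssq_seed   <- ssq(seed)
ssq_tot    <- sum((freq - grand)^2)
ssq_err    <- ssq_tot - ssq_fertil - ssq_treat - ssq_seed

# Mean squares and the F ratio for the seed
ms_err <- ssq_err / ((r - 1) * (r - 2))
f_seed <- (ssq_seed / (r - 1)) / ms_err

c(fertil = ssq_fertil, treat = ssq_treat, seed = ssq_seed, error = ssq_err)
f_seed
```

The sums of squares should agree with the Sum Sq column printed by anova(myfit), and f_seed with the F value reported for seed.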

Polynomial regression techniques

2009-09-05T20:26:00.000+02:00

Suppose we want to find a polynomial that approximates well the following dataset on the population of a certain Italian city from 1959 to 1969. The table summarizes the data:

$$\begin{tabular}{|l|l|}\hline Year & Population\\ \hline 1959&4835\\ 1960&4970\\ 1961&5085\\ 1962&5160\\ 1963&5310\\ 1964&5260\\ 1965&5235\\ 1966&5255\\ 1967&5235\\ 1968&5210\\ 1969&5175\\ \hline \end{tabular}$$

First we import the data into R:


Year <- c(1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969)
Population <- c(4835, 4970, 5085, 5160, 5310, 5260, 5235, 5255, 5235, 5210, 5175)

Now we create the dataframe named sample1:


sample1 <- data.frame(Year, Population)
sample1

   Year Population
1  1959       4835
2  1960       4970
3  1961       5085
4  1962       5160
5  1963       5310
6  1964       5260
7  1965       5235
8  1966       5255
9  1967       5235
10 1968       5210
11 1969       5175

At this point it may be useful to plot these values, to observe the trend and get an idea of the final polynomial function. For convenience we center the column Year around zero by subtracting 1964:


sample1$Year <- sample1$Year - 1964
sample1

   Year Population
1    -5       4835
2    -4       4970
3    -3       5085
4    -2       5160
5    -1       5310
6     0       5260
7     1       5235
8     2       5255
9     3       5235
10    4       5210
11    5       5175

Put the values on a chart


plot(sample1$Year, sample1$Population, type="b")

At this point we can start the search for a polynomial model that adequately approximates our data. First, we specify that we want a polynomial function of X, i.e. a raw polynomial, which is different from an orthogonal polynomial. This is an important distinction, because the commands and the results in R differ in the two cases. So we want a function of X like:

$$f(x)=\beta_0+\beta_1x+\beta_2x^2+\beta_3x^3+ ... +\beta_nx^n$$

At what degree should the polynomial stop? It depends on the precision that we seek. The greater the degree of the polynomial, the more closely the model fits the data, but the harder the computation; we must also verify the significance of the coefficients that are found. But let's get straight to the point.

In R there are two equivalent ways to fit a (non-orthogonal) polynomial regression model. Suppose we seek the values of the beta coefficients for polynomials of degree 1, 2 and 3:


fit1 <- lm(sample1$Population ~ sample1$Year)
fit2 <- lm(sample1$Population ~ sample1$Year + I(sample1$Year^2))
fit3 <- lm(sample1$Population ~ sample1$Year + I(sample1$Year^2) + I(sample1$Year^3))

Or we can write more quickly, for polynomials of degree 2 and 3:


fit2b <- lm(sample1$Population ~ poly(sample1$Year, 2, raw=TRUE))
fit3b <- lm(sample1$Population ~ poly(sample1$Year, 3, raw=TRUE))

The function poly is useful if you want a polynomial of high degree, because it avoids writing the formula explicitly. If we specify raw=TRUE, the two methods give the same output; if we do not specify raw=TRUE (or we set raw=F), the function poly gives us the beta parameters of an orthogonal polynomial, which differ from those of the general formula written above, although both models are valid.
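To see that the raw and the orthogonal parametrization describe the same curve, one can compare the fitted values of the two models. This is a small self-contained sketch; the names fit_raw, fit_orth and same_fit are introduced here for illustration, and Year is already centered on 1964.

```r
Year       <- -5:5
Population <- c(4835, 4970, 5085, 5160, 5310, 5260, 5235, 5255, 5235, 5210, 5175)

fit_raw  <- lm(Population ~ poly(Year, 2, raw = TRUE))  # coefficients of 1, x, x^2
fit_orth <- lm(Population ~ poly(Year, 2))              # orthogonal basis

# The coefficients differ ...
coef(fit_raw)
coef(fit_orth)

# ... but the fitted values are the same
same_fit <- isTRUE(all.equal(fitted(fit_raw), fitted(fit_orth)))
same_fit
```

Only the raw coefficients can be plugged directly into the formula for f(x) given above.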

Let's look at the output.


summary(fit2)
## or summary(fit2b)

Call:
lm(formula = sample1$Population ~ sample1$Year + I(sample1$Year^2))

Residuals:
    Min      1Q  Median      3Q     Max 
-46.888 -18.834  -3.159   2.040  86.748 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       5263.159     17.655 298.110  < 2e-16 ***
sample1$Year        29.318      3.696   7.933 4.64e-05 ***
I(sample1$Year^2)  -10.589      1.323  -8.002 4.36e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 38.76 on 8 degrees of freedom
Multiple R-squared: 0.9407,     Adjusted R-squared: 0.9259 
F-statistic: 63.48 on 2 and 8 DF,  p-value: 1.235e-05

The output of summary(fit2b) is the same. We obtained the values of beta0 (5263.159), beta1 (29.318) and beta2 (-10.589), all three of which are significant. The equation of the 2nd degree polynomial of our model is:

$$f(x)=5263.159+29.318x-10.589x^2$$

If we want a polynomial of 3rd degree, we have:


summary(fit3)
## or summary(fit3b)

Call:
lm(formula = sample1$Population ~ sample1$Year + I(sample1$Year^2) + 
    I(sample1$Year^3))

Residuals:
    Min      1Q  Median      3Q     Max 
-32.774 -14.802  -1.253   3.199  72.634 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       5263.1585    15.0667 349.324 4.16e-16 ***
sample1$Year        14.3638     8.1282   1.767   0.1205    
I(sample1$Year^2)  -10.5886     1.1293  -9.376 3.27e-05 ***
I(sample1$Year^3)    0.8401     0.4209   1.996   0.0861 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 33.08 on 7 degrees of freedom
Multiple R-squared: 0.9622,     Adjusted R-squared: 0.946 
F-statistic: 59.44 on 3 and 7 DF,  p-value: 2.403e-05

The equation is:

$$f(x)=5263.1585+14.3638x-10.5886x^2+0.8401x^3$$

In this case, however, the coefficients beta1 and beta3 are not significant, so the best model is the 2nd degree polynomial. Furthermore, look at the Multiple R-squared: in the 2nd degree model it is 94.07%, while in the 3rd degree model it is 96.22%. The fit seems to have improved, but is the improvement significant? We can compare the two models with an ANOVA table:


anova(fit2, fit3)

Analysis of Variance Table

Model 1: sample1$Population ~ sample1$Year + I(sample1$Year^2)
Model 2: sample1$Population ~ sample1$Year + I(sample1$Year^2) + I(sample1$Year^3)
  Res.Df     RSS Df Sum of Sq      F Pr(>F)  
1      8 12019.8                             
2      7  7659.5  1    4360.3 3.9848 0.0861 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Since the p-value is greater than 0.05, we do not reject the null hypothesis: the 3rd degree term does not significantly improve the model.
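The F value in this table comes from the reduction of the residual sum of squares (RSS) between the two models. A minimal sketch of the same computation by hand, with variable names (rss2, rss3, f_stat, p_val) chosen here for illustration:

```r
Year       <- -5:5
Population <- c(4835, 4970, 5085, 5160, 5310, 5260, 5235, 5255, 5235, 5210, 5175)

fit2 <- lm(Population ~ Year + I(Year^2))
fit3 <- lm(Population ~ Year + I(Year^2) + I(Year^3))

rss2 <- sum(resid(fit2)^2)   # RSS of the 2nd degree model
rss3 <- sum(resid(fit3)^2)   # RSS of the 3rd degree model

# One extra parameter in fit3, so the numerator has 1 degree of freedom
f_stat <- ((rss2 - rss3) / 1) / (rss3 / df.residual(fit3))
p_val  <- pf(f_stat, 1, df.residual(fit3), lower.tail = FALSE)

c(F = f_stat, p = p_val)
```

With a single added term, this p-value coincides with the one of the t-test on the cubic coefficient in summary(fit3).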

The biggest problem now is to represent the result graphically. As far as I know, R has no ready-made function for plotting the fitted polynomial. We must therefore use workarounds that are valid, but somewhat laborious.

First, we plot the values with the command seen before. This time we display only the lines and not the points, for graphical convenience:


plot(sample1$Year, sample1$Population, type="l", lwd=3)

Now add to this chart the progress of the 2nd degree polynomial, in this way:


points(sample1$Year, predict(fit2), type="l", col="red", lwd=2)

The function predict() computes the Y values at the given X values. The coordinates are then joined by straight lines, so what is plotted is not the continuous function but a set of discrete points. With few values, this method gives a poor picture of the curve.

Let's add the graph of the polynomial of 3rd degree:


points(sample1$Year, predict(fit3), type="l", col="blue", lwd=2)

As you can see the two models have very similar trends.

If we want instead the graph of the continuous functions, we proceed as follows. First we define the polynomial function found previously:


pol2 <- function(x) fit2$coefficient[3]*x^2 + fit2$coefficient[2]*x + fit2$coefficient[1]

Remember that:
- coefficient[1] = beta0
- coefficient[2] = beta1
- coefficient[3] = beta2
and so on.

At this point we plot the coordinates of sample1 and then overlay the curve with the function curve. The argument add=TRUE draws the curve on the existing plot, over the range of its x axis:


plot(sample1$Year, sample1$Population, type="p", lwd=3)
pol2 <- function(x) fit2$coefficient[3]*x^2 + fit2$coefficient[2]*x + fit2$coefficient[1]
curve(pol2, add=TRUE, col="red", lwd=2)

A note: without add=TRUE the function curve starts a new plot, so the points disappear and the x range may no longer be that of the data. With the commands above, the continuous function and the experimental points appear on the same graph.

Now draw the graph of the polynomial of 3rd degree:


plot(sample1$Year, sample1$Population, type="p", lwd=3)
pol3 <- function(x) fit3$coefficient[4]*x^3 + fit3$coefficient[3]*x^2 + fit3$coefficient[2]*x + fit3$coefficient[1]
curve(pol3, add=TRUE, col="red", lwd=2)
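An alternative that avoids writing the polynomial by hand is to evaluate predict() on a fine grid of x values. This is only a sketch and it assumes that the model is refitted with the data argument (Population ~ Year, data = sample1) instead of sample1$Population ~ sample1$Year, because predict() needs the variable name Year to match the new data; the object grid is introduced here for illustration.

```r
sample1 <- data.frame(
  Year       = -5:5,
  Population = c(4835, 4970, 5085, 5160, 5310, 5260, 5235, 5255, 5235, 5210, 5175)
)

# Refit the 2nd degree model using the data argument
fit2 <- lm(Population ~ Year + I(Year^2), data = sample1)

# 101 equally spaced x values between -5 and 5
grid <- data.frame(Year = seq(-5, 5, length.out = 101))
grid$Population <- predict(fit2, newdata = grid)

# Points first, then the smooth fitted curve on top
plot(sample1$Year, sample1$Population, type = "p", lwd = 3)
lines(grid$Year, grid$Population, col = "red", lwd = 2)
```

The same grid can be reused for the 3rd degree model by adding I(Year^3) to the formula.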

Web-site trend analysis with data from Google Analytics

2009-08-25T20:44:00.004+02:00

This post is a summary of my two previous posts on trend analysis with the Cox-Stuart test and on simple linear regression. The goal is to assess the trend in the number of visits received by a site over a long period. I use Google Analytics, because this tool allows us to save the various reports in Excel CSV format. Let's see, step by step, how to save the report, how to import the data from Excel into R, and finally how to assess whether the number of daily visitors follows an increasing or decreasing trend.

Let's start by creating an ad hoc report in Google Analytics. Once you have logged in, select the date range that you want to analyze. Then click on Visits.

At this point we can save the report, clicking on Export and then clicking on CSV for Excel.

Save the CSV file, and open it with Excel. Here's how it looks:

Now import the data into R. Importing data from Excel into R is very simple. Select the column (or columns) of interest (in our case the column Visits) and copy it to the clipboard with CTRL + C (remember to select the header cell Visits too, because it will be useful):

Then open R and type the following command:


myvisit <- read.delim("clipboard")

myvisit

   Visits
1      33
2      41
3      34
4      45
5      46
6      37
7      31
8      37
9      34
10     34
11     48
12     39
13     33
...

It is a one-column dataframe; the name of the column is Visits (so it is important to select the header in Excel).
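Reading from "clipboard" in this way works on Windows; on other systems the same function can read the copied column from a file or from a character string. The sketch below uses the text argument with only the first five values of the listing above, to show the shape of the result; the name csv_text is introduced here for illustration.

```r
# First five values of the Visits column, with the header on the first line
csv_text <- "Visits\n33\n41\n34\n45\n46"

# read.delim() treats the first line as the column name by default
myvisit <- read.delim(text = csv_text)

str(myvisit)
```

With a saved file, the call would take the file path as its first argument instead of text.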

Now we can proceed with the trend analysis in the two proposed ways: with a Cox-Stuart test and with a simple linear regression.

The function to perform the Cox-Stuart test is available here. First we must convert the dataframe into a format that can be read by the function cox.stuart.test, like this:


visits <- c(myvisit$Visits)

In this way I have created a vector (visits) that contains all the data of the column Visits of the dataframe myvisit, in the same order. Now we perform the Cox-Stuart test:


cox.stuart.test(visits)

        Cox-Stuart test for trend analysis

data:  
Increasing trend, p-value = 0.0012

The output is very clear: we have detected an increasing trend of visits, highly significant (since p-value < 0.05).

If we are not satisfied or sure of this result, we can take into account the slope of the regression line. First we may want to plot the data. The vector visits contains the daily visits to the site. Now we create an ordered vector of the days in question, of the same length as the vector visits:


days <- c(1 : length(visits))

Create a plot:


plot(days, visits, type="b")

Choosing type="b" we see both dots and lines, as shown in the figure:

From this plot it is not easy to observe a possible trend in the visits. We can still do a regression analysis. From the sign of the slope of the line, we can estimate whether the trend is increasing or decreasing:


fit <- lm(visits ~ days)
summary(fit)

Call:
lm(formula = visits ~ days)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.714  -6.197  -1.313   5.648  31.153 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 31.79694    2.27151  13.998  < 2e-16 ***
days         0.19815    0.04242   4.671 1.04e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 10.81 on 90 degrees of freedom
Multiple R-squared: 0.1951,     Adjusted R-squared: 0.1862 
F-statistic: 21.82 on 1 and 90 DF,  p-value: 1.043e-05

The slope coefficient has a value of b = 0.19815. It has a positive sign, which suggests an increasing trend. The value of the t-test on the slope and its p-value indicate that the slope is significant. We can therefore say that there is an increasing trend.
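The t value and the p-value of the slope can be recomputed from the estimate, its standard error and the residual degrees of freedom printed in the summary. A small sketch using only those three numbers from the output above (the full series of visits is not reproduced here); the 95% confidence interval is an addition for illustration.

```r
b    <- 0.19815   # slope estimate (row "days")
se_b <- 0.04242   # its standard error
df   <- 90        # residual degrees of freedom

t_value <- b / se_b                      # t statistic for H0: slope = 0
p_value <- 2 * pt(-abs(t_value), df)     # two-sided p-value
ci      <- b + c(-1, 1) * qt(0.975, df) * se_b   # 95% confidence interval

t_value
p_value
ci
```

If the interval ci does not contain zero, the sign of the slope is significant at the 5% level.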

Finally, we can see the regression line directly on the plot previously obtained in this way:


plot(days, visits, type="b")
abline(fit, col="red", lwd=3)

The command abline adds the line of the fitted model directly to the current chart; the parameter col specifies the color and the parameter lwd the thickness of the line. Now observe the graph:

The graph shows an increasing trend, in agreement with the Cox-Stuart test.

Simple logistic regression on qualitative dichotomic variables

2009-08-20T12:43:00.009+02:00

In this post we will see briefly how to fit a logistic regression model when you have categorical (qualitative) variables organized in two-way contingency tables. In this model the dependent variable (Y) and the independent variable (X) are both dichotomous (Bernoulli variables).

In general, the probability that Y = 1 as a function of predictors is:

$$P(Y=1|X=x)=\pi(x)=\frac{exp(\beta_0+\beta_1x_1+\cdots +\beta_kx_k)}{1+exp(\beta_0+\beta_1x_1+\cdots +\beta_kx_k)}$$

Our goal is to estimate the values of the beta parameters (the regression coefficients).

We begin by examining a simple logistic regression model (with only one predictor).

Consider the following example. The table below shows the results of a study on gastroesophageal reflux. You want to evaluate how the presence of a stress factor can influence the onset of this disease.

First we import the values into R. We must create a two-way table; proceed as follows:


reflux <- matrix(c(251,131,4,33), nrow=2)
colnames(reflux) <- c("reflNO", "reflYES")
rownames(reflux) <- c("stressNO", "stressYES")
table <- as.table(reflux)

table

          reflNO reflYES
stressNO     251       4
stressYES    131      33

Now we arrange the data for the logistic regression. We must create a data frame:


dft <- as.data.frame(table)
dft

       Var1    Var2 Freq
1  stressNO  reflNO  251
2 stressYES  reflNO  131
3  stressNO reflYES    4
4 stressYES reflYES   33

We can now fit the model, and then perform the logistic regression in R:


fit <- glm(Var2 ~ Var1, weights = Freq, data = dft, family = binomial(logit))
summary(fit)


Call:
glm(formula = Var2 ~ Var1, family = binomial(logit), data = dft, 
    weights = Freq)

Deviance Residuals: 
     1       2       3       4  
-2.817  -7.672   5.765  10.287  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -4.1392     0.5040  -8.213  < 2e-16 ***
Var1stressYES   2.7605     0.5403   5.109 3.23e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 250.23  on 3  degrees of freedom
Residual deviance: 205.86  on 2  degrees of freedom
AIC: 209.86

Number of Fisher Scoring iterations: 6

First we comment on the code used to perform the regression. Logistic regression is requested by setting the family: family = binomial(logit). The code Var2 ~ Var1 means that we want a model that explains the variable Var2 (presence or absence of reflux) as a function of the variable Var1 (presence or absence of stressful events). In practice Var2 is the dependent variable Y, and Var1 is the independent variable X (the regressor). After the formula, we specify the weight of each row, given in the column Freq of the dataframe dft (so we write weights = Freq and data = dft to specify where the values are found).

The values of the parameters $\beta_0$ and $\beta_1$ are respectively the estimates labelled (Intercept) and Var1stressYES. We can then write our empirical model:

$$\pi(x)=\frac{exp(-4.139+2.760x)}{1+exp(-4.139+2.760x)}$$

The independent variable x can be zero or one. If it takes the value 0 (i.e. in the absence of stressful events), then the probability of having reflux is:

$$\pi(x=0)=\frac{exp(\beta_0)}{1+exp(\beta_0)}=0.016=1.6\%$$

If there are stressful events (x = 1), the probability of having reflux is:

$$\pi(x=1)=\frac{exp(\beta_0+\beta_1)}{1+exp(\beta_0+\beta_1)}=0.20=20\%$$
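The two probabilities can be checked quickly with plogis(), the logistic function exp(z)/(1+exp(z)) available in base R. The sketch uses the rounded estimates printed in the summary, so the results agree with the observed proportions only up to rounding.

```r
b0 <- -4.1392   # (Intercept)
b1 <-  2.7605   # Var1stressYES

pi0 <- plogis(b0)        # P(reflux | no stress)
pi1 <- plogis(b0 + b1)   # P(reflux | stress)

round(c(pi0 = pi0, pi1 = pi1), 3)

# Observed proportions in the contingency table, for comparison
c(4 / (251 + 4), 33 / (131 + 33))
```

Because the model has one parameter per row of the table, the fitted probabilities reproduce the observed proportions.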

The odds are:

$$odds(x=1)=\frac{\pi(1)}{1-\pi(1)}=exp(\beta_0+\beta_1)$$

$$odds(x=0)=\frac{\pi(0)}{1-\pi(0)}=exp(\beta_0)$$

We can finally calculate the odds ratio OR:

$$OR=\frac{odds(x=1)}{odds(x=0)}=15.807$$

The odds of developing gastroesophageal reflux for a person who has experienced a stressful event are 15.807 times those of a person who has not undergone stressful events.

The probabilities and the odds can be readily calculated in R recalling that:

fit$coefficient[1] = $\beta_0$ (intercept)
fit$coefficient[2] = $\beta_1$

Furthermore:

summary(fit)$coefficient[1,2] = standard error of $\beta_0$
summary(fit)$coefficient[2,2] = standard error of $\beta_1$

And so we have:


pi0 <- exp(fit$coefficient[1]) / (1 + exp(fit$coefficient[1]))
pi1 <- exp(fit$coefficient[1] + fit$coefficient[2]) / (1 + exp(fit$coefficient[1]+fit$coefficient[2]))

odds0 <- pi0 / (1 - pi0)
odds1 <- pi1 / (1 - pi1)

OR <- odds1 / odds0

#the same result with:
OR <- exp(fit$coefficient[2])

#the confidence interval for OR is:
ORmin <- exp( fit$coefficient[2] - qnorm(.975) * summary(fit)$coefficient[2,2] )

ORmax <- exp( fit$coefficient[2] + qnorm(.975) * summary(fit)$coefficient[2,2] )

We can obtain the same result for the odds ratio, using the simplified formula:

$$OR=\frac{ad}{bc}=\frac{251\cdot33}{4\cdot131}=15.807$$

that in R is:


OR <- (table[1,1]*table[2,2]) / (table[1,2]*table[2,1])
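As a cross-check, the same model can be fitted from the two rows of the table, giving glm() the counts of cases and non-cases with cbind(). This is a sketch with names (counts, fit_counts, or_hat, or_ci) introduced here for illustration; confint.default() returns Wald intervals, i.e. the estimate plus or minus qnorm(.975) times the standard error.

```r
# One row per stress level: reflux cases and non-cases
counts <- data.frame(
  stress     = factor(c("no", "yes")),
  reflux_yes = c(4, 33),
  reflux_no  = c(251, 131)
)

fit_counts <- glm(cbind(reflux_yes, reflux_no) ~ stress,
                  family = binomial, data = counts)

# Odds ratio and its 95% Wald confidence interval
or_hat <- unname(exp(coef(fit_counts)[2]))
or_ci  <- exp(confint.default(fit_counts)[2, ])

or_hat
or_ci
```

The estimate or_hat should match the value ad/bc computed from the table.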

The acronym AIC stands for Akaike's information criterion. Taken alone, this value says nothing about the quality of the model just created. It is useful for comparing this model with other candidate models (the model with the lowest AIC is the better one).

Trend Analysis with the Cox-Stuart test in R

2009-08-08T09:59:00.004+02:00

The Cox-Stuart test is described as a test of low power (power equal to 0.78), but very robust for trend analysis. It is therefore applicable to a wide variety of situations, to get an idea of the evolution of the values obtained. The method is based on the binomial distribution. In R there is no function to perform the Cox-Stuart test, so we now go through the logical steps on which the test is based and finally write the function ourselves.

You want to assess whether there is an increasing or decreasing trend of the number of daily customers of a restaurant. We have the number of customers in 15 days:

Customers: 5, 9, 12, 18, 17, 16, 19, 20, 4, 3, 18, 16, 17, 15, 14

To perform the test of Cox-Stuart, the number of observations must be even. In our case we have 15 observations. Delete, therefore, the observation at position (N+1)/2 (here the observation with value = 20):


customers = c(5, 9, 12, 18, 17, 16, 19, 20, 4, 3, 18, 16, 17, 15, 14)

length(customers)
[1] 15

cust_even = customers[ -(length(customers)+1)/2 ]
length(cust_even)
[1] 14

Now we have 14 observations, and we can then proceed. Divide the observations into two vectors, the first containing the first half of the measures, and the second the second half:


fHalf = cust_even[1:7]
sHalf = cust_even[8:14]

fHalf
[1]  5  9 12 18 17 16 19

sHalf
[1]  4  3 18 16 17 15 14

Now subtract, value by value, the content of the two vectors:


difference = fHalf - sHalf

difference
[1]  1  6 -6  2  0  1  5

Now consider only the signs of the contents of the vector difference


signs = sign(difference)

signs
[1]  1  1 -1  1  0  1  1

A difference has value 0 and therefore also in the vector with the signs there is a value equal to 0. This must be eliminated:


signs = signs[ signs != 0 ]

signs
[1]  1  1 -1  1  1  1

We obtained six differences, and then six signs. Now we have to count the number of positive-signs and the number of negative-signs:


pos = signs[signs > 0]
neg = signs[signs < 0]

length(pos)
[1] 5

length(neg)
[1] 1

Now we take the smaller of the two counts. In this case it is the number of negative signs (1). We compute the probability of obtaining at most x = 1 successes in N = 6 trials, each of which yields success with probability p = 0.5 (binomial distribution):


pbinom(1, 6, 0.5)
[1] 0.109375

The value so calculated is higher than 0.05 (we choose a significance level of 0.05). Therefore there is no significant trend (it would have been a decreasing trend, since the negative signs are the fewer).
If the value had been less than 0.05, we would have accepted the hypothesis of a significant trend.
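The same probability can be obtained with the exact binomial test of base R, which makes the connection with the sign test explicit. A minimal sketch; the one-sided alternative "less" corresponds to the lower tail computed by pbinom.

```r
# 1 negative sign out of 6 non-zero differences, H0: P(negative) = 0.5
p_value <- binom.test(1, 6, p = 0.5, alternative = "less")$p.value
p_value

# Same value as the cumulative binomial probability
pbinom(1, 6, 0.5)
```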

Now try to fit a regression model, and observe the p-value of the slope: the coefficient b is not significant.


customers = c(5, 9, 12, 18, 17, 16, 19, 20, 4, 3, 18, 16, 17, 15, 14)
days <- c(1:length(customers))
model <- lm(customers ~ days)
summary(model)

Call:
lm(formula = customers ~ days)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.090  -2.173   1.352   3.967   6.467 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  11.3048     3.1104   3.634  0.00303 **
days          0.2786     0.3421   0.814  0.43014   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 5.724 on 13 degrees of freedom

Here is the code to perform a Cox-Stuart test, written by me.


cox.stuart.test =
function (x)
{
  method = "Cox-Stuart test for trend analysis"
  leng = length(x)
  apross = round(leng) %% 2
  if (apross == 1) {
    delete = (length(x)+1)/2
    x = x[ -delete ] 
  }
  half = length(x)/2
  x1 = x[1:half]
  x2 = x[(half+1):(length(x))]
  difference = x1-x2
  signs = sign(difference)
  signcorr = signs[signs != 0]
  pos = signs[signs>0]
  neg = signs[signs<0]
  if (length(pos) < length(neg)) {
    prop = pbinom(length(pos), length(signcorr), 0.5)
    names(prop) = "Increasing trend, p-value"
    rval <- list(method = method, statistic = prop)
    class(rval) = "htest"
    return(rval)
  }
  else {
    prop = pbinom(length(neg), length(signcorr), 0.5)
    names(prop) = "Decreasing trend, p-value"
    rval <- list(method = method, statistic = prop)
    class(rval) = "htest"
    return(rval)
  }
}

We can now use the function just created:


customers = c(5, 9, 12, 18, 17, 16, 19, 20, 4, 3, 18, 16, 17, 15, 14)
cox.stuart.test(customers)

        Cox-Stuart test for trend analysis

data:  
Decreasing trend, p-value = 0.1094
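For use in scripts, it can be handy to have the direction and the p-value as plain list elements rather than as a printed "htest" object. The following compact variant is only a sketch of the same steps (the name cox_stuart_sketch is hypothetical); like the function above, it reports a decreasing trend when the two counts are equal.

```r
cox_stuart_sketch <- function(x) {
  n <- length(x)
  if (n %% 2 == 1) x <- x[-((n + 1) / 2)]    # odd length: drop the middle value
  half <- length(x) / 2
  d <- x[1:half] - x[(half + 1):length(x)]   # first half minus second half
  d <- d[d != 0]                             # discard zero differences
  n_pos <- sum(d > 0)
  n_neg <- sum(d < 0)
  list(
    trend   = if (n_pos < n_neg) "increasing" else "decreasing",
    p.value = pbinom(min(n_pos, n_neg), length(d), 0.5)
  )
}

customers <- c(5, 9, 12, 18, 17, 16, 19, 20, 4, 3, 18, 16, 17, 15, 14)
res_customers <- cox_stuart_sketch(customers)
res_customers$trend     # "decreasing"
res_customers$p.value   # 0.109375

# A strictly increasing series of 10 values
res_up <- cox_stuart_sketch(1:10)
res_up$trend            # "increasing"
res_up$p.value          # 0.03125
```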

Two-way analysis of variance: two-way ANOVA in R

2009-08-07T20:28:00.002+02:00

One-way analysis of variance is a useful technique to verify whether the means of several groups are equal. But this analysis may not be sufficient for more complex problems. For example, it may be necessary to take into account two factors of variability, to determine whether the group means depend on the group classification ("zone") or on a second variable ("block"). In this case the two-way analysis of variance (two-way ANOVA) should be used.

We begin immediately with an example so as to facilitate the understanding of this statistical method. The data collected are organized into double entry tables.

The director of a company has collected the revenue (in thousands of dollars) for 5 years, broken down by month. You want to see whether the revenue depends on the year and/or the month, or whether it is independent of these two factors.

Conceptually, the problem could be solved with a horizontal ANOVA and a vertical ANOVA, to verify whether the average revenues per year are equal, and whether the average revenues per month are equal. This would require many calculations, so we prefer the two-way ANOVA, which provides the result in a single step.
This is the table of revenue classified by year and month:

$$\begin{tabular}{|c||ccccc||r|}\hline Months & Year 1 & Year 2 & Year 3 & Year 4 & Year 5\\\hline January&15&18&22&23&24\\ February&22&25&15&15&14\\ March&18&22&15&19&21\\ April&23&15&14&17&18\\ May&23&15&26&18&14\\ June&12&15&11&10&8\\ July&26&12&23&15&18\\ August&19&17&15&20&10\\ September&15&14&18&19&20\\ October&14&18&10&12&23\\ November&14&22&19&17&11\\ December&21&23&11&18&14\\ \hline \end{tabular}$$

As with the one-way ANOVA, the aim here is to set up a Fisher's F-test to assess the significance of the variable "month" and of the variable "year", and so determine whether the revenues depend on one (or both) of the classification criteria.
How do we perform the two-way ANOVA in R? First create a vector containing all the tabulated values, transcribed by rows:


revenue = c(15,18,22,23,24, 22,25,15,15,14, 18,22,15,19,21, 
         23,15,14,17,18, 23,15,26,18,14, 12,15,11,10,8, 26,12,23,15,18, 
         19,17,15,20,10, 15,14,18,19,20, 14,18,10,12,23, 14,22,19,17,11, 
         21,23,11,18,14)

For the months, create a factor with 12 levels (the number of rows), each repeated 5 times (the number of columns), in this manner:


months = gl(12,5)

For the years, create a factor with 5 levels (the number of columns), each repeated once, declaring the total number of observations (the length of the vector revenue):


years = gl(5, 1, length(revenue))
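The two calls to gl() are easier to follow on a smaller layout. The sketch below uses a hypothetical table of 3 rows and 2 columns, transcribed by rows as was done for revenue; the names rows_demo and cols_demo are illustrative.

```r
# 3 rows x 2 columns, values stored row by row (6 observations)
rows_demo <- gl(3, 2)      # each row label repeated 2 times: 1 1 2 2 3 3
cols_demo <- gl(2, 1, 6)   # column labels cycling up to length 6: 1 2 1 2 1 2

data.frame(rows_demo, cols_demo)
```

In the same way, gl(12,5) labels the month of each value and gl(5, 1, 60) labels its year.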

Now you can fit the linear model and produce the ANOVA table:


fit = aov(revenue ~ months + years)

anova(fit)

Analysis of Variance Table

Response: revenue
          Df Sum Sq Mean Sq F value Pr(>F)
months    11 308.45   28.04  1.4998 0.1660
years      4  44.17   11.04  0.5906 0.6712
Residuals 44 822.63   18.70

Now interpret the results.
The F statistic for the difference between months is F = 1.4998. This value is lower than the tabulated critical value, and indeed p-value > 0.05. So we do not reject the null hypothesis: there is no evidence that the mean revenue differs between months, i.e. that the variable "months" has an effect on revenue.

The F statistic for the difference between years is F = 0.5906. This value is lower than the tabulated critical value, and indeed p-value > 0.05. So we do not reject the null hypothesis: there is no evidence that the mean revenue differs between years, i.e. that the variable "years" has an effect on revenue.
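The tabulated values mentioned here can be obtained in R with qf(), and the p-values of the table with pf(). A short sketch using the F values and the degrees of freedom printed in the ANOVA table above, at a significance level of 0.05:

```r
# Critical values of F at the 5% level
f_crit_months <- qf(0.95, df1 = 11, df2 = 44)
f_crit_years  <- qf(0.95, df1 = 4,  df2 = 44)

# Upper-tail probabilities of the observed F values
p_months <- pf(1.4998, 11, 44, lower.tail = FALSE)
p_years  <- pf(0.5906, 4,  44, lower.tail = FALSE)

c(f_crit_months = f_crit_months, f_crit_years = f_crit_years)
c(p_months = p_months, p_years = p_years)
```

Both observed F values fall below their critical values, which is the same statement as both p-values being greater than 0.05.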

Simple linear regression

2009-08-06T07:30:00.003+02:00

We use regression analysis when, from the sample data, we want to derive a statistical model that predicts the values of a variable (Y, dependent) from the values of another variable (X, independent). The linear relationship, which is the simplest and most frequent relationship between two quantitative variables, can be positive (when X increases, Y increases too) or negative (when X increases, Y decreases): this is indicated by the sign of the coefficient b.

To build the line that describes the distribution of points, we might refer to different principles. The most common is the least squares method (or Model I), and this is the method used by the statistical software R.

Suppose you want to obtain a linear relationship between weight (kg) and height (cm) of 10 subjects.

Height: 175, 168, 170, 171, 169, 165, 165, 160, 180, 186
Weight: 80, 68, 72, 75, 70, 65, 62, 60, 85, 90

The first problem is to decide which is the dependent variable Y and which is the independent variable X. In general, the independent variable is not affected by error during the measurement (or is affected only by random error), while the dependent variable is affected by error. In our case we can assume that weight is the independent variable (X), and height the dependent variable (Y).
So our problem is to find a linear relationship (formula) that allows us to calculate the height, given the weight of an individual. The simplest formula is that of a straight line of type Y = a + bX. The simple regression line in R is calculated as follows:


height = c(175, 168, 170, 171, 169, 165, 165, 160, 180, 186)
weight = c(80, 68, 72, 75, 70, 65, 62, 60, 85, 90)
 
model = lm(formula = height ~ weight, x=TRUE, y=TRUE)
model

Call:
lm(formula = height ~ weight, x = TRUE, y = TRUE)

Coefficients:
(Intercept)       weight  
   115.2002       0.7662

The correct syntax of the formula stated in lm is: Y ~ X, then you declare first the dependent variable, and after the independent variable (or variables).
The output of the function is represented by two parameters a and b: a=115.2002 (intercept), b=0.7662 (the slope).
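The two parameters can also be computed directly from the least squares formulas: the slope is the covariance of X and Y divided by the variance of X, and the intercept follows from the means. A minimal sketch on the same data, with a and b named as in the text:

```r
height <- c(175, 168, 170, 171, 169, 165, 165, 160, 180, 186)
weight <- c(80, 68, 72, 75, 70, 65, 62, 60, 85, 90)

b <- cov(weight, height) / var(weight)   # slope
a <- mean(height) - b * mean(weight)     # intercept

c(a = a, b = b)

# Same values as returned by lm()
coef(lm(height ~ weight))
```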

The simple calculation of the line is not enough. We must assess the significance of the line, i.e. whether the slope b differs significantly from zero. This may be done with a Student's t-test or with a Fisher's F-test.
In R both can be retrieved very quickly, with the function summary(). Here's how:


model <- lm(height ~ weight)
summary(model)

Call:
lm(formula = height ~ weight)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6622 -0.9683 -0.1622  0.5679  2.2979 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 115.20021    3.48450   33.06 7.64e-10 ***
weight        0.76616    0.04754   16.12 2.21e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1.405 on 8 degrees of freedom
Multiple R-squared: 0.9701,     Adjusted R-squared: 0.9664 
F-statistic: 259.7 on 1 and 8 DF,  p-value: 2.206e-07

Here too we find the values of the parameters a and b.
The Student's t-test on the slope has the value 16.12; the Student's t-test on the intercept has the value 33.06; the value of the Fisher's F-test is 259.7 (the same value would be obtained by performing an ANOVA on the same data: anova(model)). The p-values of the t-tests and of the F-test are less than 0.05, so the model we found is significant.
The Multiple R-squared is the coefficient of determination. It measures the proportion of the variability of Y that is explained by the model. In this case, 97.01% of the variability of height is explained by our model.
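In simple linear regression the coefficient of determination equals the square of the Pearson correlation between X and Y, and the F statistic equals the square of the t statistic of the slope. A short sketch that checks both relations on the same data; the variable names are chosen here for illustration.

```r
height <- c(175, 168, 170, 171, 169, 165, 165, 160, 180, 186)
weight <- c(80, 68, 72, 75, 70, 65, 62, 60, 85, 90)

model <- lm(height ~ weight)
s     <- summary(model)

r2_model <- s$r.squared               # Multiple R-squared from the summary
r2_cor   <- cor(weight, height)^2     # squared Pearson correlation

t_slope <- s$coefficients["weight", "t value"]
f_stat  <- unname(s$fstatistic["value"])

c(r2_model = r2_model, r2_cor = r2_cor)
c(t_squared = t_slope^2, F = f_stat)
```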

We can plot on a graph the data points and the regression line, in this way:


plot(weight, height)
abline(model)
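
Once the line is fitted, it can also be used to estimate the dependent variable at new values of the independent one. A short sketch with predict(), again on the built-in women dataset as a stand-in (the new weights are hypothetical values chosen for illustration):

```r
# Stand-in data: built-in 'women' dataset, not the data of this post
fit <- lm(height ~ weight, data = women)

# Hypothetical new weights at which we want the estimated height
new_weights <- data.frame(weight = c(120, 140, 160))

# Fitted value with the 95% confidence interval of the mean response
pred <- predict(fit, newdata = new_weights, interval = "confidence", level = 0.95)
pred
```

The columns fit, lwr and upr hold the point on the regression line and the lower and upper limits of the interval.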

Contingency table and the study of the correlation between qualitative variables: Pearson's Chi-squared test

2009-08-05T07:30:00.001+02:00

If you have qualitative variables, you can test whether they are associated by studying an R by C contingency table with Pearson's Chi-squared test.

A casino wants to study the relationship between the type of game and the number of winners by age group, to see whether the number of winners in each age group depends on the game played. It has the following data (number of winners per 100 players, for each game and age group):

$$\begin{tabular}{c|ccc}&Age\\\hline Game&20-30&31-40&41-50\\ \hline Roulette&44&56&55\\ Black-jack& 66& 88& 23\\Poker& 15& 29& 45 \end{tabular}$$

In R, we must first build a matrix with the data collected:


table <- matrix(c(44,56,55, 66,88,23, 15,29,45), nrow=3, byrow=TRUE)

Now we can run the chi-squared test:


chisq.test(table)

        Pearson's Chi-squared test

data:  table 
X-squared = 46.0767, df = 4, p-value = 2.374e-09

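We reject the null hypothesis H0 of independence in favor of the alternative hypothesis (p-value < 0.05): the number of winners in each age group depends on the type of game. Note that the small p-value tells us the association is statistically significant, not how strong it is.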
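
To see where the statistic comes from, the expected counts under independence can be rebuilt by hand and compared with the observed ones. A minimal sketch (the object name winners and the dimnames are chosen here for readability):

```r
# Same counts as in the table above
winners <- matrix(c(44, 56, 55,
                    66, 88, 23,
                    15, 29, 45),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Game = c("Roulette", "Black-jack", "Poker"),
                                  Age  = c("20-30", "31-40", "41-50")))

res <- chisq.test(winners)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(winners), colSums(winners)) / sum(winners)

# Pearson's statistic: sum over the cells of (observed - expected)^2 / expected
x2 <- sum((winners - expected)^2 / expected)
df <- (nrow(winners) - 1) * (ncol(winners) - 1)
p  <- pchisq(x2, df = df, lower.tail = FALSE)

# Pearson residuals: the cells that contribute most to the statistic
round(res$residuals, 2)
```

The residuals matrix is usually the quickest way to see which game and age-group combinations drive the rejection of H0.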

Non-parametric methods for the study of the correlation: Spearman's rank correlation coefficient and Kendall tau rank correlation coefficient

2009-08-04T07:30:00.001+02:00

We saw in the previous post how to study the correlation between variables that follow a Gaussian distribution with the Pearson product-moment correlation coefficient. If it is not possible to assume that the values follow Gaussian distributions, we have two non-parametric methods: the Spearman's rho test and the Kendall's tau test.

For example, you want to study the productivity of various types of machinery and the satisfaction of the operators who use them (expressed as a score from 1 to 10). These are the values:

Productivity: 5, 7, 9, 9, 8, 6, 4, 8, 7, 7
Satisfaction: 6, 7, 4, 4, 8, 7, 3, 9, 5, 8

We begin with the Spearman's rank correlation coefficient:


a <- c(5, 7, 9, 9, 8, 6, 4, 8, 7, 7)
b <- c(6, 7, 4, 4, 8, 7, 3, 9, 5, 8)

cor.test(a, b, method="spearman")

        Spearman's rank correlation rho

data:  a and b 
S = 145.9805, p-value = 0.7512
alternative hypothesis: true rho is not equal to 0 
sample estimates:
      rho 
0.1152698

The test gives rho = 0.115, which indicates a weak (non-parametric) correlation between the two sets of values.
The p-value is greater than 0.05, so we cannot reject the null hypothesis that rho is equal to 0: the correlation is not statistically significant.
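
Spearman's rho is simply the Pearson coefficient computed on the ranks of the data, which the following sketch checks on the same values:

```r
a <- c(5, 7, 9, 9, 8, 6, 4, 8, 7, 7)
b <- c(6, 7, 4, 4, 8, 7, 3, 9, 5, 8)

# rank() assigns the average rank to tied values
rho_by_ranks <- cor(rank(a), rank(b))
rho_direct   <- cor(a, b, method = "spearman")
all.equal(rho_by_ranks, rho_direct)

# With ties an exact p-value cannot be computed;
# exact = FALSE asks for the approximation explicitly
res <- cor.test(a, b, method = "spearman", exact = FALSE)
res$p.value
```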

Now we check the same data with the Kendall tau rank correlation coefficient:


a <- c(5, 7, 9, 9, 8, 6, 4, 8, 7, 7)
b <- c(6, 7, 4, 4, 8, 7, 3, 9, 5, 8)
 
cor.test(a, b, method="kendall")

        Kendall's rank correlation tau

data:  a and b 
z = 0.5555, p-value = 0.5786
alternative hypothesis: true tau is not equal to 0 
sample estimates:
     tau 
0.146385

With the Kendall test, too, the correlation is very weak (tau = 0.146) and not statistically significant (p-value > 0.05).
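
Kendall's tau is based on counting concordant and discordant pairs of observations. Since these data contain ties, R reports the tau-b variant; a sketch that rebuilds it by hand (variable names are chosen for this example):

```r
a <- c(5, 7, 9, 9, 8, 6, 4, 8, 7, 7)
b <- c(6, 7, 4, 4, 8, 7, 3, 9, 5, 8)

# All pairs of observations (i, j) with i < j
pairs  <- combn(length(a), 2)
sign_a <- sign(a[pairs[1, ]] - a[pairs[2, ]])
sign_b <- sign(b[pairs[1, ]] - b[pairs[2, ]])

concordant <- sum(sign_a * sign_b > 0)  # pairs ordered the same way in a and b
discordant <- sum(sign_a * sign_b < 0)  # pairs ordered in opposite ways

# tau-b: the denominator counts only the pairs not tied in a and not tied in b
tau_b <- (concordant - discordant) / sqrt(sum(sign_a != 0) * sum(sign_b != 0))

all.equal(tau_b, cor(a, b, method = "kendall"))
```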

Parametric method for the study of the correlation: the Pearson r-test

2009-08-03T09:33:00.003+02:00

Suppose you want to study whether there is a correlation between 2 sets of data. To do this we compute the Pearson product-moment correlation coefficient, which is a measure of the correlation (linear dependence) between two variables X and Y; then we compute the value of a t-test to study the significance of the Pearson coefficient R. We can use this test when the data follow a Gaussian distribution.

A new test to measure IQ is administered to 10 volunteers. You want to see if there is a correlation between the new experimental test and the classical test, in order to replace the old test with the new one. These are the values:

Old test: 15, 21, 25, 26, 30, 30, 22, 29, 19, 16
New test: 55, 56, 89, 67, 84, 89, 99, 62, 83, 88

The software R has a single function, easily recalled, which gives us directly the value of the Pearson coefficient and the t-statistical test for checking the significance of the coefficient:


a = c(15, 21, 25, 26, 30, 30, 22, 29, 19, 16)
b = c(55, 56, 89, 67, 84, 89, 99, 62, 83, 88)

cor.test(a, b)

        Pearson's product-moment correlation

data:  a and b 
t = 0.4772, df = 8, p-value = 0.646
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
 -0.5174766  0.7205107 
sample estimates:
     cor 
0.166349

The value of the Pearson coefficient is 0.166: it is a very low value, which indicates a poor correlation between the variables.
Furthermore, the p-value is greater than 0.05, so we cannot reject the null hypothesis that the true correlation is equal to 0: the Pearson coefficient is not statistically significant.
So we can say that there is no evidence of a linear correlation between the results of the two tests.
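
The t statistic reported by cor.test can be derived from the coefficient r and the sample size n as t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom. A minimal sketch on the same data (the vector names are chosen for this example):

```r
old_test <- c(15, 21, 25, 26, 30, 30, 22, 29, 19, 16)
new_test <- c(55, 56, 89, 67, 84, 89, 99, 62, 83, 88)

r <- cor(old_test, new_test)   # Pearson coefficient
n <- length(old_test)

# t statistic for H0: true correlation equal to 0
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the Student's t distribution with n - 2 df
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

res <- cor.test(old_test, new_test)
all.equal(t_stat, unname(res$statistic))
all.equal(p_value, res$p.value)
```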

Kruskal-Wallis one-way analysis of variance

2009-07-31T10:46:00.001+02:00

If you need to compare several groups but cannot run an ANOVA because the groups do not follow a normal distribution, you can use the Kruskal-Wallis test, which does not require the assumption that the groups follow a Gaussian distribution.
This test extends the Wilcoxon rank-sum test for 2 samples to more than two groups.

Suppose you want to see whether the following 4 sets of values differ significantly in location:

Group A: 1, 5, 8, 17, 16
Group B: 2, 16, 5, 7, 4
Group C: 1, 1, 3, 7, 9
Group D: 2, 15, 2, 9, 7

To use the test of Kruskal-Wallis simply enter the data, and then organize them into a list:


a = c(1, 5, 8, 17, 16)
b = c(2, 16, 5, 7, 4)
c = c(1, 1, 3, 7, 9)
d = c(2, 15, 2, 9, 7)

dati = list(g1=a, g2=b, g3=c, g4=d)

Now we can apply the kruskal.test() function:


kruskal.test(dati)

        Kruskal-Wallis rank sum test

data:  dati 
Kruskal-Wallis chi-squared = 1.9217, df = 3, p-value = 0.5888

The value of the test statistic is 1.9217. This value already includes the correction for ties (repeated values). The p-value is greater than 0.05; moreover, the value of the test statistic is lower than the tabulated chi-squared value:


qchisq(0.950, 3)
[1] 7.814728

The conclusion is therefore that we cannot reject the null hypothesis H0: there is no evidence that the 4 groups differ.
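
To make the correction for ties explicit, the statistic can be rebuilt from the ranks of the pooled data. A sketch under the usual textbook formula (the names x, g, H and H_corrected are chosen for this example):

```r
x <- c(1, 5, 8, 17, 16,    # Group A
       2, 16, 5, 7, 4,     # Group B
       1, 1, 3, 7, 9,      # Group C
       2, 15, 2, 9, 7)     # Group D
g <- factor(rep(c("A", "B", "C", "D"), each = 5))

r <- rank(x)        # ranks of the pooled data, ties get the average rank
N <- length(x)

# Uncorrected statistic: based on the sum of ranks of each group
rank_sums   <- tapply(r, g, sum)
group_sizes <- tapply(r, g, length)
H <- 12 / (N * (N + 1)) * sum(rank_sums^2 / group_sizes) - 3 * (N + 1)

# Correction for ties: t is the number of occurrences of each distinct value
t_counts    <- table(x)
H_corrected <- H / (1 - sum(t_counts^3 - t_counts) / (N^3 - N))

res <- kruskal.test(x, g)
all.equal(H_corrected, unname(res$statistic))
```

Passing a data vector and a grouping factor, as done here, gives the same result as passing the list of groups.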

Analysis of variance: ANOVA, for multiple comparisons

2009-07-30T09:32:00.002+02:00

The ANOVA model can be used to compare the means of several groups with each other, using a parametric method (assuming that the groups follow a Gaussian distribution).
We proceed with the following example:

The manager of a supermarket chain wants to see whether the consumption in kilowatts of 4 stores is the same. He collects data at the end of each month for 6 months. The results are:

Store A: 65, 48, 66, 75, 70, 55
Store B: 64, 44, 70, 70, 68, 59
Store C: 60, 50, 65, 69, 69, 57
Store D: 62, 46, 68, 72, 67, 56

To proceed with the ANOVA, we must first verify the homoskedasticity (i.e. test for homogeneity of variances). R provides two tests: the Bartlett test and the Fligner-Killeen test.

We begin with the Bartlett test.

First we create the 4 vectors:


a = c(65, 48, 66, 75, 70, 55)
b = c(64, 44, 70, 70, 68, 59)
c = c(60, 50, 65, 69, 69, 57)
d = c(62, 46, 68, 72, 67, 56)

Now we combine the 4 vectors in a single vector:


dati = c(a, b, c, d)

Now we create a factor with 4 levels, which labels the group of each value stored in this vector:


groups = factor(rep(letters[1:4], each = 6))

We can observe the contents of the vector groups simply by typing groups + [enter].

At this point we start the Bartlett test:


bartlett.test(dati, groups)

        Bartlett test of homogeneity of variances

data:  dati and groups 
Bartlett's K-squared = 0.4822, df = 3, p-value = 0.9228

The function gave us the value of the test statistic (K-squared) and the p-value. Since p-value > 0.05, there is no evidence against the homogeneity of the variances. Alternatively, we can compare the Bartlett's K-squared with the tabulated chi-squared value; we compute that value by passing 1 - alpha and the degrees of freedom to the qchisq function:


qchisq(0.950, 3)
[1] 7.814728

Tabulated chi-squared > Bartlett's K-squared: we do not reject the null hypothesis H0 (homogeneity of variances).

We now check the homoskedasticity with the Fligner-Killeen test.
The syntax is quite similar, so we proceed quickly.


a = c(65, 48, 66, 75, 70, 55)
b = c(64, 44, 70, 70, 68, 59)
c = c(60, 50, 65, 69, 69, 57)
d = c(62, 46, 68, 72, 67, 56)

dati = c(a, b, c, d)

groups = factor(rep(letters[1:4], each = 6))

fligner.test(dati, groups)

        Fligner-Killeen test of homogeneity of variances

data:  dati and groups 
Fligner-Killeen:med chi-squared = 0.1316, df = 3, p-value = 0.9878

The conclusions are similar to those for the test of Bartlett.
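
Both tests also accept a formula, which is convenient when the data are kept in a data frame. A sketch with the same values (the data frame consumption and its column names kw and store are chosen for this example):

```r
consumption <- data.frame(
  kw = c(65, 48, 66, 75, 70, 55,    # Store A
         64, 44, 70, 70, 68, 59,    # Store B
         60, 50, 65, 69, 69, 57,    # Store C
         62, 46, 68, 72, 67, 56),   # Store D
  store = factor(rep(c("A", "B", "C", "D"), each = 6))
)

# Sample variance of each store, the quantities being compared
group_vars <- tapply(consumption$kw, consumption$store, var)
group_vars

# Same tests as above, written with the formula interface
bt <- bartlett.test(kw ~ store, data = consumption)
fk <- fligner.test(kw ~ store, data = consumption)
c(bartlett = bt$p.value, fligner = fk$p.value)
```

Looking at the four variances first gives a feel for what the tests are judging before reading the p-values.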

Having verified the homoskedasticity of the 4 groups, we can proceed with the ANOVA model.

First we fit the model:


fit = lm(formula = dati ~ groups)

Then we analyze the ANOVA model:


anova (fit)

Analysis of Variance Table

Response: dati
          Df  Sum Sq Mean Sq F value Pr(>F)
groups     3    8.46    2.82  0.0327 0.9918
Residuals 20 1726.50   86.33

The output of the function is a classical ANOVA table with the following data:
Df = degrees of freedom
Sum Sq = deviance (between groups, and residual)
Mean Sq = variance (between groups, and residual)
F value = the value of the Fisher test statistic, computed as (variance between groups) / (residual variance)
Pr(>F) = p-value

Since p-value > 0.05, we do not reject the null hypothesis H0: there is no evidence that the four means differ. We can also compare the computed F-value with the tabulated F-value (the degrees of freedom of the numerator, 3, come first, followed by those of the denominator, 20):

qf(0.950, 3, 20)
[1] 3.098391

Tabulated F-value > computed F-value: we do not reject the null hypothesis.
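
The entries of the ANOVA table can be rebuilt by hand from the deviance between groups and the residual deviance. A minimal sketch on the same data (the names kw and store are chosen for this example); note that in pf and qf the degrees of freedom of the numerator (groups) come before those of the denominator (residuals):

```r
kw <- c(65, 48, 66, 75, 70, 55,    # Store A
        64, 44, 70, 70, 68, 59,    # Store B
        60, 50, 65, 69, 69, 57,    # Store C
        62, 46, 68, 72, 67, 56)    # Store D
store <- factor(rep(c("A", "B", "C", "D"), each = 6))

k <- nlevels(store)   # number of groups
n <- length(kw)       # total number of observations

group_means <- tapply(kw, store, mean)
group_sizes <- tapply(kw, store, length)

# Deviance between groups and residual (within groups) deviance
ss_between <- sum(group_sizes * (group_means - mean(kw))^2)
ss_within  <- sum((kw - ave(kw, store))^2)   # ave() returns each value's group mean

# F = (variance between groups) / (residual variance)
f_value <- (ss_between / (k - 1)) / (ss_within / (n - k))
p_value <- pf(f_value, k - 1, n - k, lower.tail = FALSE)
f_crit  <- qf(0.95, k - 1, n - k)   # tabulated F at alpha = 0.05

tab <- anova(lm(kw ~ store))
all.equal(f_value, tab[["F value"]][1])
all.equal(p_value, tab[["Pr(>F)"]][1])
f_value < f_crit
```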