The reshape function


The other day I wrote about the R functions by, apply and friends, which allow me to operate on subsets of data. All those functions work nicely if the data is given in the right format. More often than not it isn't, and I have to reshape the data beforehand. Thus, it is time to discuss the reshape function. I will focus on the reshape function in base R, not the package of the same name.

I use Fisher's iris data set again, as it is readily available after starting R. The iris data set has 150 observations, and the first 6 rows look like this:

head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

I would like to create a box-and-whisker plot showing the measurements of the observations for each of the species, as in the chart below.


I know that if I had all measurements in one column and the dimension in another column, I could produce a graph like this in one line with lattice.

library(lattice)
bwplot(Measurement ~ Species | Dimension, data=reshaped.iris)

Hence the reshape function is what I need. From the help file I learn that I want to transform my data from a wide format into a long format (direction="long"). In the long format I would like a variable with the measurements (v.names="Measurement"), which I get by running through the first four columns (varying=1:4). I know which measurement I am reading by looking at the column names (times=names(iris)[1:4]), and I capture the dimension names in a new variable (timevar="Dimension"). Finally, idvar="Measure ID" names a new column that records which original row each measurement came from. This gives me the following statement:
reshaped.iris <- reshape(iris, varying=1:4, v.names="Measurement", 
                         timevar="Dimension", times=names(iris)[1:4], 
                         idvar="Measure ID", direction="long") 

head(reshaped.iris)
               Species    Dimension Measurement Measure ID
1.Sepal.Length  setosa Sepal.Length         5.1          1
2.Sepal.Length  setosa Sepal.Length         4.9          2
3.Sepal.Length  setosa Sepal.Length         4.7          3
4.Sepal.Length  setosa Sepal.Length         4.6          4
5.Sepal.Length  setosa Sepal.Length         5.0          5
6.Sepal.Length  setosa Sepal.Length         5.4          6
That's it; I can now create the lattice box-and-whisker plot.
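
A quick sanity check on the long format is cheap. The sketch below rebuilds reshaped.iris with the same call as above and confirms that nothing was lost or reordered in the process. The names n.rows, dim.counts and petal.width.ok are only illustrative and not part of the original analysis.

```r
# Rebuild the long data frame with the call used in the post.
reshaped.iris <- reshape(iris, varying = 1:4, v.names = "Measurement",
                         timevar = "Dimension", times = names(iris)[1:4],
                         idvar = "Measure ID", direction = "long")

# 150 flowers x 4 measurement columns should give 600 rows.
n.rows <- nrow(reshaped.iris)

# Each dimension should appear exactly once per flower.
dim.counts <- table(reshaped.iris$Dimension)

# The stacked values for one dimension should match the original column.
petal.width.ok <- all.equal(
  reshaped.iris$Measurement[reshaped.iris$Dimension == "Petal.Width"],
  iris$Petal.Width)

print(n.rows)
print(dim.counts)
print(petal.width.ok)
```

If any of these checks fail, the varying or times arguments are the first place I would look.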

In my next example I would like the measurements of length and width in separate columns and capture the flower part in a new variable, so I can create scatterplots of length against width. Tweaking the reshape statement slightly gives me:

reshaped.iris.sp <- reshape(iris, varying=list(c(1,3),c(2,4)),
                            v.names=c("Length", "Width"), 
                            timevar="Part", times=c("Sepal", "Petal"),
                            idvar="Measure ID", direction="long")

head(reshaped.iris.sp)
        Species  Part Length Width Measure ID
1.Sepal  setosa Sepal    5.1   3.5          1
2.Sepal  setosa Sepal    4.9   3.0          2
3.Sepal  setosa Sepal    4.7   3.2          3
4.Sepal  setosa Sepal    4.6   3.1          4
5.Sepal  setosa Sepal    5.0   3.6          5
6.Sepal  setosa Sepal    5.4   3.9          6
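
The pairing of varying, v.names and times in this call is easy to get wrong, so it may help to translate the column positions into names first. This is only a small sketch; varying.idx and varying.names are illustrative helper names. Each list element collects the columns that end up in one new variable, and within each element the order has to match the order of times (Sepal first, then Petal).

```r
# Column positions passed as 'varying' in the reshape call.
varying.idx <- list(c(1, 3), c(2, 4))

# Translate positions into column names to see what each group collects.
varying.names <- lapply(varying.idx, function(i) names(iris)[i])

# Label the groups with the v.names used in the reshape call.
names(varying.names) <- c("Length", "Width")

print(varying.names)
```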

xyplot(Length ~  Width | Species, groups=Part, 
       data=reshaped.iris.sp, auto.key=list(space="right"))

Let's swap Part and Species.
xyplot(Length ~  Width | Part, groups=Species, 
       data=reshaped.iris.sp, auto.key=list(space="right"))

I think the charts illustrate quite nicely why the iris data set has become a typical test case for many classification techniques in machine learning.
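
The long format also pays off beyond plotting, since functions that operate on subsets can now group by Part as well as Species. As a rough sketch, assuming the same reshape call as above, aggregate computes the mean length and width for every combination of the two. The name part.means is only illustrative.

```r
# Rebuild the long data frame with one row per flower and part.
reshaped.iris.sp <- reshape(iris, varying = list(c(1, 3), c(2, 4)),
                            v.names = c("Length", "Width"),
                            timevar = "Part", times = c("Sepal", "Petal"),
                            idvar = "Measure ID", direction = "long")

# Mean length and width for each Species / Part combination.
part.means <- aggregate(cbind(Length, Width) ~ Species + Part,
                        data = reshaped.iris.sp, FUN = mean)

print(part.means)
```

This gives six rows, one per species and flower part, which would have needed four separate column summaries in the wide format.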

1 comment:

Steve said...

Cool post. This is helpful. Your blog is great. In keeping with R's "there are many ways of doing something" philosophy, I approached the first problem by using the melt command as follows. (It comes from the add-on reshape package, of course.)

library(reshape)  # melt() with the variable_name argument comes from the reshape package
my.melt = melt(iris,id.var="Species",variable_name="Dimension")
bwplot(value ~ Species | Dimension, data=my.melt,layout=c(4,1))

In the second case I did it using basic data frame manipulation because that is the frame of mind I've been in recently. Using reshape or melt is probably more elegant and general, though I would also like to point out that knowing how to "sling around" data frames can be a very useful skill. This could be consolidated even more, but probably at the expense of readability.

df1 = cbind(iris[c(1:2,5)],Part = unlist(strsplit(names(iris)[1],".",fixed=TRUE))[1])
df2 = cbind(iris[c(3:4,5)],Part = unlist(strsplit(names(iris)[3],".",fixed=TRUE))[1])

names(df1)[1:2]=c("Length","Width"); names(df2)[1:2]=c("Length","Width")
xyplot(Length ~ Width|Species, groups=Part,data=rbind(df1,df2), auto.key=list(space="right"),layout=c(3,1))
