Martin Batholdy | 25 Dec 17:34 2012

aggregate / collapse big data frame efficiently

Hi,

I need to aggregate the rows of a data.frame by computing, for each column, the mean over all rows that share the same level of one factor variable;

here is the sample code:

x <- data.frame(rep(letters,2), rnorm(52), rnorm(52), rnorm(52))

aggregate(x, list(x[,1]), mean)

Now my problem is that the actual data set is much bigger (120 rows and approximately 100,000 columns), and it takes very, very long (at some point I just stopped it).

Is there anything that can be done to make the aggregate routine more efficient?
Or is there a different approach that would work faster?

Thanks for any suggestions!

Jeff Newmiller | 25 Dec 18:00 2012

Re: aggregate / collapse big data frame efficiently

You might consider using the sqldf package.
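A minimal sketch of the sqldf route on the poster's toy example (this assumes the sqldf package is installed; named columns are added here so the SQL can refer to them):

```r
library(sqldf)

# Toy data with named columns so SQL can reference them
x <- data.frame(A = rep(letters, 2), B = rnorm(52), C = rnorm(52), D = rnorm(52))

# Group-wise means via SQL; sqldf runs the query against the data frame
res <- sqldf("SELECT A, avg(B) AS B, avg(C) AS C, avg(D) AS D FROM x GROUP BY A")
```

For data as wide as ~100,000 columns the SELECT list would have to be built programmatically (e.g. with paste()), so this approach fits better when the table is long rather than wide.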
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil <at> dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.


Patrick Burns | 25 Dec 19:12 2012

Re: aggregate / collapse big data frame efficiently

I'd suggest the 'data.table' package.  That is
one of the prime uses it was created for.

Pat
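For reference, a minimal data.table sketch of the toy example (the `.SD` idiom applies the function to every non-grouping column, so it scales to many columns without naming them):

```r
library(data.table)

x <- data.table(A = rep(letters, 2), B = rnorm(52), C = rnorm(52), D = rnorm(52))

# Mean of every remaining column within each level of A
res <- x[, lapply(.SD, mean), by = A]
```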


jim holtman | 25 Dec 19:20 2012

Re: aggregate / collapse big data frame efficiently

Judging from the way you have used 'aggregate', you are taking the
column means.  A couple of suggestions for faster processing:

1. use matrices instead of data.frames (I converted your example just
before using it);
2. use 'colMeans'.

I created a 120 x 100000 matrix with 10 levels and it does the
computation in less than 2 seconds:

>  n <- 100000
> nLevels <- 10
> nRows <- 120
> Cols <- list(rep(list(sample(nRows)), n))
> df <- data.frame(levels = sample(nLevels, nRows, TRUE), Cols)
> colnames(df)[-1] <- paste0('col', 1:n)
>
> # convert to matrix for faster processing
> df.m <- as.matrix(df[, -1])  # remove levels column
> str(df.m)
 int [1:120, 1:100000] 111 13 106 61 16 39 25 94 53 38 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:100000] "col1" "col2" "col3" "col4" ...
> system.time({
+ # split the indices of rows for each level
+ x <- split(seq(nrow(df)), df$levels)
+ result <- sapply(x, function(a) colMeans(df.m[a, ]))
+ })
   user  system elapsed
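Jim's split()/colMeans() approach can also be expressed with base rowsum(), which computes group-wise column sums in compiled code; dividing by the group counts then gives the means. A sketch on a smaller matrix (not Jim's exact code):

```r
set.seed(1)
m   <- matrix(rnorm(120 * 1000), nrow = 120)   # 120 rows, 1000 columns
grp <- sample(letters[1:10], 120, replace = TRUE)

# Column sums per group (rowsum sorts groups alphabetically by default),
# then divide each group's row of sums by that group's size
sums   <- rowsum(m, grp)
counts <- as.vector(table(grp))                # same alphabetical order
means  <- sums / counts
```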

arun | 25 Dec 21:22 2012

Re: aggregate / collapse big data frame efficiently

Hi,

Jim's method was found to be faster than data.table()

n <- 10000
nLevels <- 10
nRows <- 120
Cols <- list(rep(list(sample(nRows)), n))
df <- data.frame(levels = sample(nLevels, nRows, TRUE), Cols)
colnames(df)[-1] <- paste0('col', 1:n)
# convert to matrix for faster processing
df.m <- as.matrix(df[, -1])  # remove levels column
system.time({
# split the indices of rows for each level
x <- split(seq(nrow(df)), df$levels)
result <- sapply(x, function(a) colMeans(df.m[a, ]))
})
#  user  system elapsed
# 0.056   0.000   0.056

library(data.table)
df.dt <- data.table(df)
setkey(df.dt, levels)
system.time({ result1 <- df.dt[, lapply(.SD, mean), by = levels] })
#  user  system elapsed
# 7.756   0.000   7.771
system.time({ result2 <- df.dt[, list(Mean = colMeans(.SD)), by = levels] })
#  user  system elapsed
# 2.188   0.000   2.193
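When the data are this wide (few rows, very many columns), another option worth timing is to reshape to long form first, so data.table aggregates a single value column instead of looping over thousands of them. A sketch on freshly built toy data (not the df above):

```r
library(data.table)

set.seed(1)
dt <- data.table(levels = sample(10, 120, replace = TRUE),
                 matrix(rnorm(120 * 1000), nrow = 120))

long <- melt(dt, id.vars = "levels")            # one row per (row, column) cell
res  <- long[, list(Mean = mean(value)), by = list(levels, variable)]
```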


Wensui Liu | 26 Dec 04:32 2012

Re: aggregate / collapse big data frame efficiently

aggregate() is not efficient; try by().
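A minimal sketch of the by() route on the toy example (it returns a list of per-level results rather than a data.frame; whether it actually beats aggregate() here would need timing):

```r
x <- data.frame(A = rep(letters, 2), B = rnorm(52), C = rnorm(52), D = rnorm(52))

# colMeans of the numeric columns within each level of A
res <- by(x[, -1], x$A, colMeans)
```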


arun | 25 Dec 18:08 2012

Re: aggregate / collapse big data frame efficiently

Hi,
You could use library(data.table) 
library(data.table)

x <- data.frame(A = rep(letters, 2), B = rnorm(52), C = rnorm(52), D = rnorm(52))
res <- with(x, aggregate(cbind(B, C, D), by = list(A), mean))
colnames(res)[1] <- "A"

x1 <- data.table(x)
res2 <- x1[, list(B = mean(B), C = mean(C), D = mean(D)), by = A]
identical(res, data.frame(res2))
#[1] TRUE

Just for comparison:

set.seed(25)
xnew <- data.frame(A = rep(letters, 1500), B = rnorm(39000), C = rnorm(39000), D = rnorm(39000))
system.time(resnew <- with(xnew, aggregate(cbind(B, C, D), by = list(A), mean)))
#  user  system elapsed
# 0.152   0.000   0.152

xnew1 <- data.table(xnew)
system.time(resnew1 <- xnew1[, list(B = mean(B), C = mean(C), D = mean(D)), by = A])
#  user  system elapsed
# 0.004   0.000   0.005

A.K.


