On 30 June 2011 at 15:36, Simon Urbanek wrote:
>
> On Jun 30, 2011, at 7:28 AM, Vincent Aubanel wrote:
>
>> Thanks for this, it's now blazingly fast, as one could reasonably expect.
>> Simon's solution is astonishingly fast; however, I had to reconstruct the
>> factors and their levels, which were (expectedly) lost during the c()
>> operation.
>
>
> One way to avoid that is to call as.character() on the factors inside the parallel
> function, so the pieces don't contain factors. You can create the factor at the
> end, and it should be faster, because factor() calls as.character() anyway, so
> that conversion will be a no-op by that point.
It is faster, thanks! Slightly faster for the parallel loop (because of the removal
of the now-unnecessary as.character() operations), and the total time of converting
back into factors is down to about 3 s. I thought that keeping data as factors was
somewhat more economical and faster than keeping it as characters...
Vincent
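A minimal sketch of the approach Simon describes (the worker function, column
names, and toy data here are illustrative, not from the original code; mclapply
is taken from the parallel package, which superseded multicore):

```r
library(parallel)  # provides mclapply on current R (Unix-alikes)

# Illustrative worker: builds a chunk whose factor column is converted
# to character before being returned, so c() across chunks stays cheap.
mcfun <- function(i) {
  d <- data.frame(x = i * (1:3), f = factor(c("A", "B", "A")))
  d$f <- as.character(d$f)  # drop the factor inside the worker
  d
}

chunks <- mclapply(1:2, mcfun, mc.cores = 2)

# Concatenate the column across chunks, then build the factor once at
# the end; factor() calls as.character() internally, which is now a no-op.
f_all <- factor(do.call(c, lapply(chunks, `[[`, "f")))
```

With two chunks of three rows each, f_all ends up as a length-6 factor with
levels "A" and "B", reconstructed once instead of being carried through the
parallel step.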
>
> Cheers,
> S
>
>
>> Unfortunately this eats up a fair amount of CPU, but on a 14-column,
>> ~2-million-row data frame it is still 2x faster than the elegant one-line
>> solution.
>>
>> Some figures of performance:
>>
>>> t <- proc.time()
>>> dl <- mclapply(lsessions, mcfun, mc.cores=cores)
>>> print(proc.time()-t)
>>   user  system elapsed
>> 171.894 47.696 28.713
>>
>>> l <- dl
>>> all = lapply(seq.int(l[[1]]), function(i) do.call(c, lapply(l, function(x)
>>> x[[i]])))
>>> names(all) = names(l[[1]])
>>> #attr(all, "row.names") = seq.int(all[[1]])
>>> attr(all, "row.names") = c(NA, -length(all[[1]]))
>>> class(all) = "data.frame"
>>   user  system elapsed
>> 0.412 0.280 0.708
>>
>>> all$factor <- factor(all$factor); levels(all$factor) <- c("A","B")
>> ...
>>   user  system elapsed
>> 4.852 2.349 7.038
>>
>>> my_df = do.call(rbind, dl)
>>   user  system elapsed
>> 9.791 5.411 15.039
>>
>> Thanks to both of you!
>>
>> Vincent
>>
>>
>> On 29 June 2011 at 21:48, Simon Urbanek wrote:
>>
>>>
>>> On Jun 29, 2011, at 2:59 PM, Mike Lawrence wrote:
>>>
>>>> Is the slowdown happening while mclapply runs or while you're doing
>>>> the rbind? If the latter, I wonder if the code below is more efficient
>>>> than using rbind inside a loop:
>>>>
>>>> my_df = do.call( rbind , my_list_from_mclapply )
>>>>
>>>
>>> Another potential issue is that data frames do many sanity checks, due to
>>> row.names handling etc. If you don't use row.names *and* know in
>>> advance that the concatenation is benign *and* your data types are
>>> compatible, you can usually speed things up immensely by operating on lists
>>> instead and converting to a data frame at the very end, by declaring that
>>> the resulting list conforms to the data.frame class. Again, this only works
>>> if you really know what you're doing, but the speed-up can be very big
>>> (usually orders of magnitude). This is general advice, not particular to
>>> rbind. Whether it would work for you or not is easy to test - something like
>>>
>>> l = my_list_from_mclapply
>>> all = lapply(seq.int(l[[1]]), function(i) do.call(c, lapply(l, function(x)
>>> x[[i]])))
>>> names(all) = names(l[[1]])
>>> attr(all, "row.names") = c(NA, -length(all[[1]]))
>>> class(all) = "data.frame"
>>>
>>> Again, make sure all the assumptions above are satisfied before using.
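A self-contained toy illustration of the trick above, with all the stated
assumptions satisfied (no factors, automatic row names, matching column types);
the list, column names, and values are made up for the example:

```r
# Stand-in for the mclapply result: a list of conforming data frames.
l <- list(data.frame(a = 1:3, b = c(1.5, 2.5, 3.5)),
          data.frame(a = 4:5, b = c(4.5, 5.5)))

# Concatenate column by column across the list elements.
all <- lapply(seq_along(l[[1]]), function(i) do.call(c, lapply(l, `[[`, i)))
names(all) <- names(l[[1]])

# c(NA, -n) is R's compact "automatic row names" representation,
# which skips the usual row-name bookkeeping.
attr(all, "row.names") <- c(NA_integer_, -length(all[[1]]))
class(all) <- "data.frame"
```

Under these assumptions the result has the same columns and values as
do.call(rbind, l), just built without rbind's per-merge checks.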
>>>
>>> Cheers,
>>> Simon
>>>
>>>
>>>
>>>>
>>>>
>>>> On Wed, Jun 29, 2011 at 3:34 PM, Vincent Aubanel <[email protected]>
>>>> wrote:
>>>>> Hi all,
>>>>>
>>>>> I'm using mclapply() of the multicore package for processing chunks of
>>>>> data in parallel --and it works great.
>>>>>
>>>>> But when I want to collect all processed elements of the returned list
>>>>> into one big data frame it takes ages.
>>>>>
>>>>> The elements are all data frames with identical column names, and I'm
>>>>> using a simple rbind() inside a loop to do that. But I guess it performs
>>>>> some expensive checks at each iteration, since it gets slower
>>>>> and slower as it goes. Writing individual files out to disk,
>>>>> concatenating them with the system, and reading the resulting file back
>>>>> from disk is actually faster...
>>>>>
>>>>> Is there a magic argument to rbind() that I'm missing, or is there any
>>>>> other solution to collect the results of parallel processing efficiently?
>>>>>
>>>>> Thanks,
>>>>> Vincent
>>>>>
>>>>> _______________________________________________
>>>>> R-SIG-Mac mailing list
>>>>> [email protected]
>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mac
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>