Is it possible to scan a large file into a character vector in parallel, in 1M-record chunks, using scan() with the "doMC" package? Furthermore, can I specify the task for each child?
i.e., I'm working on a Linux box with 8 cores and would like to scan in 8M records at a time (all 8 cores scanning 1M records each) from a file with 40M records total:

    library(foreach)
    library(iterators)   # icount()
    library(doMC)
    registerDoMC(8)      # use all 8 cores

    file <- file("data.txt", "r")
    child <- foreach(i = icount(40)) %dopar% {
      scan(file, what = "character", sep = "\n",
           skip = (i - 1) * 1e6, nlines = 1e6)
    }

Thus, each child would have a different skip argument: child[[1]]: skip = 0; child[[2]]: skip = 1e6; child[[3]]: skip = 2e6; ...; child[[40]]: skip = 39e6. I would then end up with a list of 40 vectors, with child[[1]] containing records 1 to 1000000, child[[2]] containing records 1000001 to 2000000, ..., and child[[40]] containing records 39000001 to 40000000.

Also, would one file connection suffice, or does there need to be a file connection that is opened and closed for each child?
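For concreteness, here is a sketch of the one-connection-per-child variant I have in mind (untested; it assumes the same data.txt file and 1M-record chunking as above):

    library(foreach)
    library(iterators)
    library(doMC)
    registerDoMC(8)

    chunk <- 1e6  # records per child
    child <- foreach(i = icount(40)) %dopar% {
      con <- file("data.txt", "r")           # private connection per child
      x <- scan(con, what = "character", sep = "\n",
                skip = (i - 1) * chunk,      # skip the preceding chunks
                nlines = chunk, quiet = TRUE)
      close(con)
      x
    }

One caveat I'm aware of with either approach: skip still has to read past every preceding line (there is no random access by line number), so later children re-read most of the file; splitting the file beforehand (e.g. with the Unix split command) would avoid that repeated work.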