Hi Jeff,
Thanks for the prompt. It looks like bpiterate or bpstream was intended
but didn't quite make it into BiocParallel. I'll check with Martin to
see if I'm missing other history or past discussions and then add it in.
Ryan had some ideas for parallel streaming that we discussed at Bioc2014,
so this is timely. Both concepts can be revisited and implemented in some form.
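Just to give a sense of what I have in mind, something along these lines could
cover Jeff's use case (the bpiterate name and signature below are hypothetical;
nothing is implemented yet, and the file name is a placeholder):

    library(BiocParallel)
    library(ShortRead)

    fqstream <- FastqStreamer("input.fastq.gz", n=1e6)
    iter <- function() {
        fq <- yield(fqstream)
        if (length(fq) == 0) NULL else fq    # NULL signals the end of the stream
    }
    bpiterate(iter, processChunk, BPPARAM=MulticoreParam(4))
    close(fqstream)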
Greg,
Just wanted to confirm that it's ok with you if we put an adaptation of
sclapply in BiocParallel?
Valerie
On 08/06/2014 07:16 AM, Johnston, Jeffrey wrote:
Hi,
I have been using FastqStreamer() and yield() to process a large fastq file in
chunks, modifying both the read and name and then appending the output to a new
fastq file as each chunk is processed. This works well but would benefit
greatly from being parallelized.
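For reference, the serial version looks roughly like this (the file names are
placeholders):

    library(ShortRead)

    fqstream <- FastqStreamer("input.fastq.gz", n=1e6)
    repeat {
        fq <- yield(fqstream)                        # next chunk of reads
        if (length(fq) == 0) break                   # stream exhausted
        # modify the reads and names here
        writeFastq(fq, "output.fastq.gz", mode="a")  # append chunk to the output file
    }
    close(fqstream)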
As far as I can tell, this problem isn’t easily solved with the existing
parallel tools because you can’t determine how many jobs you’ll need in advance
(you just call yield() until it stops returning reads).
After some digging, I found the sclapply() function in the HTSeqGenie package
by Gregoire Pau, which he describes as a “multicore dispatcher”:
https://stat.ethz.ch/pipermail/bioc-devel/2013-October/004754.html
I wasn’t able to get the package to install from source due to some
dependencies (there are no binaries for Mac), but I did extract the function
and adapt it slightly for my use case. Here’s an example:
library(ShortRead)

processChunk <- function(fq_chunk) {
    # manipulate fastq reads here
}

yieldHelper <- function() {
    fq <- yield(fqstream)
    if (length(fq) == 0) return(NULL)
    fq
}

fqstream <- FastqStreamer("…", n=1e6)
sclapply(yieldHelper, processChunk, max.parallel.jobs=4)
close(fqstream)
Based on the discussion linked above, it seems like there was some interest in
integrating this idea into BiocParallel. I would find that very useful as it
improves performance quite a bit and can likely be applied to numerous
stream-based processing tasks.
I will point out that in my case above, the processChunk() function doesn’t
return anything. Instead it appends the modified fastq records to a new file. I
have to use the Unix lockfile command to ensure that only one child process
appends to the output file at a time. I am not certain if there is a more
elegant solution to this (perhaps a queue that is emptied by a dedicated writer
process?).
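For illustration, the guarded append looks roughly like this (the output and
lock file paths are placeholders, and it assumes the procmail lockfile utility
is on the PATH):

    processChunk <- function(fq_chunk) {
        # modify the reads and names here
        system("lockfile /tmp/out.fastq.lock")            # blocks until the lock is acquired
        writeFastq(fq_chunk, "out.fastq.gz", mode="a")    # append this chunk to the output
        unlink("/tmp/out.fastq.lock")                     # release the lock
        invisible(NULL)                                   # nothing is returned
    }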
Thanks,
Jeff
--
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109
Email: voben...@fhcrc.org
Phone: (206) 667-3158
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel