[Bioc-devel] Proof-of-concept parallel preloading FastqStreamer

Ryan Mon, 30 Sep 2013 17:01:45 -0700

Hi all,

I have previously written an R script to read, filter, and write large fastq files, using FastqStreamer for the reading. Through some complicated tricks, I was able to get the input to happen in parallel with the processing and output (using parallel::mcparallel and friends). In other words, while my script was processing and writing out the nth block of reads, another process was reading the (n+1)th block of reads at the same time. This almost doubled the speed of my script (the server had sufficient I/O bandwidth to parallelize reads and writes to disk). Since then, I've been wanting to generalize this pattern, and I have just now made a working proof of concept. It is a wrapper for FastqStreamer that runs in a separate process and uses parallel:::sendMaster to send each block to the main script, and then calls yield on the FastqStreamer to preload the next block while the script is processing the current one. You can view and download the script here:


https://gist.github.com/DarwinAwardWinner/6771922
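The overlap of reading and processing described above can be sketched roughly as follows. This is not the code from the gist, and it does not use FastqStreamer or parallel:::sendMaster; it only illustrates the general pattern with parallel::mcparallel and parallel::mccollect. The functions read_block and process_block are hypothetical stand-ins that sleep to simulate disk reads and computation.

```r
library(parallel)  # mcparallel/mccollect rely on fork, so Unix-alikes only

# Hypothetical stand-in for reading the nth block of reads; a real script
# would read from the fastq file here.
read_block <- function(n) {
    Sys.sleep(0.2)            # simulate disk latency
    seq_len(5) + 10 * n       # fake "reads"
}

# Hypothetical stand-in for the filtering and writing step.
process_block <- function(block) {
    Sys.sleep(0.2)            # simulate computation and output
    sum(block)
}

n_blocks <- 4
results <- numeric(n_blocks)

# Start reading the first block in a forked child process.
job <- mcparallel(read_block(1))
for (n in seq_len(n_blocks)) {
    # Wait for block n, which the child has been reading in the background.
    block <- mccollect(job)[[1]]
    # Start reading block n+1 before processing block n,
    # so that the read and the processing overlap.
    if (n < n_blocks) job <- mcparallel(read_block(n + 1))
    results[n] <- process_block(block)
}
print(results)
```

In this toy version each block can be read independently by index, which is what makes it simple. A streaming reader keeps its position inside one process, which is presumably why the real wrapper keeps a single long-lived child and sends blocks back to the main script instead.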

I have strategically placed print statements in the code in order to demonstrate that preloading is happening. For example, I get the following when I run the script on my machine:


CHILD: Preloaded 1 yields.
CHILD: Sent 1 yields.
CHILD: Preloaded 2 yields.
CHILD: Sent 2 yields.
MAIN: Received 1 yields.
MAIN: Processing reads
CHILD: Preloaded 3 yields.
MAIN: Processed 1 yields.
CHILD: Sent 3 yields.
MAIN: Received 2 yields.
MAIN: Processing reads
CHILD: Preloaded 4 yields.
MAIN: Processed 2 yields.
CHILD: Sent 4 yields.
MAIN: Received 3 yields.
MAIN: Processing reads
CHILD: Preloaded 5 yields.
MAIN: Processed 3 yields.
CHILD: Sent 5 yields.
MAIN: Received 4 yields.
MAIN: Processing reads
CHILD: Preloaded 6 yields.
MAIN: Processed 4 yields.
CHILD: Sent 6 yields.
MAIN: Received 5 yields.
MAIN: Processing reads
MAIN: Processed 5 yields.
MAIN: Received 6 yields.
MAIN: Processing reads
MAIN: Processed 6 yields.

In the script, the child is reading the fastq file, and the main process is doing the "calculation" (which is just a sleep). As you can see, the child is always a step or two ahead of the main script, so that whenever the main script asks for the next yield, it gets it immediately instead of waiting for the child to read from the disk.


So, is this kind of feature appropriate for inclusion in Bioconductor?

-Ryan Thompson

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
