On 10/02/2013 11:58 AM, Gregoire Pau wrote:
Hello Ryan,

You may be interested in the function sclapply(...) located in the
HTSeqGenie package. sclapply is a multicore dispatcher that accepts 3 main
arguments (inext, fun, max.parallel.jobs). The data produced by the
function inext, executed in the main thread, is dispatched to fun(),
executed in a child thread. A built-in scheduler controls the maximum
number of threads.

In HTSeqGenie, inext(...) is typically an iterator that reads chunks of FastQ
reads, which are passed to a function that processes them (for
counting, QC, alignment...) in a child thread. sclapply(...) enables
multicore processing of iterator flows and offers performance gains almost
proportional to the number of cores. Moreover, the function is robust and
has extra arguments to handle exceptions and periodic tracing (e.g.
to check memory usage).
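
Roughly, a minimal usage sketch looks something like the following (the
assumption here is that inext() signals exhaustion by returning NULL, and
the toy fun() is just for illustration):

library(HTSeqGenie)
library(ShortRead)

## iterator, executed in the main thread: returns the next chunk of
## reads, or NULL when the FastQ file is exhausted
streamer <- FastqStreamer("reads.fastq.gz", n = 1e6)
inext <- function() {
    chunk <- yield(streamer)
    if (length(chunk) == 0) NULL else chunk
}

## worker function, executed in a child thread on each chunk
fun <- function(chunk) {
    ## e.g. QC, filtering, counting...
    alphabetFrequency(sread(chunk), collapse = TRUE)
}

res <- sclapply(inext, fun, max.parallel.jobs = 4)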

Hope this can help,

I'd like to incorporate these ideas (distilling Ryan's and Greg's) into BiocParallel, as bpiterate or maybe bpstream (though I think 'stream' in the literature carries a notion of indeterminate length, which isn't quite accurate here). Let me know (on or off list) if that's not ok.
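
To pin down the semantics I have in mind, here is a serial reference sketch (the name and signature below are provisional, not settled API):

## ITER: a zero-argument function returning the next chunk of input,
## or NULL when the input is exhausted. FUN: applied to each chunk
## (in the real thing, on a worker chosen by the back-end).
## Serial reference implementation, just to fix the semantics:
bpiterate_serial <- function(ITER, FUN, ...) {
    res <- list()
    repeat {
        chunk <- ITER()
        if (is.null(chunk))
            break
        res[[length(res) + 1L]] <- FUN(chunk, ...)
    }
    res
}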

It would be interesting to see support for other back-ends.

And to come up with a consistent error handling model, incorporating the work Michel recently completed as part of GSoC (not yet in BiocParallel):

  https://github.com/Bioconductor/BiocParallel/pull/19

Martin


Cheers,

Greg


On Mon, Sep 30, 2013 at 5:00 PM, Ryan <r...@thompsonclan.org> wrote:

Hi all,

I have previously written an Rscript that reads, filters, and writes large
fastq files, using FastqStreamer for the input. Through some complicated tricks, I
was able to get the input to happen in parallel with the processing and
output (using parallel::mcparallel and friends). In other words, while my
script was processing and writing out the nth block of reads, another
process was reading the (n+1)th block of reads at the same time. This
almost doubled the speed of my script (the server had sufficient I/O
bandwidth to parallelize reads and writes to disk). Since then, I've been
wanting to generalize this pattern, and I have just now made a working
proof of concept. It is a wrapper for FastqStreamer that runs in a separate
process and uses parallel:::sendMaster to send each block to the main
script, and then calls yield on the FastqStreamer to preload the next block
while the script is processing the current one. You can view and download
the script here:

https://gist.github.com/DarwinAwardWinner/6771922
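
For a rough idea without opening the gist, the overlap can also be sketched
with only the exported parallel::mcparallel()/mccollect() API; note the roles
are inverted compared to my script (here the main process reads while a forked
child processes each block), and process() is just a placeholder:

library(parallel)    # mcparallel/mccollect: fork-based, Unix-alikes only
library(ShortRead)

## placeholder for the real per-block work (filtering, writing, ...)
process <- function(chunk) sum(width(chunk))

fqs <- FastqStreamer("reads.fastq.gz", n = 1e6)
job <- NULL
results <- list()
repeat {
    chunk <- yield(fqs)                       # read the next block from disk
    if (!is.null(job))
        results <- c(results, mccollect(job)) # reap the previous block's job
    if (length(chunk) == 0)
        break
    ## process this block in a forked child while the main process loops
    ## back to read the next block
    job <- mcparallel(process(chunk))
}
close(fqs)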

I have strategically placed print statements in the code in order to
demonstrate that preloading is happening. For example, I get the following
when I run the script on my machine:

CHILD: Preloaded 1 yields.
CHILD: Sent 1 yields.
CHILD: Preloaded 2 yields.
CHILD: Sent 2 yields.
MAIN: Received 1 yields.
MAIN: Processing reads
CHILD: Preloaded 3 yields.
MAIN: Processed 1 yields.
CHILD: Sent 3 yields.
MAIN: Received 2 yields.
MAIN: Processing reads
CHILD: Preloaded 4 yields.
MAIN: Processed 2 yields.
CHILD: Sent 4 yields.
MAIN: Received 3 yields.
MAIN: Processing reads
CHILD: Preloaded 5 yields.
MAIN: Processed 3 yields.
CHILD: Sent 5 yields.
MAIN: Received 4 yields.
MAIN: Processing reads
CHILD: Preloaded 6 yields.
MAIN: Processed 4 yields.
CHILD: Sent 6 yields.
MAIN: Received 5 yields.
MAIN: Processing reads
MAIN: Processed 5 yields.
MAIN: Received 6 yields.
MAIN: Processing reads
MAIN: Processed 6 yields.

In the script, the child is reading the fastq file, and the main process
is doing the "calculation" (which is just a sleep). As you can see, the
child is always a step or two ahead of the main script, so that whenever
the main script asks for the next yield, it gets it immediately instead of
waiting for the child to read from the disk.

So, is this kind of feature appropriate for inclusion in Bioconductor?

-Ryan Thompson




--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
