Hi, I have been using FastqStreamer() and yield() to process a large fastq file in chunks, modifying both the read and its name and then appending the output to a new fastq file as each chunk is processed. This works great, but would benefit greatly from being parallelized.
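For context, the serial version of what I'm doing looks roughly like this (the file names are just placeholders for my actual files):

    library(ShortRead)

    fqstream <- FastqStreamer("input.fastq.gz", n = 1e6)
    repeat {
        fq <- yield(fqstream)            # next chunk of up to 1e6 reads
        if (length(fq) == 0) break       # stream exhausted
        ## ... modify the reads and their names here ...
        writeFastq(fq, "output.fastq.gz", mode = "a")
    }
    close(fqstream)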
As far as I can tell, this problem isn't easily solved with the existing parallel tools because you can't determine how many jobs you'll need in advance (you just call yield() until it stops returning reads). After some digging, I found the sclapply() function in the HTSeqGenie package by Gregoire Pau, which he describes as a multicore dispatcher:

https://stat.ethz.ch/pipermail/bioc-devel/2013-October/004754.html

I wasn't able to get the package to install from source due to some dependencies (there are no binaries for Mac), but I did extract the function and adapt it slightly for my use case. Here's an example:

    processChunk <- function(fq_chunk) {
        # manipulate fastq reads here
    }

    yieldHelper <- function() {
        fq <- yield(fqstream)
        if (length(fq) == 0) return(NULL)
        fq
    }

    fqstream <- FastqStreamer( , n = 1e6)
    sclapply(yieldHelper, processChunk, max.parallel.jobs = 4)
    close(fqstream)

Based on the discussion linked above, it seems like there was some interest in integrating this idea into BiocParallel. I would find that very useful, as it improves performance quite a bit and can likely be applied to numerous stream-based processing tasks.

I will point out that in my case above, the processChunk() function doesn't return anything; instead it appends the modified fastq records to a new file. I have to use the Unix lockfile command to ensure that only one child process appends to the output file at a time. I am not certain whether there is a more elegant solution to this (perhaps a queue that is emptied by a dedicated writer process? I've put a rough sketch of that idea below my signature).

Thanks,
Jeff
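P.S. To make the "dedicated writer" idea concrete, here is a rough sketch of one way it could work without sclapply(): the master reads a small batch of chunks, farms them out with parallel::mclapply(), and then does all of the appending itself, so no lockfile is needed. This assumes processChunk() is changed to return the modified chunk rather than writing it, and the file names and batch size are placeholders:

    library(ShortRead)
    library(parallel)

    processChunk <- function(fq) {
        # modify the reads and their names here, then return the chunk
        fq
    }

    fqstream <- FastqStreamer("input.fastq.gz", n = 1e6)
    batchSize <- 4                       # chunks processed per round

    repeat {
        ## the master reads a batch of chunks serially
        chunks <- list()
        for (i in seq_len(batchSize)) {
            fq <- yield(fqstream)
            if (length(fq) == 0) break
            chunks[[i]] <- fq
        }
        if (length(chunks) == 0) break

        ## workers modify the chunks in parallel
        modified <- mclapply(chunks, processChunk, mc.cores = batchSize)

        ## only the master writes, so no lockfile is needed
        for (fq in modified)
            writeFastq(fq, "output.fastq.gz", mode = "a")
    }
    close(fqstream)

The drawback is that reading and writing happen while the workers are idle, so it won't overlap I/O and computation the way sclapply() can, but it does avoid the locking issue.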