----- Nicolas Delhomme <delho...@embl.de> wrote: > Hej Bioc Core! > > There was some discussion last year about implementing a BamStreamer (à la > FastqStreamer), but I haven't seen anything like it in the current devel. > I've implemented the following function that should do the job for me - I > have many very large files, and I need to use a cluster with relatively few > RAM per node and a restrictive time allocation , so I want to parallelize the > reading of the BAM file to manage both. The example below is obviously not > affecting the RAM issue but I streamlined it to point out my issue.
Hi Nico -- I had viewed the 'BamStreamer' functionality to be implemented by bf <- open(BamFile("foo.bam", yieldSize=1234567)) while (length(x <- readGAlignments(bf))) {} close(bf) paradigm, without wanting to wrap it further (FastqStreamer isn't any more convenient) because the tough work will be done in {} and one would have to create some sort of complicated rule about function signatures for implementing this as a call-back. I usually implement '{}' as a reduction where all the work is done (as with summarizeOverlaps, where the bam data is reduced to a count vector and then summed across iterations through the loop) -- it doesn't seem like there's any memory management gains to be had by concatenating GappedAlignments (these are renamed GAlignments in devel) together; usually I'd parallelize over files bfl <- open(BamFileList(fls, yieldSize=1234567))) result <- mclapply(bfl, function(bf, anno, ...) { ## initialize, e.g., cnt <- integer(length(anno))) while(length(x <- readGAlignments(bf))) {} ## return, e.g., anno }) close(bfl) This doesn't address the efficiency of appending GAlignments. Martin > > ".stream" <- function(bamFile,yieldSize=100000,verbose=FALSE){ > > ## create a stream > stopifnot(is(bamFile,"BamFile")) > > ## set the yieldSize if it is not set already > if(is.na(yieldSize(bamFile))){ > yieldSize(bamFile) <- yieldSize > } > > ## open it > open(bamFile) > > ## verb > if(verbose){ > message(paste("Streaming",basename(path(bamFile)))) > } > > ## create the output > out <- GappedAlignments() > > ## process it > while(length(chunk <- readBamGappedAlignments(bamFile))){ > if(verbose){ > message(paste("Processed",length(chunk),"reads")) > } > out <- c(out,chunk) > } > > ## close > close(bamFile) > > ## return > return(out) > } > > In the method above, the first iteration of combining the GappedAlignments: > > out <- c(out,chunk) takes: > > system.time(append(out,chunk)) > > user system elapsed > 123.704 0.060 124.011 > > whereas the second iteration (faked here) takes only (still long): > > system.time(append(chunk,chunk)) > > user system elapsed > 2.708 0.044 2.758 > > I suppose this has to do with the way > GenomicRanges:::unlist_list_of_GappedAlignments deals with combining the > objects and all the related sanity checks. For the first iteration, the > seqlengths are different so I suppose that is what explains the 60X lag > compared to the second iteration. Due to the implementation of > GappedAlignments, I can't set the seqlengths programmatically in > GappedAlignments() which I imagine would have reduced the first iteration > lag; see the trials below: > > out <- GappedAlignments(seqlengths=seqlengths(chunk)) > > Error in GappedAlignments(seqlengths = seqlengths(chunk)) : > 'names(seqlengths)' incompatible with 'levels(seqnames)' > > out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk)) > > Error in GappedAlignments(seqlengths = seqlengths(chunk), seqnames = > seqnames(chunk)) : > 'strand' must be specified when 'seqnames' is not empty > > out <- > GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk),strand="+") > > Error in validObject(.Object) : > invalid class “GappedAlignments” object: 1: invalid object for slot "strand" > in class "GappedAlignments": got class "character", should be or extend class > "Rle" > invalid class “GappedAlignments” object: 2: number of rows in DataTable > 'mcols(x)' must match length of 'x' > > I completely approve of such sanity checks; it seems that I'm just trying to > do something that it was not designed for :-) All I'm really interested in is > a way to stream my BAM file and I'm looking forward to any suggestion. I > especially don't want to re-invent the wheel if you have already planned > something. If you haven't I'd be glad to get some insight how I can walk > around that problem. > > My sessionInfo: > > R version 3.0.1 (2013-05-16) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 > [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] BiocInstaller_1.11.3 Rsamtools_1.13.22 Biostrings_2.29.12 > [4] GenomicRanges_1.13.26 XVector_0.1.0 IRanges_1.19.15 > [7] BiocGenerics_0.7.2 > > loaded via a namespace (and not attached): > [1] bitops_1.0-5 stats4_3.0.1 zlibbioc_1.7.0 > > Cheers, > > Nico > > --------------------------------------------------------------- > Nicolas Delhomme > > Genome Biology Computational Support > > European Molecular Biology Laboratory > > Tel: +49 6221 387 8310 > Email: nicolas.delho...@embl.de > Meyerhofstrasse 1 - Postfach 10.2209 > 69102 Heidelberg, Germany > > _______________________________________________ > Bioc-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel