Hi Hervé It works great. Thanks for optimizing the function, it will be very helpful when processing larger data sets.
Best wishes Leonard On Fri, Mar 7, 2014 at 8:49 AM, Hervé Pagès <hpa...@fhcrc.org> wrote: > Hi Leonard, > > Sorry for the delay. I finally managed to spend some time optimizing > readGAlignmentPairs(). In the latest GenomicAlignments (0.99.32), > it's still using the new pairing algo but I got rid of some of the > inefficiencies I introduced when I switched the code to using this > algo. So now it's as fast as before the switch, but, thanks to the > new pairing algo, it also uses less memory and supports reading the > BAM file by chunk (via the 'yieldSize' arg of BamFile()). > > Cheers, > H. > > > > On 12/20/2013 11:53 AM, Hervé Pagès wrote: >> >> Hi Leonard, >> >> What happened between GenomicAlignments 0.99.8 and >> GenomicAlignments 0.99.10 is that I modified >> readGAlignmentPairsFromBam() to use the new pairing algo >> (implemented in scanBam()) instead of the pairing algo >> implemented by findMateAlignment(): >> >> ------------------------------------------------------------------------ >> r84106 | hpa...@fhcrc.org | 2013-12-09 23:49:43 -0800 (Mon, 09 Dec 2013) >> | 3 lines >> >> Modify readGAlignmentPairsFromBam() to use >> scanBam(BamFile(asMates=TRUE), ...) >> instead of findMateAlignment() for the pairing. >> >> ------------------------------------------------------------------------ >> >> The new pairing algo (implemented by Val and Martin) is fast and memory >> efficient, and, last but not least, supports 'yieldSize' for reading the >> BAM file by chunk. The same pairing algo is also used by >> readGAlignmentsListFromBam(). >> >> readGAlignmentPairsFromBam() actually calls readGAlignmentsListFromBam() >> and then turns the returned GAlignmentsList object into a >> GAlignmentPairs object. I need to do some timings but I suspect this >> transformation is taking a little bit too long and there might be room >> for optimization (like e.g. avoiding the GAlignmentsList intermediate >> representation). >> >> I'll keep you updated. >> >> Thanks, >> H. >> >> >> On 12/19/2013 12:58 PM, Leonard Goldstein wrote: >>> >>> Hi Hervé >>> >>> Thanks for fixing this problem. It works fine now, however I did >>> notice a drop in performance. For example reading in chromosome 1 can >>> take about twice as long as previously (see below). Is this something >>> that can be avoided? >>> >>> Many thanks again >>> >>> Leonard >>> >>> -- >>> GenomicAlignments 0.99.8 >>> >>>> param <- ScanBamParam(which = GRanges("1", IRanges(1, 249250621))) >>>> >>>> system.time(readGAlignmentPairs(file, param = param)) >>> >>> user system elapsed >>> 161.282 7.812 169.442 >>>> >>>> system.time(readGAlignmentPairs(file, param = param)) >>> >>> user system elapsed >>> 91.614 7.256 99.065 >>>> >>>> system.time(readGAlignmentPairs(file, param = param)) >>> >>> user system elapsed >>> 89.285 7.461 96.940 >>>> >>>> >>>> sessionInfo() >>> >>> R Under development (unstable) (2013-12-03 r64376) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets methods >>> [8] base >>> >>> other attached packages: >>> [1] GenomicAlignments_0.99.8 Rsamtools_1.15.15 Biostrings_2.31.5 >>> [4] GenomicRanges_1.15.17 XVector_0.3.5 IRanges_1.21.17 >>> [7] BiocGenerics_0.9.2 >>> >>> loaded via a namespace (and not attached): >>> [1] bitops_1.0-6 stats4_3.1.0 zlibbioc_1.9.0 >>>> >>>> >>> >>> GenomicAlignments 0.99.10 >>> >>>> param <- ScanBamParam(which = GRanges("1", IRanges(1, 249250621))) >>>> >>>> system.time(readGAlignmentPairs(file, param = param)) >>> >>> user system elapsed >>> 265.624 8.812 274.990 >>>> >>>> system.time(readGAlignmentPairs(file, param = param)) >>> >>> user system elapsed >>> 249.724 7.177 257.399 >>>> >>>> system.time(readGAlignmentPairs(file, param = param)) >>> >>> user system elapsed >>> 247.476 6.648 254.621 >>>> >>>> >>>> sessionInfo() >>> >>> R Under development (unstable) (2013-12-03 r64376) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets methods >>> [8] base >>> >>> other attached packages: >>> [1] GenomicAlignments_0.99.10 Rsamtools_1.15.15 >>> [3] Biostrings_2.31.5 GenomicRanges_1.15.17 >>> [5] XVector_0.3.5 IRanges_1.21.17 >>> [7] BiocGenerics_0.9.2 >>> >>> loaded via a namespace (and not attached): >>> [1] bitops_1.0-6 stats4_3.1.0 zlibbioc_1.9.0 >>>> >>>> >>> >>> On Thu, Dec 19, 2013 at 11:52 AM, Hervé Pagès <hpa...@fhcrc.org> wrote: >>>> >>>> Hi Leonard, >>>> >>>> This should be fixed in GenomicAlignments 0.99.10, which will become >>>> available thru biocLite() in the next 24 hours or so. In the mean time >>>> you can grab it directly from svn: >>>> >>>> >>>> >>>> https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/GenomicAlignments >>>> >>>> >>>> Please let me know if you still run into problems with this. >>>> Thanks! >>>> >>>> >>>> H. >>>> >>>> On 12/19/2013 10:41 AM, Leonard Goldstein wrote: >>>>> >>>>> >>>>> Hi Hervé >>>>> >>>>> You probably spotted this already but it looks like the problem is >>>>> introduced between GenomicAlignments revisions r84052 (0.99.8) and >>>>> r84106 (0.99.9) >>>>> >>>>> Best wishes >>>>> >>>>> Leonard >>>>> >>>>> >>>>> On Wed, Dec 18, 2013 at 5:25 PM, Leonard Goldstein <golds...@gene.com> >>>>> wrote: >>>>>> >>>>>> >>>>>> Dear list, >>>>>> >>>>>> There seems to be a problem with the readGAlignmentPairs function: >>>>>> >>>>>> Querying genomic regions without any alignments using the which >>>>>> argument results in an error -- see (1) below. Reading in a whole >>>>>> chromosome takes indefinitely (or at least much longer than in >>>>>> previous versions) -- see (2) below. I suspect these issues are not >>>>>> specific to the BAM files I am working with but can provide test data >>>>>> if required. >>>>>> >>>>>> Many thanks for your help. >>>>>> >>>>>> Leonard >>>>>> >>>>>> >>>>>> -- >>>>>> (1) Attempts to read an empty region results in an error. >>>>>> >>>>>>> gr <- GRanges("22", IRanges(1000000, 2000000)) >>>>>>> >>>>>>> param <- ScanBamParam(which = gr) >>>>>>> >>>>>>> readGAlignmentPairs(file, param = param) >>>>>> >>>>>> >>>>>> Error in `elementMetadata<-`(x, ..., value = value) : >>>>>> replacement 'elementMetadata' value must be a DataTable object >>>>>> or NULL >>>>>>> >>>>>>> >>>>>>> >>>>>>> sessionInfo() >>>>>> >>>>>> >>>>>> R Under development (unstable) (2013-12-03 r64376) >>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>> >>>>>> locale: >>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>>>> >>>>>> attached base packages: >>>>>> [1] parallel stats graphics grDevices utils datasets >>>>>> methods >>>>>> [8] base >>>>>> >>>>>> other attached packages: >>>>>> [1] GenomicAlignments_0.99.9 Rsamtools_1.15.15 >>>>>> Biostrings_2.31.5 >>>>>> [4] GenomicRanges_1.15.17 XVector_0.3.5 IRanges_1.21.16 >>>>>> [7] BiocGenerics_0.9.2 BiocInstaller_1.13.3 >>>>>> >>>>>> loaded via a namespace (and not attached): >>>>>> [1] bitops_1.0-6 stats4_3.1.0 tools_3.1.0 zlibbioc_1.9.0 >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> ... but works fine with previous version >>>>>> >>>>>>> gr <- GRanges("22", IRanges(1000000, 2000000)) >>>>>>> >>>>>>> param <- ScanBamParam(which = gr) >>>>>>> >>>>>>> readGAlignmentPairs(file, param = param) >>>>>> >>>>>> >>>>>> GAlignmentPairs with 0 alignment pairs and 0 metadata columns: >>>>>> seqnames strand : ranges -- ranges >>>>>> <Rle> <Rle> : <IRanges> -- <IRanges> >>>>>> --- >>>>>> seqlengths: >>>>>> 1 2 3 ... GL000247.1 GL000248.1 >>>>>> GL000249.1 >>>>>> 249250621 243199373 198022430 ... 36422 39786 >>>>>> 38502 >>>>>>> >>>>>>> >>>>>>> >>>>>>> sessionInfo() >>>>>> >>>>>> >>>>>> R version 3.0.0 (2013-04-03) >>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>> >>>>>> locale: >>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>>>> >>>>>> attached base packages: >>>>>> [1] parallel stats graphics grDevices utils datasets >>>>>> methods >>>>>> [8] base >>>>>> >>>>>> other attached packages: >>>>>> [1] Rsamtools_1.14.2 Biostrings_2.30.1 GenomicRanges_1.14.4 >>>>>> [4] XVector_0.2.0 IRanges_1.20.6 BiocGenerics_0.8.0 >>>>>> >>>>>> loaded via a namespace (and not attached): >>>>>> [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.8.0 >>>>>> >>>>>> >>>>>> (2) Use of the which argument covering chromosome 22 takes under one >>>>>> minute with an earlier version of readGAlignmentPairs >>>>>> >>>>>>> gr <- GRanges("22", IRanges(1, 51304566)) >>>>>>> >>>>>>> param <- ScanBamParam(which = gr) >>>>>>> >>>>>>> system.time(gap <- readGAlignmentPairs(file, param = param)) >>>>>> >>>>>> >>>>>> user system elapsed >>>>>> 45.887 0.256 46.168 >>>>>>> >>>>>>> >>>>>>> >>>>>>> sessionInfo() >>>>>> >>>>>> >>>>>> R version 3.0.0 (2013-04-03) >>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>> >>>>>> locale: >>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>>>> >>>>>> attached base packages: >>>>>> [1] parallel stats graphics grDevices utils datasets >>>>>> methods >>>>>> [8] base >>>>>> >>>>>> other attached packages: >>>>>> [1] Rsamtools_1.14.2 Biostrings_2.30.1 GenomicRanges_1.14.4 >>>>>> [4] XVector_0.2.0 IRanges_1.20.6 BiocGenerics_0.8.0 >>>>>> >>>>>> loaded via a namespace (and not attached): >>>>>> [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.8.0 >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> ... but at least twenty times as long with the current version. >>>>>> >>>>>>> gr <- GRanges("22", IRanges(1, 51304566)) >>>>>>> >>>>>>> param <- ScanBamParam(which = gr) >>>>>>> >>>>>>> system.time(gap <- readGAlignmentPairs(file, param = param)) >>>>>> >>>>>> >>>>>> >>>>>> ^C >>>>>> Timing stopped at: 1108.041 35.998 1144.006 >>>>>>> >>>>>>> >>>>>>> >>>>>>> sessionInfo() >>>>>> >>>>>> >>>>>> R Under development (unstable) (2013-12-03 r64376) >>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>> >>>>>> locale: >>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>>>> >>>>>> attached base packages: >>>>>> [1] parallel stats graphics grDevices utils datasets >>>>>> methods >>>>>> [8] base >>>>>> >>>>>> other attached packages: >>>>>> [1] GenomicAlignments_0.99.9 Rsamtools_1.15.15 >>>>>> Biostrings_2.31.5 >>>>>> [4] GenomicRanges_1.15.17 XVector_0.3.5 IRanges_1.21.16 >>>>>> [7] BiocGenerics_0.9.2 >>>>>> >>>>>> loaded via a namespace (and not attached): >>>>>> [1] bitops_1.0-6 stats4_3.1.0 zlibbioc_1.9.0 >>>>>>> >>>>>>> >>>>>>> >>>>> >>>>> _______________________________________________ >>>>> Bioc-devel@r-project.org mailing list >>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel >>>>> >>>> >>>> -- >>>> Hervé Pagès >>>> >>>> Program in Computational Biology >>>> Division of Public Health Sciences >>>> Fred Hutchinson Cancer Research Center >>>> 1100 Fairview Ave. N, M1-B514 >>>> P.O. Box 19024 >>>> Seattle, WA 98109-1024 >>>> >>>> E-mail: hpa...@fhcrc.org >>>> Phone: (206) 667-5791 >>>> Fax: (206) 667-1319 >> >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpa...@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel