Hi Leonard, Sorry for the delay. I finally managed to spend some time optimizing readGAlignmentPairs(). In the latest GenomicAlignments (0.99.32), it's still using the new pairing algo but I got rid of some of the inefficiencies I introduced when I switched the code to using this algo. So now it's as fast as before the switch, but, thanks to the new pairing algo, it also uses less memory and supports reading the BAM file by chunk (via the 'yieldSize' arg of BamFile()).
Cheers, H. On 12/20/2013 11:53 AM, Hervé Pagès wrote:
Hi Leonard, What happened between GenomicAlignments 0.99.8 and GenomicAlignments 0.99.10 is that I modified readGAlignmentPairsFromBam() to use the new pairing algo (implemented in scanBam()) instead of the pairing algo implemented by findMateAlignment(): ------------------------------------------------------------------------ r84106 | hpa...@fhcrc.org | 2013-12-09 23:49:43 -0800 (Mon, 09 Dec 2013) | 3 lines Modify readGAlignmentPairsFromBam() to use scanBam(BamFile(asMates=TRUE), ...) instead of findMateAlignment() for the pairing. ------------------------------------------------------------------------ The new pairing algo (implemented by Val and Martin) is fast and memory efficient, and, last but not least, supports 'yieldSize' for reading the BAM file by chunk. The same pairing algo is also used by readGAlignmentsListFromBam(). readGAlignmentPairsFromBam() actually calls readGAlignmentsListFromBam() and then turns the returned GAlignmentsList object into a GAlignmentPairs object. I need to do some timings but I suspect this transformation is taking a little bit too long and there might be room for optimization (like e.g. avoiding the GAlignmentsList intermediate representation). I'll keep you updated. Thanks, H. On 12/19/2013 12:58 PM, Leonard Goldstein wrote:Hi Hervé Thanks for fixing this problem. It works fine now, however I did notice a drop in performance. For example reading in chromosome 1 can take about twice as long as previously (see below). Is this something that can be avoided? Many thanks again Leonard -- GenomicAlignments 0.99.8param <- ScanBamParam(which = GRanges("1", IRanges(1, 249250621))) system.time(readGAlignmentPairs(file, param = param))user system elapsed 161.282 7.812 169.442system.time(readGAlignmentPairs(file, param = param))user system elapsed 91.614 7.256 99.065system.time(readGAlignmentPairs(file, param = param))user system elapsed 89.285 7.461 96.940sessionInfo()R Under development (unstable) (2013-12-03 r64376) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] GenomicAlignments_0.99.8 Rsamtools_1.15.15 Biostrings_2.31.5 [4] GenomicRanges_1.15.17 XVector_0.3.5 IRanges_1.21.17 [7] BiocGenerics_0.9.2 loaded via a namespace (and not attached): [1] bitops_1.0-6 stats4_3.1.0 zlibbioc_1.9.0GenomicAlignments 0.99.10param <- ScanBamParam(which = GRanges("1", IRanges(1, 249250621))) system.time(readGAlignmentPairs(file, param = param))user system elapsed 265.624 8.812 274.990system.time(readGAlignmentPairs(file, param = param))user system elapsed 249.724 7.177 257.399system.time(readGAlignmentPairs(file, param = param))user system elapsed 247.476 6.648 254.621sessionInfo()R Under development (unstable) (2013-12-03 r64376) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] GenomicAlignments_0.99.10 Rsamtools_1.15.15 [3] Biostrings_2.31.5 GenomicRanges_1.15.17 [5] XVector_0.3.5 IRanges_1.21.17 [7] BiocGenerics_0.9.2 loaded via a namespace (and not attached): [1] bitops_1.0-6 stats4_3.1.0 zlibbioc_1.9.0On Thu, Dec 19, 2013 at 11:52 AM, Hervé Pagès <hpa...@fhcrc.org> wrote:Hi Leonard, This should be fixed in GenomicAlignments 0.99.10, which will become available thru biocLite() in the next 24 hours or so. In the mean time you can grab it directly from svn: https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/GenomicAlignments Please let me know if you still run into problems with this. Thanks! H. On 12/19/2013 10:41 AM, Leonard Goldstein wrote:Hi Hervé You probably spotted this already but it looks like the problem is introduced between GenomicAlignments revisions r84052 (0.99.8) and r84106 (0.99.9) Best wishes Leonard On Wed, Dec 18, 2013 at 5:25 PM, Leonard Goldstein <golds...@gene.com> wrote:Dear list, There seems to be a problem with the readGAlignmentPairs function: Querying genomic regions without any alignments using the which argument results in an error -- see (1) below. Reading in a whole chromosome takes indefinitely (or at least much longer than in previous versions) -- see (2) below. I suspect these issues are not specific to the BAM files I am working with but can provide test data if required. Many thanks for your help. Leonard -- (1) Attempts to read an empty region results in an error.gr <- GRanges("22", IRanges(1000000, 2000000)) param <- ScanBamParam(which = gr) readGAlignmentPairs(file, param = param)Error in `elementMetadata<-`(x, ..., value = value) : replacement 'elementMetadata' value must be a DataTable object or NULLsessionInfo()R Under development (unstable) (2013-12-03 r64376) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] GenomicAlignments_0.99.9 Rsamtools_1.15.15 Biostrings_2.31.5 [4] GenomicRanges_1.15.17 XVector_0.3.5 IRanges_1.21.16 [7] BiocGenerics_0.9.2 BiocInstaller_1.13.3 loaded via a namespace (and not attached): [1] bitops_1.0-6 stats4_3.1.0 tools_3.1.0 zlibbioc_1.9.0... but works fine with previous versiongr <- GRanges("22", IRanges(1000000, 2000000)) param <- ScanBamParam(which = gr) readGAlignmentPairs(file, param = param)GAlignmentPairs with 0 alignment pairs and 0 metadata columns: seqnames strand : ranges -- ranges <Rle> <Rle> : <IRanges> -- <IRanges> --- seqlengths: 1 2 3 ... GL000247.1 GL000248.1 GL000249.1 249250621 243199373 198022430 ... 36422 39786 38502sessionInfo()R version 3.0.0 (2013-04-03) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] Rsamtools_1.14.2 Biostrings_2.30.1 GenomicRanges_1.14.4 [4] XVector_0.2.0 IRanges_1.20.6 BiocGenerics_0.8.0 loaded via a namespace (and not attached): [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.8.0 (2) Use of the which argument covering chromosome 22 takes under one minute with an earlier version of readGAlignmentPairsgr <- GRanges("22", IRanges(1, 51304566)) param <- ScanBamParam(which = gr) system.time(gap <- readGAlignmentPairs(file, param = param))user system elapsed 45.887 0.256 46.168sessionInfo()R version 3.0.0 (2013-04-03) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] Rsamtools_1.14.2 Biostrings_2.30.1 GenomicRanges_1.14.4 [4] XVector_0.2.0 IRanges_1.20.6 BiocGenerics_0.8.0 loaded via a namespace (and not attached): [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.8.0... but at least twenty times as long with the current version.gr <- GRanges("22", IRanges(1, 51304566)) param <- ScanBamParam(which = gr) system.time(gap <- readGAlignmentPairs(file, param = param))^C Timing stopped at: 1108.041 35.998 1144.006sessionInfo()R Under development (unstable) (2013-12-03 r64376) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] GenomicAlignments_0.99.9 Rsamtools_1.15.15 Biostrings_2.31.5 [4] GenomicRanges_1.15.17 XVector_0.3.5 IRanges_1.21.16 [7] BiocGenerics_0.9.2 loaded via a namespace (and not attached): [1] bitops_1.0-6 stats4_3.1.0 zlibbioc_1.9.0_______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel-- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
-- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319 _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel