Hi Hervé

It works great. Thanks for optimizing the function, it will be very
helpful when processing larger data sets.

Best wishes

Leonard


On Fri, Mar 7, 2014 at 8:49 AM, Hervé Pagès <hpa...@fhcrc.org> wrote:
> Hi Leonard,
>
> Sorry for the delay. I finally managed to spend some time optimizing
> readGAlignmentPairs(). In the latest GenomicAlignments (0.99.32),
> it's still using the new pairing algo but I got rid of some of the
> inefficiencies I introduced when I switched the code to using this
> algo. So now it's as fast as before the switch, but, thanks to the
> new pairing algo, it also uses less memory and supports reading the
> BAM file by chunk (via the 'yieldSize' arg of BamFile()).
>
> Cheers,
> H.
>
>
>
> On 12/20/2013 11:53 AM, Hervé Pagès wrote:
>>
>> Hi Leonard,
>>
>> What happened between GenomicAlignments 0.99.8 and
>> GenomicAlignments 0.99.10 is that I modified
>> readGAlignmentPairsFromBam() to use the new pairing algo
>> (implemented in scanBam()) instead of the pairing algo
>> implemented by findMateAlignment():
>>
>> ------------------------------------------------------------------------
>> r84106 | hpa...@fhcrc.org | 2013-12-09 23:49:43 -0800 (Mon, 09 Dec 2013)
>> | 3 lines
>>
>> Modify readGAlignmentPairsFromBam() to use
>> scanBam(BamFile(asMates=TRUE), ...)
>> instead of findMateAlignment() for the pairing.
>>
>> ------------------------------------------------------------------------
>>
>> The new pairing algo (implemented by Val and Martin) is fast and memory
>> efficient, and, last but not least, supports 'yieldSize' for reading the
>> BAM file by chunk. The same pairing algo is also used by
>> readGAlignmentsListFromBam().
>>
>> readGAlignmentPairsFromBam() actually calls readGAlignmentsListFromBam()
>> and then turns the returned GAlignmentsList object into a
>> GAlignmentPairs object. I need to do some timings but I suspect this
>> transformation is taking a little bit too long and there might be room
>> for optimization (like e.g. avoiding the GAlignmentsList intermediate
>> representation).
>>
>> I'll keep you updated.
>>
>> Thanks,
>> H.
>>
>>
>> On 12/19/2013 12:58 PM, Leonard Goldstein wrote:
>>>
>>> Hi Hervé
>>>
>>> Thanks for fixing this problem. It works fine now, however I did
>>> notice a drop in performance. For example reading in chromosome 1 can
>>> take about twice as long as previously (see below). Is this something
>>> that can be avoided?
>>>
>>> Many thanks again
>>>
>>> Leonard
>>>
>>> --
>>> GenomicAlignments 0.99.8
>>>
>>>> param <- ScanBamParam(which = GRanges("1", IRanges(1, 249250621)))
>>>>
>>>> system.time(readGAlignmentPairs(file, param = param))
>>>
>>>     user  system elapsed
>>> 161.282   7.812 169.442
>>>>
>>>> system.time(readGAlignmentPairs(file, param = param))
>>>
>>>     user  system elapsed
>>>   91.614   7.256  99.065
>>>>
>>>> system.time(readGAlignmentPairs(file, param = param))
>>>
>>>     user  system elapsed
>>>   89.285   7.461  96.940
>>>>
>>>>
>>>> sessionInfo()
>>>
>>> R Under development (unstable) (2013-12-03 r64376)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] GenomicAlignments_0.99.8 Rsamtools_1.15.15        Biostrings_2.31.5
>>> [4] GenomicRanges_1.15.17    XVector_0.3.5            IRanges_1.21.17
>>> [7] BiocGenerics_0.9.2
>>>
>>> loaded via a namespace (and not attached):
>>> [1] bitops_1.0-6   stats4_3.1.0   zlibbioc_1.9.0
>>>>
>>>>
>>>
>>> GenomicAlignments 0.99.10
>>>
>>>> param <- ScanBamParam(which = GRanges("1", IRanges(1, 249250621)))
>>>>
>>>> system.time(readGAlignmentPairs(file, param = param))
>>>
>>>     user  system elapsed
>>> 265.624   8.812 274.990
>>>>
>>>> system.time(readGAlignmentPairs(file, param = param))
>>>
>>>     user  system elapsed
>>> 249.724   7.177 257.399
>>>>
>>>> system.time(readGAlignmentPairs(file, param = param))
>>>
>>>     user  system elapsed
>>> 247.476   6.648 254.621
>>>>
>>>>
>>>> sessionInfo()
>>>
>>> R Under development (unstable) (2013-12-03 r64376)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] GenomicAlignments_0.99.10 Rsamtools_1.15.15
>>> [3] Biostrings_2.31.5         GenomicRanges_1.15.17
>>> [5] XVector_0.3.5             IRanges_1.21.17
>>> [7] BiocGenerics_0.9.2
>>>
>>> loaded via a namespace (and not attached):
>>> [1] bitops_1.0-6   stats4_3.1.0   zlibbioc_1.9.0
>>>>
>>>>
>>>
>>> On Thu, Dec 19, 2013 at 11:52 AM, Hervé Pagès <hpa...@fhcrc.org> wrote:
>>>>
>>>> Hi Leonard,
>>>>
>>>> This should be fixed in GenomicAlignments 0.99.10, which will become
>>>> available thru biocLite() in the next 24 hours or so. In the mean time
>>>> you can grab it directly from svn:
>>>>
>>>>
>>>>
>>>> https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/GenomicAlignments
>>>>
>>>>
>>>> Please let me know if you still run into problems with this.
>>>> Thanks!
>>>>
>>>>
>>>> H.
>>>>
>>>> On 12/19/2013 10:41 AM, Leonard Goldstein wrote:
>>>>>
>>>>>
>>>>> Hi Hervé
>>>>>
>>>>> You probably spotted this already but it looks like the problem is
>>>>> introduced between GenomicAlignments revisions r84052 (0.99.8) and
>>>>> r84106 (0.99.9)
>>>>>
>>>>> Best wishes
>>>>>
>>>>> Leonard
>>>>>
>>>>>
>>>>> On Wed, Dec 18, 2013 at 5:25 PM, Leonard Goldstein <golds...@gene.com>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Dear list,
>>>>>>
>>>>>> There seems to be a problem with the readGAlignmentPairs function:
>>>>>>
>>>>>> Querying genomic regions without any alignments using the which
>>>>>> argument results in an error -- see (1) below. Reading in a whole
>>>>>> chromosome takes indefinitely (or at least much longer than in
>>>>>> previous versions) -- see (2) below. I suspect these issues are not
>>>>>> specific to the BAM files I am working with but can provide test data
>>>>>> if required.
>>>>>>
>>>>>> Many thanks for your help.
>>>>>>
>>>>>> Leonard
>>>>>>
>>>>>>
>>>>>> --
>>>>>> (1) Attempts to read an empty region results in an error.
>>>>>>
>>>>>>> gr <- GRanges("22", IRanges(1000000, 2000000))
>>>>>>>
>>>>>>> param <- ScanBamParam(which = gr)
>>>>>>>
>>>>>>> readGAlignmentPairs(file, param = param)
>>>>>>
>>>>>>
>>>>>> Error in `elementMetadata<-`(x, ..., value = value) :
>>>>>>     replacement 'elementMetadata' value must be a DataTable object
>>>>>> or NULL
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> sessionInfo()
>>>>>>
>>>>>>
>>>>>> R Under development (unstable) (2013-12-03 r64376)
>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>>
>>>>>> locale:
>>>>>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>>>    [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] parallel  stats     graphics  grDevices utils     datasets
>>>>>> methods
>>>>>> [8] base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] GenomicAlignments_0.99.9 Rsamtools_1.15.15
>>>>>> Biostrings_2.31.5
>>>>>> [4] GenomicRanges_1.15.17    XVector_0.3.5            IRanges_1.21.16
>>>>>> [7] BiocGenerics_0.9.2       BiocInstaller_1.13.3
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] bitops_1.0-6   stats4_3.1.0   tools_3.1.0    zlibbioc_1.9.0
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> ... but works fine with previous version
>>>>>>
>>>>>>> gr <- GRanges("22", IRanges(1000000, 2000000))
>>>>>>>
>>>>>>> param <- ScanBamParam(which = gr)
>>>>>>>
>>>>>>> readGAlignmentPairs(file, param = param)
>>>>>>
>>>>>>
>>>>>> GAlignmentPairs with 0 alignment pairs and 0 metadata columns:
>>>>>>    seqnames strand :    ranges --    ranges
>>>>>>       <Rle>  <Rle> : <IRanges> -- <IRanges>
>>>>>> ---
>>>>>> seqlengths:
>>>>>>             1          2          3 ... GL000247.1 GL000248.1
>>>>>> GL000249.1
>>>>>>     249250621  243199373  198022430 ...      36422      39786
>>>>>> 38502
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> sessionInfo()
>>>>>>
>>>>>>
>>>>>> R version 3.0.0 (2013-04-03)
>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>>
>>>>>> locale:
>>>>>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>>>    [7] LC_PAPER=C                 LC_NAME=C
>>>>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] parallel  stats     graphics  grDevices utils     datasets
>>>>>> methods
>>>>>> [8] base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] Rsamtools_1.14.2     Biostrings_2.30.1    GenomicRanges_1.14.4
>>>>>> [4] XVector_0.2.0        IRanges_1.20.6       BiocGenerics_0.8.0
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] bitops_1.0-6   stats4_3.0.0   zlibbioc_1.8.0
>>>>>>
>>>>>>
>>>>>> (2) Use of the which argument covering chromosome 22 takes under one
>>>>>> minute with an earlier version of readGAlignmentPairs
>>>>>>
>>>>>>> gr <- GRanges("22", IRanges(1, 51304566))
>>>>>>>
>>>>>>> param <- ScanBamParam(which = gr)
>>>>>>>
>>>>>>> system.time(gap <- readGAlignmentPairs(file, param = param))
>>>>>>
>>>>>>
>>>>>>      user  system elapsed
>>>>>>    45.887   0.256  46.168
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> sessionInfo()
>>>>>>
>>>>>>
>>>>>> R version 3.0.0 (2013-04-03)
>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>>
>>>>>> locale:
>>>>>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>>>    [7] LC_PAPER=C                 LC_NAME=C
>>>>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] parallel  stats     graphics  grDevices utils     datasets
>>>>>> methods
>>>>>> [8] base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] Rsamtools_1.14.2     Biostrings_2.30.1    GenomicRanges_1.14.4
>>>>>> [4] XVector_0.2.0        IRanges_1.20.6       BiocGenerics_0.8.0
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] bitops_1.0-6   stats4_3.0.0   zlibbioc_1.8.0
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> ... but at least twenty times as long with the current version.
>>>>>>
>>>>>>> gr <- GRanges("22", IRanges(1, 51304566))
>>>>>>>
>>>>>>> param <- ScanBamParam(which = gr)
>>>>>>>
>>>>>>> system.time(gap <- readGAlignmentPairs(file, param = param))
>>>>>>
>>>>>>
>>>>>>
>>>>>> ^C
>>>>>> Timing stopped at: 1108.041 35.998 1144.006
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> sessionInfo()
>>>>>>
>>>>>>
>>>>>> R Under development (unstable) (2013-12-03 r64376)
>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>>
>>>>>> locale:
>>>>>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>>>    [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] parallel  stats     graphics  grDevices utils     datasets
>>>>>> methods
>>>>>> [8] base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] GenomicAlignments_0.99.9 Rsamtools_1.15.15
>>>>>> Biostrings_2.31.5
>>>>>> [4] GenomicRanges_1.15.17    XVector_0.3.5            IRanges_1.21.16
>>>>>> [7] BiocGenerics_0.9.2
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] bitops_1.0-6   stats4_3.1.0   zlibbioc_1.9.0
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel@r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>
>>>>
>>>> --
>>>> Hervé Pagès
>>>>
>>>> Program in Computational Biology
>>>> Division of Public Health Sciences
>>>> Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N, M1-B514
>>>> P.O. Box 19024
>>>> Seattle, WA 98109-1024
>>>>
>>>> E-mail: hpa...@fhcrc.org
>>>> Phone:  (206) 667-5791
>>>> Fax:    (206) 667-1319
>>
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpa...@fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to