Hi Leonard,

What happened between GenomicAlignments 0.99.8 and
GenomicAlignments 0.99.10 is that I modified
readGAlignmentPairsFromBam() to use the new pairing algo
(implemented in scanBam()) instead of the pairing algo
implemented by findMateAlignment():

------------------------------------------------------------------------
r84106 | hpa...@fhcrc.org | 2013-12-09 23:49:43 -0800 (Mon, 09 Dec 2013) | 3 lines

Modify readGAlignmentPairsFromBam() to use scanBam(BamFile(asMates=TRUE), ...)
instead of findMateAlignment() for the pairing.

------------------------------------------------------------------------

The new pairing algo (implemented by Val and Martin) is fast and memory
efficient, and, last but not least, supports 'yieldSize' for reading the
BAM file by chunk. The same pairing algo is also used by
readGAlignmentsListFromBam().

readGAlignmentPairsFromBam() actually calls readGAlignmentsListFromBam()
and then turns the returned GAlignmentsList object into a
GAlignmentPairs object. I need to do some timings but I suspect this
transformation is taking a little bit too long and there might be room
for optimization (like e.g. avoiding the GAlignmentsList intermediate
representation).

I'll keep you updated.

Thanks,
H.


On 12/19/2013 12:58 PM, Leonard Goldstein wrote:
Hi Hervé

Thanks for fixing this problem. It works fine now, however I did
notice a drop in performance. For example reading in chromosome 1 can
take about twice as long as previously (see below). Is this something
that can be avoided?

Many thanks again

Leonard

--
GenomicAlignments 0.99.8

param <- ScanBamParam(which = GRanges("1", IRanges(1, 249250621)))

system.time(readGAlignmentPairs(file, param = param))
    user  system elapsed
161.282   7.812 169.442
system.time(readGAlignmentPairs(file, param = param))
    user  system elapsed
  91.614   7.256  99.065
system.time(readGAlignmentPairs(file, param = param))
    user  system elapsed
  89.285   7.461  96.940

sessionInfo()
R Under development (unstable) (2013-12-03 r64376)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] GenomicAlignments_0.99.8 Rsamtools_1.15.15        Biostrings_2.31.5
[4] GenomicRanges_1.15.17    XVector_0.3.5            IRanges_1.21.17
[7] BiocGenerics_0.9.2

loaded via a namespace (and not attached):
[1] bitops_1.0-6   stats4_3.1.0   zlibbioc_1.9.0


GenomicAlignments 0.99.10

param <- ScanBamParam(which = GRanges("1", IRanges(1, 249250621)))

system.time(readGAlignmentPairs(file, param = param))
    user  system elapsed
265.624   8.812 274.990
system.time(readGAlignmentPairs(file, param = param))
    user  system elapsed
249.724   7.177 257.399
system.time(readGAlignmentPairs(file, param = param))
    user  system elapsed
247.476   6.648 254.621

sessionInfo()
R Under development (unstable) (2013-12-03 r64376)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] GenomicAlignments_0.99.10 Rsamtools_1.15.15
[3] Biostrings_2.31.5         GenomicRanges_1.15.17
[5] XVector_0.3.5             IRanges_1.21.17
[7] BiocGenerics_0.9.2

loaded via a namespace (and not attached):
[1] bitops_1.0-6   stats4_3.1.0   zlibbioc_1.9.0


On Thu, Dec 19, 2013 at 11:52 AM, Hervé Pagès <hpa...@fhcrc.org> wrote:
Hi Leonard,

This should be fixed in GenomicAlignments 0.99.10, which will become
available thru biocLite() in the next 24 hours or so. In the mean time
you can grab it directly from svn:


https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/GenomicAlignments

Please let me know if you still run into problems with this.
Thanks!


H.

On 12/19/2013 10:41 AM, Leonard Goldstein wrote:

Hi Hervé

You probably spotted this already but it looks like the problem is
introduced between GenomicAlignments revisions r84052 (0.99.8) and
r84106 (0.99.9)

Best wishes

Leonard


On Wed, Dec 18, 2013 at 5:25 PM, Leonard Goldstein <golds...@gene.com>
wrote:

Dear list,

There seems to be a problem with the readGAlignmentPairs function:

Querying genomic regions without any alignments using the which
argument results in an error -- see (1) below. Reading in a whole
chromosome takes indefinitely (or at least much longer than in
previous versions) -- see (2) below. I suspect these issues are not
specific to the BAM files I am working with but can provide test data
if required.

Many thanks for your help.

Leonard


--
(1) Attempts to read an empty region results in an error.

gr <- GRanges("22", IRanges(1000000, 2000000))

param <- ScanBamParam(which = gr)

readGAlignmentPairs(file, param = param)

Error in `elementMetadata<-`(x, ..., value = value) :
    replacement 'elementMetadata' value must be a DataTable object or NULL


sessionInfo()

R Under development (unstable) (2013-12-03 r64376)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
   [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] GenomicAlignments_0.99.9 Rsamtools_1.15.15        Biostrings_2.31.5
[4] GenomicRanges_1.15.17    XVector_0.3.5            IRanges_1.21.16
[7] BiocGenerics_0.9.2       BiocInstaller_1.13.3

loaded via a namespace (and not attached):
[1] bitops_1.0-6   stats4_3.1.0   tools_3.1.0    zlibbioc_1.9.0



... but works fine with previous version

gr <- GRanges("22", IRanges(1000000, 2000000))

param <- ScanBamParam(which = gr)

readGAlignmentPairs(file, param = param)

GAlignmentPairs with 0 alignment pairs and 0 metadata columns:
   seqnames strand :    ranges --    ranges
      <Rle>  <Rle> : <IRanges> -- <IRanges>
---
seqlengths:
            1          2          3 ... GL000247.1 GL000248.1 GL000249.1
    249250621  243199373  198022430 ...      36422      39786      38502


sessionInfo()

R version 3.0.0 (2013-04-03)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
   [7] LC_PAPER=C                 LC_NAME=C
   [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] Rsamtools_1.14.2     Biostrings_2.30.1    GenomicRanges_1.14.4
[4] XVector_0.2.0        IRanges_1.20.6       BiocGenerics_0.8.0

loaded via a namespace (and not attached):
[1] bitops_1.0-6   stats4_3.0.0   zlibbioc_1.8.0


(2) Use of the which argument covering chromosome 22 takes under one
minute with an earlier version of readGAlignmentPairs

gr <- GRanges("22", IRanges(1, 51304566))

param <- ScanBamParam(which = gr)

system.time(gap <- readGAlignmentPairs(file, param = param))

     user  system elapsed
   45.887   0.256  46.168


sessionInfo()

R version 3.0.0 (2013-04-03)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
   [7] LC_PAPER=C                 LC_NAME=C
   [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] Rsamtools_1.14.2     Biostrings_2.30.1    GenomicRanges_1.14.4
[4] XVector_0.2.0        IRanges_1.20.6       BiocGenerics_0.8.0

loaded via a namespace (and not attached):
[1] bitops_1.0-6   stats4_3.0.0   zlibbioc_1.8.0



... but at least twenty times as long with the current version.

gr <- GRanges("22", IRanges(1, 51304566))

param <- ScanBamParam(which = gr)

system.time(gap <- readGAlignmentPairs(file, param = param))


^C
Timing stopped at: 1108.041 35.998 1144.006


sessionInfo()

R Under development (unstable) (2013-12-03 r64376)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
   [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] GenomicAlignments_0.99.9 Rsamtools_1.15.15        Biostrings_2.31.5
[4] GenomicRanges_1.15.17    XVector_0.3.5            IRanges_1.21.16
[7] BiocGenerics_0.9.2

loaded via a namespace (and not attached):
[1] bitops_1.0-6   stats4_3.1.0   zlibbioc_1.9.0



_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to