I should also mention that when both 'yieldSize' in the BamFile and 'which' in ScanBamParam are set the 'which' gets priority. The point of 'yieldSize' is to provide a reasonable chunk for processing the file. When 'which' is provided it's assumed that range is of reasonable chunk size so the yield is ignored.

I've added this info to the 'summarizeOverlaps' and 'readGAlignments' man pages in GenomicAlignments.

Valerie

On 03/27/14 08:30, Valerie Obenchain wrote:
Hi Mike,

This is fixed in Rsamtools 1.15.35.

The bug was related to when the mate-pairing was performed wrt meeting
the 'yieldSize' requirement. Thanks for sending the file and
reproducible example.

The file has ~115 million records:

fl <- "wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam"
countBam(fl)$records
 [1] 114943975

To process the complete file with a yield size of 1e6 took ~ 18 GIG and
25 minutes. (ubuntu server, 16 processors, 387 GIG of ram)

bf <- BamFile(fl, yieldSize=1000000, asMates=TRUE)
grl <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene")
SO <- function(x)
     summarizeOverlaps(grl, x, ignore.strand=TRUE, singleEnd=FALSE)

system.time(SO(bf))
     user   system  elapsed
 1545.684   12.412 1558.498

Thanks for reporting the bug.

Valerie



On 03/21/14 13:55, Michael Love wrote:
hi Valerie,

Thanks. I'm trying now to make use of the new mate pairing algorithm but
keeping running out of memory (i requested 50 Gb) and getting my job
killed. I wonder if you could try this code/example below?

If the new C code is faster for paired-end than the
pre-sorting/obeyQname method that would be great (eliminates the need to
have the extra qname-sorted Bam), but it seems to me that with the old
method, it was easier to specify hard limits on memory. Maybe I am
missing something though. :)

Here's an RNA-Seq sample from Encode, and then I run samtools index
locally.

from:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/


I download the hg19 paired end reads:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam


library("GenomicFeatures")
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
grl <- exonsBy(txdb, by="gene")
library("Rsamtools")
bamFile <- "wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam"
bf <- BamFile(bamFile, yieldSize=1000000, asMates=TRUE)
library("GenomicAlignments")
system.time({so <- summarizeOverlaps(grl, bf,
                                      ignore.strand=TRUE,
                                      singleEnd=FALSE)})



 > sessionInfo()
R Under development (unstable) (2014-03-18 r65213)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
  [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.14.0
  [2] GenomicFeatures_1.15.11
  [3] AnnotationDbi_1.25.15
  [4] Biobase_2.23.6
  [5] GenomicAlignments_0.99.32
  [6] BSgenome_1.31.12
  [7] Rsamtools_1.15.33
  [8] Biostrings_2.31.14
  [9] XVector_0.3.7
[10] GenomicRanges_1.15.39
[11] GenomeInfoDb_0.99.19
[12] IRanges_1.21.34
[13] BiocGenerics_0.9.3
[14] Defaults_1.1-1
[15] devtools_1.4.1
[16] knitr_1.5
[17] BiocInstaller_1.13.3

loaded via a namespace (and not attached):
  [1] BatchJobs_1.2       BBmisc_1.5          BiocParallel_0.5.17
  [4] biomaRt_2.19.3      bitops_1.0-6        brew_1.0-6
  [7] codetools_0.2-8     DBI_0.2-7           digest_0.6.4
[10] evaluate_0.5.1      fail_1.2            foreach_1.4.1
[13] formatR_0.10        httr_0.2            iterators_1.0.6
[16] memoise_0.1         plyr_1.8.1          Rcpp_0.11.1
[19] RCurl_1.95-4.1      RSQLite_0.11.4      rtracklayer_1.23.18
[22] sendmailR_1.1-2     stats4_3.2.0        stringr_0.6.2
[25] tools_3.2.0         whisker_0.3-2       XML_3.98-1.1
[28] zlibbioc_1.9.0
 >





On Wed, Mar 19, 2014 at 2:00 PM, Valerie Obenchain <voben...@fhcrc.org
<mailto:voben...@fhcrc.org>> wrote:

    On 03/19/14 10:24, Michael Love wrote:

        hi Valerie,

        If the Bam is not sorted by name, isn't it possible that
        readGAlignment*
        will load > yieldSize number of reads in order to find the mate?


    Sorry, our emails keep criss-crossing.

    Because the mate-pairing is now done in C yieldSize is no longer a
    constraint.

    When yieldSize is given, say 100, then 100 mates are returned. The
    algo moves through the file until 100 records are successfully
    paired. These 100 are yielded to the user and the code picks up
    where it left off. A related situation is the 'which' in a param. In
    this case you want the mates in a particular range. The algo moves
    through the range and pairs what it can. If it's left with unmated
    records it goes outside the range looking for them.
    readGAlignmentsList() will return all records in this range, mated
    or not. The metadata column of 'mate_status' indicates the different
    groupings.

    Valerie



        Mike


        On Wed, Mar 19, 2014 at 1:04 PM, Valerie Obenchain
        <voben...@fhcrc.org <mailto:voben...@fhcrc.org>
        <mailto:voben...@fhcrc.org <mailto:voben...@fhcrc.org>>> wrote:

             Hi Mike,

             You no longer need to sort Bam files to use the pairing
algo or
             yieldSize. The readGAlignment* functions now work with both
             constraints out of the box.

             Create a BamFile with yieldSize and indicate you want mates.
             bf <- BamFile(fl, yieldSize=10000, asMates=TRUE)

             Maybe set some specifications in a param:
             param <- ScanBamParam(what = c("qname", "flag"))

             Then call either readGAlignment* method that handles pairs:
             readGAlignmentsList(bf, param=param)
             readGAlignmentPairs(bf, param=param)

             For summarizeOverlaps():
             summarizeOverlaps(annotation, bf, param=param,
singleEnd=FALSE)

             We've considered removing the 'obeyQname' arg and
        documentation but
             thought the concept may be useful in another application.
I'll
             revisit the summarizeOverlaps() documentation to make sure
             'obeyQname' is downplayed and 'asMates' is encouraged.

             Valerie




             On 03/19/14 07:39, Michael Love wrote:

                 hi,

                      From last year, in order to use yieldSize with
        paired-end
                     BAMs, I

                 should sort the BAMs by qname and then use the
        following call to
                 BamFile:

                 library(pasillaBamSubset)
                 fl <- sortBam(untreated3_chr4(), tempfile(),
byQname=TRUE)
                 bf <- BamFile(fl, index=character(0), yieldSize=3,
        obeyQname=TRUE)


https://stat.ethz.ch/____pipermail/bioconductor/2013-____March/051490.html


<https://stat.ethz.ch/__pipermail/bioconductor/2013-__March/051490.html>


<https://stat.ethz.ch/__pipermail/bioconductor/2013-__March/051490.html

<https://stat.ethz.ch/pipermail/bioconductor/2013-March/051490.html>>

                 If I want to use
        GenomicAlignments::____readGAlignmentsList with

                 asMates=TRUE and respecting the yieldSize, what is the
        proper
                 construction? (in the end, I want to use
        summarizeOverlaps on
                 paired-end BAMs while respecting the yieldSize)

                 library(pasillaBamSubset)
                 fl <- sortBam(untreated3_chr4(), tempfile(),
byQname=TRUE)
                 bf <- BamFile(fl, index=character(0), yieldSize=3,
                 obeyQname=TRUE, asMates=TRUE)
                 x <- readGAlignmentsList(bf)
                 Warning message:
                    In scanBam(bamfile, ..., param = param) :
                      'obeyQname=TRUE' ignored when 'asMates=TRUE'
                    Calls: readGAlignmentsList ... .matesFromBam ->
                 .load_bamcols_from_bamfile -> scanBam -> scanBam

                 I see in the man pages for summarizeOverlaps it has:

                 "In Bioconductor > 2.12 it is not
                 necessary to sort paired-end BAM files by ‘qname’. When
                 counting with ‘summarizeOverlaps’, setting
        ‘singleEnd=FALSE’
                 will trigger paired-end reading and counting."

                 but I don't see how this can respect the specified
        yieldSize,
                 because
                 readGAlignmentsList has to read in as many reads as
        necessary to
                 find
                 the mate.

                 Sorry in advance if I am missing something in the
        documentation!

                 Mike

                 ___________________________________________________
        Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
        <mailto:Bioc-devel@r-project.__org
        <mailto:Bioc-devel@r-project.org>>
                 mailing list
        https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
                 <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>






_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to