Re: [Bioc-devel] GenomicAlignments: using asMates=TRUE and yieldSize with paired-end BAM files

Michael Love Fri, 21 Mar 2014 13:56:45 -0700

hi Valerie,

Thanks. I'm trying now to make use of the new mate pairing algorithm but
keeping running out of memory (i requested 50 Gb) and getting my job
killed. I wonder if you could try this code/example below?


If the new C code is faster for paired-end than the pre-sorting/obeyQname
method that would be great (eliminates the need to have the extra
qname-sorted Bam), but it seems to me that with the old method, it was
easier to specify hard limits on memory. Maybe I am missing something
though. :)

Here's an RNA-Seq sample from Encode, and then I run samtools index locally.

from:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/

I download the hg19 paired end reads:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam

library("GenomicFeatures")
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
grl <- exonsBy(txdb, by="gene")
library("Rsamtools")
bamFile <- "wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam"
bf <- BamFile(bamFile, yieldSize=1000000, asMates=TRUE)
library("GenomicAlignments")
system.time({so <- summarizeOverlaps(grl, bf,
                                     ignore.strand=TRUE,
                                     singleEnd=FALSE)})



> sessionInfo()
R Under development (unstable) (2014-03-18 r65213)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.14.0
 [2] GenomicFeatures_1.15.11
 [3] AnnotationDbi_1.25.15
 [4] Biobase_2.23.6
 [5] GenomicAlignments_0.99.32
 [6] BSgenome_1.31.12
 [7] Rsamtools_1.15.33
 [8] Biostrings_2.31.14
 [9] XVector_0.3.7
[10] GenomicRanges_1.15.39
[11] GenomeInfoDb_0.99.19
[12] IRanges_1.21.34
[13] BiocGenerics_0.9.3
[14] Defaults_1.1-1
[15] devtools_1.4.1
[16] knitr_1.5
[17] BiocInstaller_1.13.3

loaded via a namespace (and not attached):
 [1] BatchJobs_1.2       BBmisc_1.5          BiocParallel_0.5.17
 [4] biomaRt_2.19.3      bitops_1.0-6        brew_1.0-6
 [7] codetools_0.2-8     DBI_0.2-7           digest_0.6.4
[10] evaluate_0.5.1      fail_1.2            foreach_1.4.1
[13] formatR_0.10        httr_0.2            iterators_1.0.6
[16] memoise_0.1         plyr_1.8.1          Rcpp_0.11.1
[19] RCurl_1.95-4.1      RSQLite_0.11.4      rtracklayer_1.23.18
[22] sendmailR_1.1-2     stats4_3.2.0        stringr_0.6.2
[25] tools_3.2.0         whisker_0.3-2       XML_3.98-1.1
[28] zlibbioc_1.9.0
>





On Wed, Mar 19, 2014 at 2:00 PM, Valerie Obenchain <voben...@fhcrc.org>wrote:

> On 03/19/14 10:24, Michael Love wrote:
>
>> hi Valerie,
>>
>> If the Bam is not sorted by name, isn't it possible that readGAlignment*
>> will load > yieldSize number of reads in order to find the mate?
>>
>
> Sorry, our emails keep criss-crossing.
>
> Because the mate-pairing is now done in C yieldSize is no longer a
> constraint.
>
> When yieldSize is given, say 100, then 100 mates are returned. The algo
> moves through the file until 100 records are successfully paired. These 100
> are yielded to the user and the code picks up where it left off. A related
> situation is the 'which' in a param. In this case you want the mates in a
> particular range. The algo moves through the range and pairs what it can.
> If it's left with unmated records it goes outside the range looking for
> them. readGAlignmentsList() will return all records in this range, mated or
> not. The metadata column of 'mate_status' indicates the different groupings.
>
> Valerie
>
>
>
>> Mike
>>
>>
>> On Wed, Mar 19, 2014 at 1:04 PM, Valerie Obenchain <voben...@fhcrc.org
>> <mailto:voben...@fhcrc.org>> wrote:
>>
>>     Hi Mike,
>>
>>     You no longer need to sort Bam files to use the pairing algo or
>>     yieldSize. The readGAlignment* functions now work with both
>>     constraints out of the box.
>>
>>     Create a BamFile with yieldSize and indicate you want mates.
>>     bf <- BamFile(fl, yieldSize=10000, asMates=TRUE)
>>
>>     Maybe set some specifications in a param:
>>     param <- ScanBamParam(what = c("qname", "flag"))
>>
>>     Then call either readGAlignment* method that handles pairs:
>>     readGAlignmentsList(bf, param=param)
>>     readGAlignmentPairs(bf, param=param)
>>
>>     For summarizeOverlaps():
>>     summarizeOverlaps(annotation, bf, param=param, singleEnd=FALSE)
>>
>>     We've considered removing the 'obeyQname' arg and documentation but
>>     thought the concept may be useful in another application. I'll
>>     revisit the summarizeOverlaps() documentation to make sure
>>     'obeyQname' is downplayed and 'asMates' is encouraged.
>>
>>     Valerie
>>
>>
>>
>>
>>     On 03/19/14 07:39, Michael Love wrote:
>>
>>         hi,
>>
>>              From last year, in order to use yieldSize with paired-end
>>             BAMs, I
>>
>>         should sort the BAMs by qname and then use the following call to
>>         BamFile:
>>
>>         library(pasillaBamSubset)
>>         fl <- sortBam(untreated3_chr4(), tempfile(), byQname=TRUE)
>>         bf <- BamFile(fl, index=character(0), yieldSize=3, obeyQname=TRUE)
>>
>>         https://stat.ethz.ch/__pipermail/bioconductor/2013-__
>> March/051490.html
>>         <https://stat.ethz.ch/pipermail/bioconductor/2013-
>> March/051490.html>
>>
>>         If I want to use GenomicAlignments::__readGAlignmentsList with
>>
>>         asMates=TRUE and respecting the yieldSize, what is the proper
>>         construction? (in the end, I want to use summarizeOverlaps on
>>         paired-end BAMs while respecting the yieldSize)
>>
>>         library(pasillaBamSubset)
>>         fl <- sortBam(untreated3_chr4(), tempfile(), byQname=TRUE)
>>         bf <- BamFile(fl, index=character(0), yieldSize=3,
>>         obeyQname=TRUE, asMates=TRUE)
>>         x <- readGAlignmentsList(bf)
>>         Warning message:
>>            In scanBam(bamfile, ..., param = param) :
>>              'obeyQname=TRUE' ignored when 'asMates=TRUE'
>>            Calls: readGAlignmentsList ... .matesFromBam ->
>>         .load_bamcols_from_bamfile -> scanBam -> scanBam
>>
>>         I see in the man pages for summarizeOverlaps it has:
>>
>>         "In Bioconductor > 2.12 it is not
>>         necessary to sort paired-end BAM files by âqnameâ. When
>>         counting with âsummarizeOverlapsâ, setting âsingleEnd=FALSEâ
>>         will trigger paired-end reading and counting."
>>
>>         but I don't see how this can respect the specified yieldSize,
>>         because
>>         readGAlignmentsList has to read in as many reads as necessary to
>>         find
>>         the mate.
>>
>>         Sorry in advance if I am missing something in the documentation!
>>
>>         Mike
>>
>>         _________________________________________________
>>         Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
>>         mailing list
>>         https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>         <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>
>>
>>
>>
>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] GenomicAlignments: using asMates=TRUE and yieldSize with paired-end BAM files

Reply via email to