hi Valerie,

Thanks. I'm trying now to make use of the new mate pairing algorithm but
keeping running out of memory (i requested 50 Gb) and getting my job
killed. I wonder if you could try this code/example below?

If the new C code is faster for paired-end than the pre-sorting/obeyQname
method that would be great (eliminates the need to have the extra
qname-sorted Bam), but it seems to me that with the old method, it was
easier to specify hard limits on memory. Maybe I am missing something
though. :)

Here's an RNA-Seq sample from Encode, and then I run samtools index locally.

from:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/

I download the hg19 paired end reads:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam

library("GenomicFeatures")
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
grl <- exonsBy(txdb, by="gene")
library("Rsamtools")
bamFile <- "wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam"
bf <- BamFile(bamFile, yieldSize=1000000, asMates=TRUE)
library("GenomicAlignments")
system.time({so <- summarizeOverlaps(grl, bf,
                                     ignore.strand=TRUE,
                                     singleEnd=FALSE)})



> sessionInfo()
R Under development (unstable) (2014-03-18 r65213)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.14.0
 [2] GenomicFeatures_1.15.11
 [3] AnnotationDbi_1.25.15
 [4] Biobase_2.23.6
 [5] GenomicAlignments_0.99.32
 [6] BSgenome_1.31.12
 [7] Rsamtools_1.15.33
 [8] Biostrings_2.31.14
 [9] XVector_0.3.7
[10] GenomicRanges_1.15.39
[11] GenomeInfoDb_0.99.19
[12] IRanges_1.21.34
[13] BiocGenerics_0.9.3
[14] Defaults_1.1-1
[15] devtools_1.4.1
[16] knitr_1.5
[17] BiocInstaller_1.13.3

loaded via a namespace (and not attached):
 [1] BatchJobs_1.2       BBmisc_1.5          BiocParallel_0.5.17
 [4] biomaRt_2.19.3      bitops_1.0-6        brew_1.0-6
 [7] codetools_0.2-8     DBI_0.2-7           digest_0.6.4
[10] evaluate_0.5.1      fail_1.2            foreach_1.4.1
[13] formatR_0.10        httr_0.2            iterators_1.0.6
[16] memoise_0.1         plyr_1.8.1          Rcpp_0.11.1
[19] RCurl_1.95-4.1      RSQLite_0.11.4      rtracklayer_1.23.18
[22] sendmailR_1.1-2     stats4_3.2.0        stringr_0.6.2
[25] tools_3.2.0         whisker_0.3-2       XML_3.98-1.1
[28] zlibbioc_1.9.0
>





On Wed, Mar 19, 2014 at 2:00 PM, Valerie Obenchain <voben...@fhcrc.org>wrote:

> On 03/19/14 10:24, Michael Love wrote:
>
>> hi Valerie,
>>
>> If the Bam is not sorted by name, isn't it possible that readGAlignment*
>> will load > yieldSize number of reads in order to find the mate?
>>
>
> Sorry, our emails keep criss-crossing.
>
> Because the mate-pairing is now done in C yieldSize is no longer a
> constraint.
>
> When yieldSize is given, say 100, then 100 mates are returned. The algo
> moves through the file until 100 records are successfully paired. These 100
> are yielded to the user and the code picks up where it left off. A related
> situation is the 'which' in a param. In this case you want the mates in a
> particular range. The algo moves through the range and pairs what it can.
> If it's left with unmated records it goes outside the range looking for
> them. readGAlignmentsList() will return all records in this range, mated or
> not. The metadata column of 'mate_status' indicates the different groupings.
>
> Valerie
>
>
>
>> Mike
>>
>>
>> On Wed, Mar 19, 2014 at 1:04 PM, Valerie Obenchain <voben...@fhcrc.org
>> <mailto:voben...@fhcrc.org>> wrote:
>>
>>     Hi Mike,
>>
>>     You no longer need to sort Bam files to use the pairing algo or
>>     yieldSize. The readGAlignment* functions now work with both
>>     constraints out of the box.
>>
>>     Create a BamFile with yieldSize and indicate you want mates.
>>     bf <- BamFile(fl, yieldSize=10000, asMates=TRUE)
>>
>>     Maybe set some specifications in a param:
>>     param <- ScanBamParam(what = c("qname", "flag"))
>>
>>     Then call either readGAlignment* method that handles pairs:
>>     readGAlignmentsList(bf, param=param)
>>     readGAlignmentPairs(bf, param=param)
>>
>>     For summarizeOverlaps():
>>     summarizeOverlaps(annotation, bf, param=param, singleEnd=FALSE)
>>
>>     We've considered removing the 'obeyQname' arg and documentation but
>>     thought the concept may be useful in another application. I'll
>>     revisit the summarizeOverlaps() documentation to make sure
>>     'obeyQname' is downplayed and 'asMates' is encouraged.
>>
>>     Valerie
>>
>>
>>
>>
>>     On 03/19/14 07:39, Michael Love wrote:
>>
>>         hi,
>>
>>              From last year, in order to use yieldSize with paired-end
>>             BAMs, I
>>
>>         should sort the BAMs by qname and then use the following call to
>>         BamFile:
>>
>>         library(pasillaBamSubset)
>>         fl <- sortBam(untreated3_chr4(), tempfile(), byQname=TRUE)
>>         bf <- BamFile(fl, index=character(0), yieldSize=3, obeyQname=TRUE)
>>
>>         https://stat.ethz.ch/__pipermail/bioconductor/2013-__
>> March/051490.html
>>         <https://stat.ethz.ch/pipermail/bioconductor/2013-
>> March/051490.html>
>>
>>         If I want to use GenomicAlignments::__readGAlignmentsList with
>>
>>         asMates=TRUE and respecting the yieldSize, what is the proper
>>         construction? (in the end, I want to use summarizeOverlaps on
>>         paired-end BAMs while respecting the yieldSize)
>>
>>         library(pasillaBamSubset)
>>         fl <- sortBam(untreated3_chr4(), tempfile(), byQname=TRUE)
>>         bf <- BamFile(fl, index=character(0), yieldSize=3,
>>         obeyQname=TRUE, asMates=TRUE)
>>         x <- readGAlignmentsList(bf)
>>         Warning message:
>>            In scanBam(bamfile, ..., param = param) :
>>              'obeyQname=TRUE' ignored when 'asMates=TRUE'
>>            Calls: readGAlignmentsList ... .matesFromBam ->
>>         .load_bamcols_from_bamfile -> scanBam -> scanBam
>>
>>         I see in the man pages for summarizeOverlaps it has:
>>
>>         "In Bioconductor > 2.12 it is not
>>         necessary to sort paired-end BAM files by ‘qname’. When
>>         counting with ‘summarizeOverlaps’, setting ‘singleEnd=FALSE’
>>         will trigger paired-end reading and counting."
>>
>>         but I don't see how this can respect the specified yieldSize,
>>         because
>>         readGAlignmentsList has to read in as many reads as necessary to
>>         find
>>         the mate.
>>
>>         Sorry in advance if I am missing something in the documentation!
>>
>>         Mike
>>
>>         _________________________________________________
>>         Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
>>         mailing list
>>         https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>         <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>
>>
>>
>>
>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to