hi Valerie, Thanks. I'm trying now to make use of the new mate pairing algorithm but keeping running out of memory (i requested 50 Gb) and getting my job killed. I wonder if you could try this code/example below?
If the new C code is faster for paired-end than the pre-sorting/obeyQname method that would be great (eliminates the need to have the extra qname-sorted Bam), but it seems to me that with the old method, it was easier to specify hard limits on memory. Maybe I am missing something though. :) Here's an RNA-Seq sample from Encode, and then I run samtools index locally. from: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/ I download the hg19 paired end reads: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam library("GenomicFeatures") library("TxDb.Hsapiens.UCSC.hg19.knownGene") txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene grl <- exonsBy(txdb, by="gene") library("Rsamtools") bamFile <- "wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam" bf <- BamFile(bamFile, yieldSize=1000000, asMates=TRUE) library("GenomicAlignments") system.time({so <- summarizeOverlaps(grl, bf, ignore.strand=TRUE, singleEnd=FALSE)}) > sessionInfo() R Under development (unstable) (2014-03-18 r65213) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.14.0 [2] GenomicFeatures_1.15.11 [3] AnnotationDbi_1.25.15 [4] Biobase_2.23.6 [5] GenomicAlignments_0.99.32 [6] BSgenome_1.31.12 [7] Rsamtools_1.15.33 [8] Biostrings_2.31.14 [9] XVector_0.3.7 [10] GenomicRanges_1.15.39 [11] GenomeInfoDb_0.99.19 [12] IRanges_1.21.34 [13] BiocGenerics_0.9.3 [14] Defaults_1.1-1 [15] devtools_1.4.1 [16] knitr_1.5 [17] BiocInstaller_1.13.3 loaded via a namespace (and not attached): [1] BatchJobs_1.2 BBmisc_1.5 BiocParallel_0.5.17 [4] biomaRt_2.19.3 bitops_1.0-6 brew_1.0-6 [7] codetools_0.2-8 DBI_0.2-7 digest_0.6.4 [10] evaluate_0.5.1 fail_1.2 foreach_1.4.1 [13] formatR_0.10 httr_0.2 iterators_1.0.6 [16] memoise_0.1 plyr_1.8.1 Rcpp_0.11.1 [19] RCurl_1.95-4.1 RSQLite_0.11.4 rtracklayer_1.23.18 [22] sendmailR_1.1-2 stats4_3.2.0 stringr_0.6.2 [25] tools_3.2.0 whisker_0.3-2 XML_3.98-1.1 [28] zlibbioc_1.9.0 > On Wed, Mar 19, 2014 at 2:00 PM, Valerie Obenchain <voben...@fhcrc.org>wrote: > On 03/19/14 10:24, Michael Love wrote: > >> hi Valerie, >> >> If the Bam is not sorted by name, isn't it possible that readGAlignment* >> will load > yieldSize number of reads in order to find the mate? >> > > Sorry, our emails keep criss-crossing. > > Because the mate-pairing is now done in C yieldSize is no longer a > constraint. > > When yieldSize is given, say 100, then 100 mates are returned. The algo > moves through the file until 100 records are successfully paired. These 100 > are yielded to the user and the code picks up where it left off. A related > situation is the 'which' in a param. In this case you want the mates in a > particular range. The algo moves through the range and pairs what it can. > If it's left with unmated records it goes outside the range looking for > them. readGAlignmentsList() will return all records in this range, mated or > not. The metadata column of 'mate_status' indicates the different groupings. > > Valerie > > > >> Mike >> >> >> On Wed, Mar 19, 2014 at 1:04 PM, Valerie Obenchain <voben...@fhcrc.org >> <mailto:voben...@fhcrc.org>> wrote: >> >> Hi Mike, >> >> You no longer need to sort Bam files to use the pairing algo or >> yieldSize. The readGAlignment* functions now work with both >> constraints out of the box. >> >> Create a BamFile with yieldSize and indicate you want mates. >> bf <- BamFile(fl, yieldSize=10000, asMates=TRUE) >> >> Maybe set some specifications in a param: >> param <- ScanBamParam(what = c("qname", "flag")) >> >> Then call either readGAlignment* method that handles pairs: >> readGAlignmentsList(bf, param=param) >> readGAlignmentPairs(bf, param=param) >> >> For summarizeOverlaps(): >> summarizeOverlaps(annotation, bf, param=param, singleEnd=FALSE) >> >> We've considered removing the 'obeyQname' arg and documentation but >> thought the concept may be useful in another application. I'll >> revisit the summarizeOverlaps() documentation to make sure >> 'obeyQname' is downplayed and 'asMates' is encouraged. >> >> Valerie >> >> >> >> >> On 03/19/14 07:39, Michael Love wrote: >> >> hi, >> >> From last year, in order to use yieldSize with paired-end >> BAMs, I >> >> should sort the BAMs by qname and then use the following call to >> BamFile: >> >> library(pasillaBamSubset) >> fl <- sortBam(untreated3_chr4(), tempfile(), byQname=TRUE) >> bf <- BamFile(fl, index=character(0), yieldSize=3, obeyQname=TRUE) >> >> https://stat.ethz.ch/__pipermail/bioconductor/2013-__ >> March/051490.html >> <https://stat.ethz.ch/pipermail/bioconductor/2013- >> March/051490.html> >> >> If I want to use GenomicAlignments::__readGAlignmentsList with >> >> asMates=TRUE and respecting the yieldSize, what is the proper >> construction? (in the end, I want to use summarizeOverlaps on >> paired-end BAMs while respecting the yieldSize) >> >> library(pasillaBamSubset) >> fl <- sortBam(untreated3_chr4(), tempfile(), byQname=TRUE) >> bf <- BamFile(fl, index=character(0), yieldSize=3, >> obeyQname=TRUE, asMates=TRUE) >> x <- readGAlignmentsList(bf) >> Warning message: >> In scanBam(bamfile, ..., param = param) : >> 'obeyQname=TRUE' ignored when 'asMates=TRUE' >> Calls: readGAlignmentsList ... .matesFromBam -> >> .load_bamcols_from_bamfile -> scanBam -> scanBam >> >> I see in the man pages for summarizeOverlaps it has: >> >> "In Bioconductor > 2.12 it is not >> necessary to sort paired-end BAM files by âqnameâ. When >> counting with âsummarizeOverlapsâ, setting âsingleEnd=FALSEâ >> will trigger paired-end reading and counting." >> >> but I don't see how this can respect the specified yieldSize, >> because >> readGAlignmentsList has to read in as many reads as necessary to >> find >> the mate. >> >> Sorry in advance if I am missing something in the documentation! >> >> Mike >> >> _________________________________________________ >> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> >> mailing list >> https://stat.ethz.ch/mailman/__listinfo/bioc-devel >> <https://stat.ethz.ch/mailman/listinfo/bioc-devel> >> >> >> >> > [[alternative HTML version deleted]]
_______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel