Hi Martin, Thanks for the fix, I'll check it out.
jim On Tue, May 20, 2014 at 12:35 PM, Martin Morgan <mtmor...@fhcrc.org> wrote: > On 05/13/2014 08:17 AM, James Bullard wrote: > >> Hi Martin, thanks for the quick response. The data is certainly >> shareable. Here >> is a link to a bam + bai + sam file that I have been using for >> benchmarking: >> https://www.dropbox.com/s/eat31mnmmco1zoh/example-bam.tar.bz2 >> >> There is a method in SAM to elide the reference names from a header, but >> I think >> they are just shunted to another file so I gave up on that track. Since I >> only >> end up aligning to a small fraction of the transcripts, I might be able to >> post-process the file, but it would be best to process as-is. >> > > Hi Jim -- I updated the seqinfo,BamFile-method to do more work in C, and > for scanBamHeader to optionally parse only the targets|text part of the > header. I also reverted a change to seqinfo,BamFile-method, introduced in > Rsamtools version 1.15.28, to try to place seqlevels into 'natural' order; > now they are returned in the order they appear in the file. > > Together these should make for much faster code, for your sim.bam about > 3.5 (vs 185) seconds for seqinfo, and ~7s for scanBam. > > This is in Rsamtools version 1.17.16, which is in svn now but won't make > it to biocLite until tomorrow, all being well... > > Martin > > >> thanks again, jim >> >> >> >> On Tue, May 13, 2014 at 5:16 AM, Martin Morgan <mtmor...@fhcrc.org >> <mailto:mtmor...@fhcrc.org>> wrote: >> >> Hi James -- I don't think there's anything in existence to make this >> easier, >> but I'll expose something in the next 24 hours; is your data >> shareable? >> There might be deeper things to be done for processing this >> small-but-numerous style data. >> >> Martin >> >> >> On 05/12/2014 05:32 PM, James Bullard wrote: >> >> I've been dealing with some small bam files with millions of >> reference >> sequences leading to monster headers. As one might guess, this >> leads to >> pretty bad performance when calling scanBam. >> >> Right now, it takes a bit (27MB bam file, 16k alignments, 2.5 >> million >> reference sequences in the reference fasta file): >> >> scanBam("sim.fasta-L27-ma8-__mp6-rfg5_2-rdg3_1.bam") >> >> >> user system elapsed >> 186.264 0.528 186.934 >> >> I've traced it down to scanBamHeader and seqinfo-BamFile, the >> problematic >> code is in scanBamHeader which processes the entire header when >> all seqinfo >> needs is the `targets` portion of the list. Additionally, the >> order(rankSeqLevels(.)) doesn't scale either. So I've replaced >> this as >> well. I've changed the body of seqinfo-BamFile from: >> >> h <- scanBamHeader(x)[["targets"]] >> o <- order(rankSeqlevels(names(h))) >> Seqinfo(names(h)[o], unname(h)[o]) >> >> to this: >> >> if (!isOpen(x)) { >> open(x) >> on.exit(close(x)) >> } >> h <- .Call(.read_bamfile_header, .extptr(x))$targets >> Seqinfo(names(h), unname(h)) >> >> We then get: >> >> scanBam("sim.fasta-L27-ma8-__mp6-rfg5_2-rdg3_1.bam") >> >> >> user system elapsed >> 14.780 0.360 15.158 >> >> Which is still pretty slow for how small these files are, but I >> can >> probably live with that. >> >> Two questions: >> -- do we need the: order(rankSeqlevels(names(h))) bit? that does >> change the >> return value, but I can certainly live with the ordering in the >> file. >> -- what else am I missing? >> >> I can send a patch if need be, but would like to hear from the >> cognoscenti >> first if there is a "built-in" way to avoid this overhead. >> >> thanks, jim >> >> [[alternative HTML version deleted]] >> >> _________________________________________________ >> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> >> mailing list >> https://stat.ethz.ch/mailman/__listinfo/bioc-devel >> >> <https://stat.ethz.ch/mailman/listinfo/bioc-devel> >> >> >> >> -- >> Computational Biology / Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N. >> PO Box 19024 Seattle, WA 98109 >> >> Location: Arnold Building M1 B861 >> Phone: (206) 667-2793 <tel:%28206%29%20667-2793> >> >> >> > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 > [[alternative HTML version deleted]] _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel