Hi Martin, thanks for the quick response. The data is certainly shareable. Here is a link to a bam + bai + sam file that I have been using for benchmarking: https://www.dropbox.com/s/eat31mnmmco1zoh/example-bam.tar.bz2
There is a method in SAM to elide the reference names from a header, but I think they are just shunted to another file so I gave up on that track. Since I only end up aligning to a small fraction of the transcripts, I might be able to post-process the file, but it would be best to process as-is. thanks again, jim On Tue, May 13, 2014 at 5:16 AM, Martin Morgan <mtmor...@fhcrc.org> wrote: > Hi James -- I don't think there's anything in existence to make this > easier, but I'll expose something in the next 24 hours; is your data > shareable? There might be deeper things to be done for processing this > small-but-numerous style data. > > Martin > > > On 05/12/2014 05:32 PM, James Bullard wrote: > >> I've been dealing with some small bam files with millions of reference >> sequences leading to monster headers. As one might guess, this leads to >> pretty bad performance when calling scanBam. >> >> Right now, it takes a bit (27MB bam file, 16k alignments, 2.5 million >> reference sequences in the reference fasta file): >> >> scanBam("sim.fasta-L27-ma8-mp6-rfg5_2-rdg3_1.bam") >>> >> user system elapsed >> 186.264 0.528 186.934 >> >> I've traced it down to scanBamHeader and seqinfo-BamFile, the problematic >> code is in scanBamHeader which processes the entire header when all >> seqinfo >> needs is the `targets` portion of the list. Additionally, the >> order(rankSeqLevels(.)) doesn't scale either. So I've replaced this as >> well. I've changed the body of seqinfo-BamFile from: >> >> h <- scanBamHeader(x)[["targets"]] >> o <- order(rankSeqlevels(names(h))) >> Seqinfo(names(h)[o], unname(h)[o]) >> >> to this: >> >> if (!isOpen(x)) { >> open(x) >> on.exit(close(x)) >> } >> h <- .Call(.read_bamfile_header, .extptr(x))$targets >> Seqinfo(names(h), unname(h)) >> >> We then get: >> >> scanBam("sim.fasta-L27-ma8-mp6-rfg5_2-rdg3_1.bam") >>> >> user system elapsed >> 14.780 0.360 15.158 >> >> Which is still pretty slow for how small these files are, but I can >> probably live with that. >> >> Two questions: >> -- do we need the: order(rankSeqlevels(names(h))) bit? that does change >> the >> return value, but I can certainly live with the ordering in the file. >> -- what else am I missing? >> >> I can send a patch if need be, but would like to hear from the cognoscenti >> first if there is a "built-in" way to avoid this overhead. >> >> thanks, jim >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioc-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/bioc-devel >> >> > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 > [[alternative HTML version deleted]] _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel