Hi James -- I don't think there's anything in existence to make this easier, but
I'll expose something in the next 24 hours; is your data shareable? There might
be deeper things to be done for processing this small-but-numerous style data.
Martin
On 05/12/2014 05:32 PM, James Bullard wrote:
I've been dealing with some small bam files with millions of reference
sequences leading to monster headers. As one might guess, this leads to
pretty bad performance when calling scanBam.
Right now, it takes a bit (27MB bam file, 16k alignments, 2.5 million
reference sequences in the reference fasta file):
scanBam("sim.fasta-L27-ma8-mp6-rfg5_2-rdg3_1.bam")
user system elapsed
186.264 0.528 186.934
I've traced it down to scanBamHeader and seqinfo-BamFile, the problematic
code is in scanBamHeader which processes the entire header when all seqinfo
needs is the `targets` portion of the list. Additionally, the
order(rankSeqLevels(.)) doesn't scale either. So I've replaced this as
well. I've changed the body of seqinfo-BamFile from:
h <- scanBamHeader(x)[["targets"]]
o <- order(rankSeqlevels(names(h)))
Seqinfo(names(h)[o], unname(h)[o])
to this:
if (!isOpen(x)) {
open(x)
on.exit(close(x))
}
h <- .Call(.read_bamfile_header, .extptr(x))$targets
Seqinfo(names(h), unname(h))
We then get:
scanBam("sim.fasta-L27-ma8-mp6-rfg5_2-rdg3_1.bam")
user system elapsed
14.780 0.360 15.158
Which is still pretty slow for how small these files are, but I can
probably live with that.
Two questions:
-- do we need the: order(rankSeqlevels(names(h))) bit? that does change the
return value, but I can certainly live with the ordering in the file.
-- what else am I missing?
I can send a patch if need be, but would like to hear from the cognoscenti
first if there is a "built-in" way to avoid this overhead.
thanks, jim
[[alternative HTML version deleted]]
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel