I've been dealing with some small bam files with millions of reference sequences leading to monster headers. As one might guess, this leads to pretty bad performance when calling scanBam.
Right now, it takes a bit (27MB bam file, 16k alignments, 2.5 million reference sequences in the reference fasta file): > scanBam("sim.fasta-L27-ma8-mp6-rfg5_2-rdg3_1.bam") user system elapsed 186.264 0.528 186.934 I've traced it down to scanBamHeader and seqinfo-BamFile, the problematic code is in scanBamHeader which processes the entire header when all seqinfo needs is the `targets` portion of the list. Additionally, the order(rankSeqLevels(.)) doesn't scale either. So I've replaced this as well. I've changed the body of seqinfo-BamFile from: h <- scanBamHeader(x)[["targets"]] o <- order(rankSeqlevels(names(h))) Seqinfo(names(h)[o], unname(h)[o]) to this: if (!isOpen(x)) { open(x) on.exit(close(x)) } h <- .Call(.read_bamfile_header, .extptr(x))$targets Seqinfo(names(h), unname(h)) We then get: > scanBam("sim.fasta-L27-ma8-mp6-rfg5_2-rdg3_1.bam") user system elapsed 14.780 0.360 15.158 Which is still pretty slow for how small these files are, but I can probably live with that. Two questions: -- do we need the: order(rankSeqlevels(names(h))) bit? that does change the return value, but I can certainly live with the ordering in the file. -- what else am I missing? I can send a patch if need be, but would like to hear from the cognoscenti first if there is a "built-in" way to avoid this overhead. thanks, jim [[alternative HTML version deleted]] _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel