Hi Martin, thanks for the quick response. The data is certainly shareable.
Here is a link to a bam + bai + sam file that I have been using for
benchmarking: https://www.dropbox.com/s/eat31mnmmco1zoh/example-bam.tar.bz2

There is a method in SAM to elide the reference names from a header, but I
think they are just shunted to another file so I gave up on that track.
Since I only end up aligning to a small fraction of the transcripts, I
might be able to post-process the file, but it would be best to process
as-is.

thanks again, jim



On Tue, May 13, 2014 at 5:16 AM, Martin Morgan <mtmor...@fhcrc.org> wrote:

> Hi James -- I don't think there's anything in existence to make this
> easier, but I'll expose something in the next 24 hours; is your data
> shareable? There might be deeper things to be done for processing this
> small-but-numerous style data.
>
> Martin
>
>
> On 05/12/2014 05:32 PM, James Bullard wrote:
>
>> I've been dealing with some small bam files with millions of reference
>> sequences leading to monster headers. As one might guess, this leads to
>> pretty bad performance when calling scanBam.
>>
>> Right now, it takes a bit (27MB bam file, 16k alignments, 2.5 million
>> reference sequences in the reference fasta file):
>>
>>  scanBam("sim.fasta-L27-ma8-mp6-rfg5_2-rdg3_1.bam")
>>>
>>     user  system elapsed
>> 186.264   0.528 186.934
>>
>> I've traced it down to scanBamHeader and seqinfo-BamFile, the problematic
>> code is in scanBamHeader which processes the entire header when all
>> seqinfo
>> needs is the `targets` portion of the list. Additionally, the
>> order(rankSeqLevels(.)) doesn't scale either. So I've replaced this as
>> well. I've changed the body of seqinfo-BamFile from:
>>
>>      h <- scanBamHeader(x)[["targets"]]
>>      o <- order(rankSeqlevels(names(h)))
>>      Seqinfo(names(h)[o], unname(h)[o])
>>
>> to this:
>>
>>      if (!isOpen(x)) {
>>          open(x)
>>          on.exit(close(x))
>>      }
>>      h <- .Call(.read_bamfile_header, .extptr(x))$targets
>>      Seqinfo(names(h), unname(h))
>>
>> We then get:
>>
>>  scanBam("sim.fasta-L27-ma8-mp6-rfg5_2-rdg3_1.bam")
>>>
>>     user  system elapsed
>>   14.780   0.360  15.158
>>
>> Which is still pretty slow for how small these files are, but I can
>> probably live with that.
>>
>> Two questions:
>> -- do we need the: order(rankSeqlevels(names(h))) bit? that does change
>> the
>> return value, but I can certainly live with the ordering in the file.
>> -- what else am I missing?
>>
>> I can send a patch if need be, but would like to hear from the cognoscenti
>> first if there is a "built-in" way to avoid this overhead.
>>
>> thanks, jim
>>
>>         [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to