Hi Martin,

Thanks for the fix, I'll check it out.

jim


On Tue, May 20, 2014 at 12:35 PM, Martin Morgan <mtmor...@fhcrc.org> wrote:

> On 05/13/2014 08:17 AM, James Bullard wrote:
>
>> Hi Martin, thanks for the quick response. The data is certainly
>> shareable. Here
>> is a link to a bam + bai + sam file that I have been using for
>> benchmarking:
>> https://www.dropbox.com/s/eat31mnmmco1zoh/example-bam.tar.bz2
>>
>> There is a method in SAM to elide the reference names from a header, but
>> I think
>> they are just shunted to another file so I gave up on that track. Since I
>> only
>> end up aligning to a small fraction of the transcripts, I might be able to
>> post-process the file, but it would be best to process as-is.
>>
>
> Hi Jim -- I updated the seqinfo,BamFile-method to do more work in C, and
> for scanBamHeader to optionally parse only the targets|text part of the
> header. I also reverted a change to seqinfo,BamFile-method, introduced in
> Rsamtools version 1.15.28, to try to place seqlevels into 'natural' order;
> now they are returned in the order they appear in the file.
>
> Together these should make for much faster code, for your sim.bam about
> 3.5 (vs 185) seconds for seqinfo, and ~7s for scanBam.
>
> This is in Rsamtools version 1.17.16, which is in svn now but won't make
> it to biocLite until tomorrow, all being well...
>
> Martin
>
>
>> thanks again, jim
>>
>>
>>
>> On Tue, May 13, 2014 at 5:16 AM, Martin Morgan <mtmor...@fhcrc.org
>> <mailto:mtmor...@fhcrc.org>> wrote:
>>
>>     Hi James -- I don't think there's anything in existence to make this
>> easier,
>>     but I'll expose something in the next 24 hours; is your data
>> shareable?
>>     There might be deeper things to be done for processing this
>>     small-but-numerous style data.
>>
>>     Martin
>>
>>
>>     On 05/12/2014 05:32 PM, James Bullard wrote:
>>
>>         I've been dealing with some small bam files with millions of
>> reference
>>         sequences leading to monster headers. As one might guess, this
>> leads to
>>         pretty bad performance when calling scanBam.
>>
>>         Right now, it takes a bit (27MB bam file, 16k alignments, 2.5
>> million
>>         reference sequences in the reference fasta file):
>>
>>             scanBam("sim.fasta-L27-ma8-__mp6-rfg5_2-rdg3_1.bam")
>>
>>
>>              user  system elapsed
>>         186.264   0.528 186.934
>>
>>         I've traced it down to scanBamHeader and seqinfo-BamFile, the
>> problematic
>>         code is in scanBamHeader which processes the entire header when
>> all seqinfo
>>         needs is the `targets` portion of the list. Additionally, the
>>         order(rankSeqLevels(.)) doesn't scale either. So I've replaced
>> this as
>>         well. I've changed the body of seqinfo-BamFile from:
>>
>>               h <- scanBamHeader(x)[["targets"]]
>>               o <- order(rankSeqlevels(names(h)))
>>               Seqinfo(names(h)[o], unname(h)[o])
>>
>>         to this:
>>
>>               if (!isOpen(x)) {
>>                   open(x)
>>                   on.exit(close(x))
>>               }
>>               h <- .Call(.read_bamfile_header, .extptr(x))$targets
>>               Seqinfo(names(h), unname(h))
>>
>>         We then get:
>>
>>             scanBam("sim.fasta-L27-ma8-__mp6-rfg5_2-rdg3_1.bam")
>>
>>
>>              user  system elapsed
>>            14.780   0.360  15.158
>>
>>         Which is still pretty slow for how small these files are, but I
>> can
>>         probably live with that.
>>
>>         Two questions:
>>         -- do we need the: order(rankSeqlevels(names(h))) bit? that does
>> change the
>>         return value, but I can certainly live with the ordering in the
>> file.
>>         -- what else am I missing?
>>
>>         I can send a patch if need be, but would like to hear from the
>> cognoscenti
>>         first if there is a "built-in" way to avoid this overhead.
>>
>>         thanks, jim
>>
>>                  [[alternative HTML version deleted]]
>>
>>         _________________________________________________
>>         Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
>> mailing list
>>         https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>
>>         <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>
>>
>>
>>     --
>>     Computational Biology / Fred Hutchinson Cancer Research Center
>>     1100 Fairview Ave. N.
>>     PO Box 19024 Seattle, WA 98109
>>
>>     Location: Arnold Building M1 B861
>>     Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
>>
>>
>>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to