Re: [Bioc-devel] coverage as IntegerList

Michael Lawrence Wed, 12 Feb 2014 04:59:03 -0800

On Tue, Feb 11, 2014 at 6:58 PM, Hervé Pagès <hpa...@fhcrc.org> wrote:


> Hi,
>
> Why not. But I don't expect a significant speed up. Here is why:
>
> There are actually 2 algos implemented by coverage(): one called "sort"
> that computes the coverage directly into "Rle space", and one called
> "hash" that computes the coverage into an ordinary integer vector and
> turns this vector into an Rle at the end (this conversion is cheap).
>
> By default coverage() tries to pick the appropriate algo
> automatically: "hash" when the data are dense, "sort" otherwise. The
> criterion used to decide whether the data are dense is a little bit
> naive (and could maybe be improved?): it just compares the number
> of ranges in the input with the length of the coverage vector to
> return. If the number of ranges > 0.25 * length-of-coverage-vector,
> the data are considered to be dense. Clearly this formula is kind of
> arbitrary and I'm sure it could be tweaked a little bit to do a
> better job.
>
> Note that the user can choose the algo to use via the 'method' arg.
> If you know your data are dense, use method="hash". It will be almost
> as fast as if coverage() were returning an IntegerList, except that
> the coverage is turned into an Rle (but only at the end). I would
> expect this final coercion to cost almost nothing compared to the
> computation of the coverage itself. This would need to be confirmed
> by some profiling though.
>
> Anyway maybe there are other benefits of returning an IntegerList:
> smaller memory footprint when the data are dense,
> more beginner-friendly container, maybe slightly faster
> downstream computations (can this be a bottleneck?), others?
>
>
Thanks for the explanation. The things you mention above are largely what
motivated the suggestion. When dealing with e.g. WGS data, the Rles become
cumbersome.
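
For anyone following along, here is a minimal sketch of the two strategies and the density heuristic described in the quoted message. It is written in Python just to show the idea and is not the actual coverage() code. The function names are made up for illustration, and it assumes 1-based, inclusive ranges that lie entirely within the coverage vector.

```python
from itertools import groupby


def pick_method(n_ranges, cov_length):
    """Density heuristic as described in the thread:
    dense ("hash") if number of ranges > 0.25 * coverage length."""
    return "hash" if n_ranges > 0.25 * cov_length else "sort"


def coverage_dense(ranges, cov_length):
    """'hash'-like strategy: compute into an ordinary integer vector."""
    diff = [0] * (cov_length + 1)
    for start, end in ranges:
        diff[start - 1] += 1  # depth goes up at the first covered position
        diff[end] -= 1        # and back down just past the last one
    cov, depth = [], 0
    for d in diff[:cov_length]:
        depth += d
        cov.append(depth)
    return cov


def to_rle(vec):
    """Final conversion of the integer vector into (value, run length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(vec)]


def coverage_sparse(ranges, cov_length):
    """'sort'-like strategy: build the runs directly from sorted events,
    without ever allocating the full-length vector."""
    events = sorted([(s - 1, 1) for s, _ in ranges] +
                    [(e, -1) for _, e in ranges])
    runs, pos, depth = [], 0, 0

    def push(value, length):
        # Merge with the previous run when the depth did not change.
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + length)
        else:
            runs.append((value, length))

    for p, delta in events:
        if p > pos:
            push(depth, p - pos)
            pos = p
        depth += delta
    if pos < cov_length:
        push(depth, cov_length - pos)
    return runs
```

Both paths should give the same runs, e.g. to_rle(coverage_dense(r, n)) == coverage_sparse(r, n) for r = [(1, 4), (3, 6), (9, 10)] and n = 12. The difference is only in the cost: the dense path does work proportional to the length of the vector, the sparse path does work proportional to the number of ranges (plus the sort).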


> H.
>
>
>
> On 02/11/2014 05:06 PM, Michael Lawrence wrote:
>
>> Right, it would be a choice. The compression is not worth it when the data
>> are dense.
>>
>>
>> On Tue, Feb 11, 2014 at 4:18 PM, Kasper Daniel Hansen <
>> kasperdanielhan...@gmail.com> wrote:
>>
>>> Sounds reasonable, _especially_ if you think it is faster.  You're the
>>> expert.  I assume you will allow the user to choose the return value?
>>> Having the option of Rle's is still nice, for some use cases.
>>>
>>>
>>> On Tue, Feb 11, 2014 at 7:12 PM, Michael Lawrence <
>>> lawrence.mich...@gene.com> wrote:
>>>
>>>> Just a thought: support coverage calculation directly to IntegerList.
>>>> Will very often be faster than RleList, especially when limiting to
>>>> regions without long runs of zeros, and with WGS data.
>>>>
>>>> Something to put on the TODO list?
>>>>
>>>> Michael
>>>>
>>>>          [[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> Bioc-devel@r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>>
>>>
>>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpa...@fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>

