On Tue, Feb 11, 2014 at 6:58 PM, Hervé Pagès <hpa...@fhcrc.org> wrote:
> Hi,
>
> Why not. But I don't expect a significant speed up. Here is why:
>
> There are actually 2 algos implemented by coverage(): one called "sort"
> that computes the coverage directly into "Rle space", and one called
> "hash" that computes the coverage into an ordinary integer vector and
> turns this vector into an Rle at the end (this conversion is cheap).
>
> By default coverage() tries to automatically pick up the appropriate
> algo: "hash" when the data are dense, "sort" otherwise. The criteria
> used to decide whether the data are dense or not is a little bit
> naive (and could maybe be improved?): it just compares the number
> of ranges in the input with the length of the coverage vector to
> return. If nb of ranges > 0.25 * length-of-coverage-vector, the data
> is considered to be dense. Clearly this formula is kind of arbitrary
> and I'm sure it could be tweaked a little bit to do a better job.
>
> Note that the user can choose the algo to use via the 'method' arg.
> If you know your data are dense, use method="hash". It will be almost
> as fast as if coverage() was returning an IntegerList, except that
> the coverage is turned into an Rle (but only at the end). I would
> expect this final coercion to be nothing compared to the computation
> of the coverage itself. This would need to be confirmed by some
> profiling though.
>
> Anyway maybe there are other benefits of returning an IntegerList:
> smaller memory footprint when the data are dense,
> more beginner-friendly container, maybe slightly faster
> downstream computations (can this be a bottleneck?), others?
>

Thanks for the explanation. The things you mention above are largely what
motivated the suggestion. When dealing with e.g. WGS data, the Rles become
cumbersome.

> H.
>
> On 02/11/2014 05:06 PM, Michael Lawrence wrote:
>
>> Right, it would be a choice. The compression is not worth it when the data
>> are dense.
>>
>> On Tue, Feb 11, 2014 at 4:18 PM, Kasper Daniel Hansen <
>> kasperdanielhan...@gmail.com> wrote:
>>
>>> Sounds reasonable, _especially_ if you think it is faster. You're the
>>> expert. I assume you will allow the user to choose the return value?
>>> Having the option of Rle's is still nice, for some use cases.
>>>
>>> On Tue, Feb 11, 2014 at 7:12 PM, Michael Lawrence <
>>> lawrence.mich...@gene.com> wrote:
>>>
>>>> Just a thought: support coverage calculation directly to IntegerList.
>>>> Will very often be faster than RleList, especially when limiting to
>>>> regions without long runs of zeros, and with WGS data.
>>>>
>>>> Something to put on the TODO list?
>>>>
>>>> Michael
>>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpa...@fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
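
For readers following the thread, here is a rough, language-neutral sketch (in Python, not R) of the two strategies and the density rule described in Hervé's message quoted above. It is not the IRanges implementation: the function names, the "auto" value, the 1-based start/width convention, and the use of a list of (value, run_length) pairs as a stand-in for an Rle are all assumptions made for illustration. Only the names "sort" and "hash" and the 0.25 threshold come from the message itself.

```python
from itertools import groupby


def _clip(start, width, length):
    """Clip a 1-based range to [1, length]; return None if nothing is left."""
    lo, hi = max(start, 1), min(start + width - 1, length)
    return (lo, hi) if lo <= hi else None


def coverage_hash(starts, widths, length):
    """Dense strategy (illustrative): fill an ordinary integer vector."""
    delta = [0] * (length + 1)
    for s, w in zip(starts, widths):
        clipped = _clip(s, w, length)
        if clipped is None:
            continue
        lo, hi = clipped
        delta[lo - 1] += 1  # depth rises at position lo
        delta[hi] -= 1      # and falls just after position hi
    cov, depth = [], 0
    for d in delta[:length]:
        depth += d
        cov.append(depth)
    return cov


def to_rle(vec):
    """Compress a vector into (value, run_length) pairs, a stand-in for an Rle."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(vec)]


def _append_run(runs, value, run_length):
    """Append a run, merging it with the previous run when the value repeats."""
    if runs and runs[-1][0] == value:
        runs[-1] = (value, runs[-1][1] + run_length)
    else:
        runs.append((value, run_length))


def coverage_sort(starts, widths, length):
    """Sparse strategy (illustrative): sort range boundaries, emit runs directly."""
    events = []
    for s, w in zip(starts, widths):
        clipped = _clip(s, w, length)
        if clipped is None:
            continue
        lo, hi = clipped
        events.append((lo, 1))
        events.append((hi + 1, -1))
    events.sort()
    runs, pos, depth, i = [], 1, 0, 0
    while i < len(events):
        p = events[i][0]
        if p > pos:
            _append_run(runs, depth, p - pos)
            pos = p
        # Apply every boundary that falls on position p before moving on.
        while i < len(events) and events[i][0] == p:
            depth += events[i][1]
            i += 1
    if pos <= length:
        _append_run(runs, depth, length - pos + 1)
    return runs


def pick_method(n_ranges, length):
    """Density rule from the message: dense if n_ranges > 0.25 * length."""
    return "hash" if n_ranges > 0.25 * length else "sort"


def coverage(starts, widths, length, method="auto"):
    """Return coverage as (value, run_length) pairs using the chosen strategy."""
    if method == "auto":
        method = pick_method(len(starts), length)
    if method == "hash":
        return to_rle(coverage_hash(starts, widths, length))
    if method == "sort":
        return coverage_sort(starts, widths, length)
    raise ValueError("method must be 'auto', 'sort' or 'hash'")


if __name__ == "__main__":
    # Ranges 1-4, 3-4 and 3-7 on a sequence of length 10.
    print(coverage([1, 3, 3], [4, 2, 5], 10))
    # prints [(1, 2), (3, 2), (1, 3), (0, 3)]
```

Both paths give the same runs; they differ only in where the work happens. The dense path touches every position once, so its cost grows with the length of the coverage vector, while the sorting path grows with the number of ranges. That is presumably why comparing the two counts is the natural (if crude) switch. Returning the plain vector from coverage_hash() without the final to_rle() step corresponds to the IntegerList-style result proposed in this thread.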
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel