Re: [Bioc-devel] could bsseq::data.frame2GRanges be added to GenomicRanges

Tim Triche, Jr. Mon, 07 Oct 2013 11:02:50 -0700

nb.  Somehow I got a typo in there:

aGR <- df2GR(slow)
mm9toHg19 <- import('mm9ToHg19.over.chain')


## was full <- liftOver(slow, mm9toHg19)
full <- liftOver(aGR, mm9toHg19)
length(unlist(full))/length(aGR)
## [1] 0.549
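
For anyone following along without a df2GR() of their own, here is a minimal sketch of that kind of helper, built on the with(df, GRanges(...)) pattern mentioned further down the thread. It is not the bsseq function or my actual df2GR(): the name df2GR_sketch, the ignoreStrand argument, and the column names chr/start/end/strand are assumptions made up for illustration, and the toy data is invented.

```r
library(GenomicRanges)  # provides GRanges(); IRanges() and DataFrame() come along with it

# Hypothetical stand-in for a df2GR()-style helper.
# Assumes columns named chr, start, end (1-based, closed) and optionally strand.
df2GR_sketch <- function(df, ignoreStrand = FALSE) {
  stopifnot(all(c("chr", "start", "end") %in% names(df)))
  strand <- if (!ignoreStrand && "strand" %in% names(df)) {
    as.character(df$strand)
  } else {
    rep("*", nrow(df))
  }
  gr <- GRanges(seqnames = df$chr,
                ranges = IRanges(start = df$start, end = df$end),
                strand = strand)
  # Carry any remaining columns along as metadata columns.
  extra <- setdiff(names(df), c("chr", "start", "end", "strand"))
  if (length(extra) > 0) {
    mcols(gr) <- DataFrame(df[, extra, drop = FALSE])
  }
  gr
}

# Toy single-base records (invented values).
toy <- data.frame(chr = c("chr1", "chr1", "chr2"),
                  start = c(100L, 200L, 300L),
                  end = c(100L, 200L, 300L),
                  score = c(0.1, 0.5, 0.9))
toyGR <- df2GR_sketch(toy)
length(toyGR) == nrow(toy)  # one range per row
```

Note that length() on a data.frame counts columns, so when computing a lifted-over fraction the denominator should be nrow() of the table or length() of the resulting GRanges.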

That should remind me to re-run everything.
In my defense... well, it's slow to re-run that :-/

apologies


On Mon, Oct 7, 2013 at 10:58 AM, Tim Triche, Jr. <tim.tri...@gmail.com> wrote:

> GRs are a great data structure.  But, standard bioinformatic file formats
> (BED, WIG, BAM) don't always fit 1:1 with the "organic" beginnings of some
> projects. The GenomicRanges infrastructure isn't on the radar of every R
> developer, and some useful data can be found in ugly formats.  Wouldn't it
> be handy if users could easily turn those blobs of data into something that
> export() can handle?
>
> To better understand Michael's point of view, and as someone who has seen
> firsthand the nontrivial amount of work required to maintain rtracklayer as
> a high-performance import library, I wrote a few trivial Perl scripts to
> convert some mouse data from assorted wacky tabular formats to standard
> BED6 files. Besides, once I have BEDs, I can Tabix them, which speeds up
> operations.
>
> I noticed that, when I imported the resulting BED files, where I had
> cloned the base position into identical "chromStart" and "chromEnd"
> coordinates, the import.bed() function assumed that I meant for them to be
> UCSC-style, and therefore gave everything negative widths.  (On the bright
> side, this also explained why liftOver wasn't doing anything useful with
> the results)
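
The coordinate arithmetic behind that paragraph can be sketched in base R. This only illustrates the BED convention (0-based, half-open) versus 1-based closed ranges; it is not what import.bed() does internally. The helper names bed_to_1based and width_of and the positions are made up for the example.

```r
# Hypothetical single-base positions on a 1-based scale (toy values).
pos <- c(3000827L, 3001007L, 3001018L)

# BED is 0-based, half-open: a single base p is written as (p - 1, p).
bed_ok <- data.frame(chromStart = pos - 1L, chromEnd = pos)

# Cloning the position into both columns describes an empty interval instead.
bed_cloned <- data.frame(chromStart = pos, chromEnd = pos)

# Converting BED to 1-based, closed ranges shifts chromStart up by one.
bed_to_1based <- function(bed) {
  data.frame(start = bed$chromStart + 1L, end = bed$chromEnd)
}

# Width of a 1-based, closed range.
width_of <- function(r) r$end - r$start + 1L

width_of(bed_to_1based(bed_ok))      # 1 1 1
width_of(bed_to_1based(bed_cloned))  # 0 0 0
```

Under this arithmetic the cloned records come out as empty ranges (start one past end), which would be consistent with liftOver() having nothing to map.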
>
> packageVersion('rtracklayer')
> ## [1] '1.21.12'
>
> wacky <- import('converted.theirSillyFormat.mm9.bed.gz', genome='mm9')
> mm9toHg19 <- import('mm9ToHg19.over.chain')
> empty <- liftOver(wacky, mm9toHg19)
> length(unlist(empty))/length(wacky)
> ## [1] 0
>
> ## cursing ensues
>
> That's not so good.  If I wasn't already aware of the insanity of
> UCSC-style indexing, this could have been a problem in and of itself.  (As
> it was, I fixed it)
>
> slow <- read.table('theirSillyFormat.mm9.txt.gz')
> ## time passes...
>
> aGR <- df2GR(slow)
> mm9toHg19 <- import('mm9ToHg19.over.chain')
> full <- liftOver(slow, mm9toHg19)
> length(unlist(full))/length(slow)
> ## [1] 0.549
>
> ## better late than never
>
> So: turning nonstandard data into standard data with "standard" (grr) UCSC
> assumptions took longer than simply brute-forcing the issue with
> read.table().
>
> After importing the files with the "appropriate" (chrom, base-1, base)
> indexing, I then went to liftOver() a related GRanges.  (Note that the
> related GRanges was from a BED file submitted by guys at the Broad, so any
> wacky formatting wasn't an issue in this case; here I wanted to control for
> any other possible fubars).
>
> foo <- import('an.RRBS.file.mm9.bed.gz', genome='mm9')
> mm9toHg19 <- import('mm9ToHg19.over.chain')
> bar <- liftOver(foo, mm9toHg19)
>
> length(unlist(bar)) / length(foo)
> ## [1] 0.622
>
> Needless to say, this was a hell of a lot faster than importing the
> corresponding file as a table.  However, for the wacky file formats, it was
> more time & trouble to decode all the assumptions prior to liftOver() than
> it would have been to use granges(read.table('wacky.file.format.csv.gz')).
>  Ugly and sad, but still true!
>
> So, in conclusion, sometimes it might just be better to import one of
> those wacky file formats to a data.frame D or what have you, and use
> granges(D).
>
> Or, since I'm one of the people that wrote a df2GR() function, I just use
> that.
>
> --t
>
>
>
>
> On Mon, Oct 7, 2013 at 9:23 AM, Steve Lianoglou
> <lianoglou.st...@gene.com> wrote:
>
>> Hi,
>>
>> +1 from me, too ... I've also had a similar conversion function
>> (data.[frame|table] <--> GRanges) in my toolbelt which I found quite
>> useful over the years.
>>
>> -steve
>>
>> On Sun, Oct 6, 2013 at 5:00 PM, Kasper Daniel Hansen
>> <kasperdanielhan...@gmail.com> wrote:
>> > This is a convenience function, which provably has saved tons of time for
>> > me and others.  I get lots of data from various Excel/CSV files lying
>> > around various places, and these files _always_ have a clear path to a
>> > GRanges.  Perhaps you never have to deal with this kind of data, but we are
>> > a few experienced users who find it extremely handy and would like it to be
>> > in a more centralized place.
>> >
>> >
>> >
>> >
>> >
>> > On Sun, Oct 6, 2013 at 4:26 PM, Michael Lawrence
>> > <lawrence.mich...@gene.com> wrote:
>> >
>> >> I'm still unconvinced that there is an obvious, general path from
>> >> data.frame -> GRanges. It's usually easy enough to just call GRanges(),
>> >> often of the pattern with(df, GRanges(...)). Moreover, it's unusual for me
>> >> to encounter genomic data in data.frames.
>> >>
>> >>
>> >>
>> >>
>> >> On Sun, Oct 6, 2013 at 8:37 AM, Kasper Daniel Hansen <
>> >> kasperdanielhan...@gmail.com> wrote:
>> >>
>> >>> Also, it goes without saying that I am happy to provide a patch for
>> >>> GenomicRanges, and check that it passes R CMD check, to minimize the work
>> >>> of the maintainer.
>> >>>
>> >>> Kasper
>> >>>
>> >>>
>> >>> On Sun, Oct 6, 2013 at 9:28 AM, Kasper Daniel Hansen <
>> >>> kasperdanielhan...@gmail.com> wrote:
>> >>>
>> >>> > bsseq::data.frame2GRanges does the obvious step of converting a data.frame
>> >>> > to GRanges.  It has a couple of bells and whistles where strand can be
>> >>> > ignored and additional columns (apart from genomic location) may be ignored
>> >>> > in the output object.
>> >>> >
>> >>> > I (and now quite a few other people) use this function almost every day.
>> >>> > I have seen other implementations in other packages, suggesting this is
>> >>> > not just something I (we) use.
>> >>> >
>> >>> > I suggest adding this function to GenomicRanges.  I am happy to support
>> >>> > it going forward.
>> >>> >
>> >>> > Using this function we could also add an as(x, "GRanges") method for
>> >>> > x=data.frame, but I still suggest keeping the basic function for the
>> >>> > extended functionality it provides.
>> >>> >
>> >>> > Best,
>> >>> > Kasper
>> >>> >
>> >>>
>> >>>         [[alternative HTML version deleted]]
>> >>>
>> >>> _______________________________________________
>> >>> Bioc-devel@r-project.org mailing list
>> >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> >>>
>> >>
>> >>
>> >
>>
>>
>>
>> --
>> Steve Lianoglou
>> Computational Biologist
>> Bioinformatics and Computational Biology
>> Genentech
>>
>>
>
>
>
> --
> *He that would live in peace and at ease, *
> *Must not speak all he knows, nor judge all he sees.*
>  *
> *
> Benjamin Franklin, Poor Richard's 
> Almanack<http://archive.org/details/poorrichardsalma00franrich>
>




