GRs are a great data structure. But, standard bioinformatic file formats (BED, WIG, BAM) don't always fit 1:1 with the "organic" beginnings of some projects. The GenomicRanges infrastructure isn't on the radar of every R developer, and some useful data can be found in ugly formats. Wouldn't it be handy if users could easily turn those blobs of data into something that export() can handle?
To better understand Michael's point of view, and as someone who has seen firsthand the nontrivial amount of work required to maintain rtracklayer as a high-performance import library, I wrote a few trivial Perl scripts to convert some mouse data from assorted wacky tabular formats to standard BED6 files. Besides, once I have BEDs, I can Tabix them, which speeds up operations. I noticed that, when I imported the resulting BED files, where I had cloned the base position into identical "chromStart" and "chromEnd" coordinates, the import.bed() function assumed that I meant for them to be UCSC-style, and therefore gave everything negative widths. (On the bright side, this also explained why liftOver wasn't doing anything useful with the results) packageVersion('rtracklayer') ## [1] '1.21.12' wacky <- import('converted.theirSillyFormat.mm9.bed.gz', genome='mm9') mm9toHg19 <- import('mm9ToHg19.over.chain') empty <- liftOver(wacky, mm9toHg19) length(unlist(empty))/length(wacky) ## [1] 0 ## cursing ensues That's not so good. If I wasn't already aware of the insanity of UCSC-style indexing, this could have been a problem in and of itself. (As it was, I fixed it) slow <- read.table('theirSillyFormat.mm9.txt.gz') ## time passes... aGR <- df2GR(slow) mm9toHg19 <- import('mm9ToHg19.over.chain') full <- liftOver(slow, mm9toHg19) length(unlist(full))/length(slow) ## [1] 0.549 ## better late than never So: turning nonstandard data into standard data with "standard" (grr) UCSC assumptions took longer than simply brute-forcing the issue with read.table(). After importing the files with the "appropriate" (chrom, base-1, base) indexing, I then went to liftOver() a related GRanges. (Note that the related GRanges was from a BED file submitted by guys at the Broad, so any wacky formatting wasn't an issue in this case; here I wanted to control for any other possible fubars). foo <- import('an.RRBS.file.mm9.bed.gz', genome='mm9') mm9toHg19 <- import('mm9ToHg19.over.chain') bar <- liftOver(foo, mm9toHg19) length(unlist(bar)) / length(foo) ## [1] 0.622 Needless to say, this was a hell of a lot faster than importing the corresponding file as a table. However, for the wacky file formats, it was more time & trouble to decode all the assumptions prior to liftOver() than it would have been to use granges(read.table('wacky.file.format.csv.gz')). Ugly and sad, but still true! So, in conclusion, sometimes it might just be better to import one of those wacky file formats to a data.frame D or what have you, and use granges(D). Or, since I'm one of the people that wrote a df2GR() function, I just use that. --t On Mon, Oct 7, 2013 at 9:23 AM, Steve Lianoglou <lianoglou.st...@gene.com>wrote: > Hi, > > +1 from me, too ... I've also had a similar conversion function > (data.[frame|table] <--> GRanges) in my toolbelt which I found quite > useful over the years. > > -steve > > On Sun, Oct 6, 2013 at 5:00 PM, Kasper Daniel Hansen > <kasperdanielhan...@gmail.com> wrote: > > This is a convenience function, which provably has saved tons of time for > > me and others. I get lots of data from various excel/cvs files lying > > around various places, and these files _always_ have a clear path to a > > GRanges. Perhaps you never have to deal with this kind of data, but we > are > > a few experienced users who find it extremely handy and would like it to > be > > in a more centralized place. > > > > > > > > > > > > On Sun, Oct 6, 2013 at 4:26 PM, Michael Lawrence > > <lawrence.mich...@gene.com>wrote: > > > >> I'm still unconvinced that there is an obvious, general path from > >> data.frame -> GRanges. It's usually easy enough to just call GRanges(), > >> often of the pattern with(df, GRanges(...)). Moreover, it's unusual for > me > >> to encounter genomic data in data.frames. > >> > >> > >> > >> > >> On Sun, Oct 6, 2013 at 8:37 AM, Kasper Daniel Hansen < > >> kasperdanielhan...@gmail.com> wrote: > >> > >>> Also, it goes without saying that I am happy to provide a patch for > >>> GenomicRanges, and check that it passes R CMD check, to minimize the > work > >>> of the maintainer. > >>> > >>> Kasper > >>> > >>> > >>> On Sun, Oct 6, 2013 at 9:28 AM, Kasper Daniel Hansen < > >>> kasperdanielhan...@gmail.com> wrote: > >>> > >>> > bsseq::data.frame2GRanges does the obvious step of converting a > >>> data.frame > >>> > to GRanges. It has a couple of bells and whistles where strand can > be > >>> > ignored and additional columns (apart from genomic location) may be > >>> ignore > >>> > in the output object. > >>> > > >>> > I (and now quite a few other people) use this function almost every > day. > >>> > I have seen other implementations in other packages, suggesting > this is > >>> > not just something I (we) use. > >>> > > >>> > I suggests adding this function to GenomicRanges. I am happy to > support > >>> > it going forward. > >>> > > >>> > Using this function we could also add an as(x, "GRanges") method for > >>> > x=data.frame, but I still suggest keeping the basic function for the > >>> > extended functionality it provides. > >>> > > >>> > Best, > >>> > Kasper > >>> > > >>> > >>> [[alternative HTML version deleted]] > >>> > >>> _______________________________________________ > >>> Bioc-devel@r-project.org mailing list > >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel > >>> > >> > >> > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioc-devel@r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/bioc-devel > > > > -- > Steve Lianoglou > Computational Biologist > Bioinformatics and Computational Biology > Genentech > > _______________________________________________ > Bioc-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel > -- *He that would live in peace and at ease, * *Must not speak all he knows, nor judge all he sees.* * * Benjamin Franklin, Poor Richard's Almanack<http://archive.org/details/poorrichardsalma00franrich> [[alternative HTML version deleted]] _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel