nb. Somehow I got a typo in there: aGR <- df2GR(slow) mm9toHg19 <- import('mm9ToHg19.over.chain')
## was full <- liftOver(slow, mm9toHg19) full <- liftOver(aGR, mm9toHg19) length(unlist(full))/length(slow) ## [1] 0.549 That should remind me to re-run everything. In my defense... well, it's slow to re-run that :-/ apologies On Mon, Oct 7, 2013 at 10:58 AM, Tim Triche, Jr. <tim.tri...@gmail.com>wrote: > GRs are a great data structure. But, standard bioinformatic file formats > (BED, WIG, BAM) don't always fit 1:1 with the "organic" beginnings of some > projects. The GenomicRanges infrastructure isn't on the radar of every R > developer, and some useful data can be found in ugly formats. Wouldn't it > be handy if users could easily turn those blobs of data into something that > export() can handle? > > To better understand Michael's point of view, and as someone who has seen > firsthand the nontrivial amount of work required to maintain rtracklayer as > a high-performance import library, I wrote a few trivial Perl scripts to > convert some mouse data from assorted wacky tabular formats to standard > BED6 files. Besides, once I have BEDs, I can Tabix them, which speeds up > operations. > > I noticed that, when I imported the resulting BED files, where I had > cloned the base position into identical "chromStart" and "chromEnd" > coordinates, the import.bed() function assumed that I meant for them to be > UCSC-style, and therefore gave everything negative widths. (On the bright > side, this also explained why liftOver wasn't doing anything useful with > the results) > > packageVersion('rtracklayer') > ## [1] '1.21.12' > > wacky <- import('converted.theirSillyFormat.mm9.bed.gz', genome='mm9') > mm9toHg19 <- import('mm9ToHg19.over.chain') > empty <- liftOver(wacky, mm9toHg19) > length(unlist(empty))/length(wacky) > ## [1] 0 > > ## cursing ensues > > That's not so good. If I wasn't already aware of the insanity of > UCSC-style indexing, this could have been a problem in and of itself. (As > it was, I fixed it) > > slow <- read.table('theirSillyFormat.mm9.txt.gz') > ## time passes... > > aGR <- df2GR(slow) > mm9toHg19 <- import('mm9ToHg19.over.chain') > full <- liftOver(slow, mm9toHg19) > length(unlist(full))/length(slow) > ## [1] 0.549 > > ## better late than never > > So: turning nonstandard data into standard data with "standard" (grr) UCSC > assumptions took longer than simply brute-forcing the issue with > read.table(). > > After importing the files with the "appropriate" (chrom, base-1, base) > indexing, I then went to liftOver() a related GRanges. (Note that the > related GRanges was from a BED file submitted by guys at the Broad, so any > wacky formatting wasn't an issue in this case; here I wanted to control for > any other possible fubars). > > foo <- import('an.RRBS.file.mm9.bed.gz', genome='mm9') > mm9toHg19 <- import('mm9ToHg19.over.chain') > bar <- liftOver(foo, mm9toHg19) > > length(unlist(bar)) / length(foo) > ## [1] 0.622 > > Needless to say, this was a hell of a lot faster than importing the > corresponding file as a table. However, for the wacky file formats, it was > more time & trouble to decode all the assumptions prior to liftOver() than > it would have been to use granges(read.table('wacky.file.format.csv.gz')). > Ugly and sad, but still true! > > So, in conclusion, sometimes it might just be better to import one of > those wacky file formats to a data.frame D or what have you, and use > granges(D). > > Or, since I'm one of the people that wrote a df2GR() function, I just use > that. > > --t > > > > > On Mon, Oct 7, 2013 at 9:23 AM, Steve Lianoglou > <lianoglou.st...@gene.com>wrote: > >> Hi, >> >> +1 from me, too ... I've also had a similar conversion function >> (data.[frame|table] <--> GRanges) in my toolbelt which I found quite >> useful over the years. >> >> -steve >> >> On Sun, Oct 6, 2013 at 5:00 PM, Kasper Daniel Hansen >> <kasperdanielhan...@gmail.com> wrote: >> > This is a convenience function, which provably has saved tons of time >> for >> > me and others. I get lots of data from various excel/cvs files lying >> > around various places, and these files _always_ have a clear path to a >> > GRanges. Perhaps you never have to deal with this kind of data, but we >> are >> > a few experienced users who find it extremely handy and would like it >> to be >> > in a more centralized place. >> > >> > >> > >> > >> > >> > On Sun, Oct 6, 2013 at 4:26 PM, Michael Lawrence >> > <lawrence.mich...@gene.com>wrote: >> > >> >> I'm still unconvinced that there is an obvious, general path from >> >> data.frame -> GRanges. It's usually easy enough to just call GRanges(), >> >> often of the pattern with(df, GRanges(...)). Moreover, it's unusual >> for me >> >> to encounter genomic data in data.frames. >> >> >> >> >> >> >> >> >> >> On Sun, Oct 6, 2013 at 8:37 AM, Kasper Daniel Hansen < >> >> kasperdanielhan...@gmail.com> wrote: >> >> >> >>> Also, it goes without saying that I am happy to provide a patch for >> >>> GenomicRanges, and check that it passes R CMD check, to minimize the >> work >> >>> of the maintainer. >> >>> >> >>> Kasper >> >>> >> >>> >> >>> On Sun, Oct 6, 2013 at 9:28 AM, Kasper Daniel Hansen < >> >>> kasperdanielhan...@gmail.com> wrote: >> >>> >> >>> > bsseq::data.frame2GRanges does the obvious step of converting a >> >>> data.frame >> >>> > to GRanges. It has a couple of bells and whistles where strand can >> be >> >>> > ignored and additional columns (apart from genomic location) may be >> >>> ignore >> >>> > in the output object. >> >>> > >> >>> > I (and now quite a few other people) use this function almost every >> day. >> >>> > I have seen other implementations in other packages, suggesting >> this is >> >>> > not just something I (we) use. >> >>> > >> >>> > I suggests adding this function to GenomicRanges. I am happy to >> support >> >>> > it going forward. >> >>> > >> >>> > Using this function we could also add an as(x, "GRanges") method for >> >>> > x=data.frame, but I still suggest keeping the basic function for the >> >>> > extended functionality it provides. >> >>> > >> >>> > Best, >> >>> > Kasper >> >>> > >> >>> >> >>> [[alternative HTML version deleted]] >> >>> >> >>> _______________________________________________ >> >>> Bioc-devel@r-project.org mailing list >> >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel >> >>> >> >> >> >> >> > >> > [[alternative HTML version deleted]] >> > >> > _______________________________________________ >> > Bioc-devel@r-project.org mailing list >> > https://stat.ethz.ch/mailman/listinfo/bioc-devel >> >> >> >> -- >> Steve Lianoglou >> Computational Biologist >> Bioinformatics and Computational Biology >> Genentech >> >> _______________________________________________ >> Bioc-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/bioc-devel >> > > > > -- > *He that would live in peace and at ease, * > *Must not speak all he knows, nor judge all he sees.* > * > * > Benjamin Franklin, Poor Richard's > Almanack<http://archive.org/details/poorrichardsalma00franrich> > -- *He that would live in peace and at ease, * *Must not speak all he knows, nor judge all he sees.* * * Benjamin Franklin, Poor Richard's Almanack<http://archive.org/details/poorrichardsalma00franrich> [[alternative HTML version deleted]] _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel