Herve, This is probably a naive question, but what usecases are there for creating an object with the wrong seqinfo for its genome?
~G On Fri, Jun 5, 2015 at 11:43 AM, Michael Lawrence <lawrence.mich...@gene.com > wrote: > On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès <hpa...@fredhutch.org> wrote: > > I also think that we're heading towards something like that. > > > > So genome(gr) <- "hg19" would: > > > > (a) Add any missing information to the seqinfo. > > (b) Sort the seqlevels in "canonical" order. > > (c) Change the seqlevels style to UCSC style if they are not. > > > > The 3 tasks are orthogonal. I guess most of the times people want > > an easy way to perform them all at once. > > > > We could easily support (a) and (b). This assumes that the current > > seqlevels are already valid hg19 seqlevels: > > > > si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2", "chr6_cox_hap2")) > > gr1 <- GRanges(seqinfo=si1) > > hg19_si <- Seqinfo(genome="hg19") > > > > ## (a): > > seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)] > > seqinfo(gr1) > > # Seqinfo object with 4 sequences (1 circular) from hg19 genome: > > # seqnames seqlengths isCircular genome > > # chrX 155270560 FALSE hg19 > > # chrUn_gl000249 38502 FALSE hg19 > > # chr2 243199373 FALSE hg19 > > # chr6_cox_hap2 4795371 FALSE hg19 > > > > ## (b): > > seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1)) > > seqinfo(gr1) > > # Seqinfo object with 4 sequences (1 circular) from hg19 genome: > > # seqnames seqlengths isCircular genome > > # chr2 243199373 FALSE hg19 > > # chrX 155270560 FALSE hg19 > > # chr6_cox_hap2 4795371 FALSE hg19 > > # chrUn_gl000249 38502 FALSE hg19 > > > > (c) is harder because seqlevelsStyle() doesn't know how to rename > > scaffolds yet: > > > > si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2", > "HSCHR6_MHC_COX_CTG1")) > > gr2 <- GRanges(seqinfo=si2) > > > > seqlevelsStyle(gr2) > > # [1] "NCBI" > > > > seqlevelsStyle(gr2) <- "UCSC" > > seqlevels(gr2) > > # [1] "chrX" "HSCHRUN_RANDOM_CTG42" "chr2" > > # [4] "HSCHR6_MHC_COX_CTG1" > > > > So we need to work on this. > > > > I'm not sure about using genome(gr) <- "hg19" for this. Right now > > it sets the genome column of the seqinfo with the supplied string > > and nothing else. Aren't there valid use cases for this? > > Not sure. People would almost always want the seqname style and order > to be consistent with the given genome. > > > What about > > using seqinfo(gr) <- "hg19" instead? It kind of suggests that the > > whole seqinfo component actually gets filled. > > > > Yea, but "genome" is so intuitive compared to "seqinfo". > > > > > H. > > > > On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote: > >> > >> that's kind of always been my goal... > >> > >> > >> Statistics is the grammar of science. > >> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science> > >> > >> On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence > >> <lawrence.mich...@gene.com <mailto:lawrence.mich...@gene.com>> wrote: > >> > >> Maybe this could eventually support setting the seqinfo with: > >> > >> genome(gr) <- "hg19" > >> > >> Or is that being too clever? > >> > >> On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès <hpa...@fredhutch.org > >> <mailto:hpa...@fredhutch.org>> wrote: > >> > Hi, > >> > > >> > FWIW I started to work on supporting quick generation of a > >> standalone > >> > Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb. > >> > > >> > It already supports hg38, hg19, hg18, panTro4, panTro3, panTro2, > >> > bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1, > >> mm10, > >> > mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4, > >> galGal3, > >> > gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, > sacCer3, > >> > and sacCer2. I'll add more. > >> > > >> > See ?Seqinfo for some examples. > >> > > >> > Right now it fetches the information from internet every time you > >> > call it but maybe we should just store that information in the > >> > GenomeInfoDb package as Tim suggested? > >> > > >> > H. > >> > > >> > > >> > On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote: > >> >> > >> >> That would be perfect actually. And it would radically reduce & > >> >> modularize maintenance. Maybe that's the best way to go after > >> all. Quite > >> >> sensible. > >> >> > >> >> --t > >> >> > >> >>> On Jun 3, 2015, at 12:46 PM, Vincent Carey > >> <st...@channing.harvard.edu <mailto:st...@channing.harvard.edu>> > >> >>> wrote: > >> >>> > >> >>> It really isn't hard to have multiple OrganismDb packages in > >> place -- the > >> >>> process of making new ones is documented and was given as an > >> exercise in > >> >>> the EdX course. I don't know if we want to institutionalize it > >> and > >> >>> distribute such -- I think we might, so that there would be > >> Hs19, Hs38, > >> >>> mm9, etc. packages. They have very little content, they just > >> coordinate > >> >>> interactions with packages that you'll already have. > >> >>> > >> >>> On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr. > >> <tim.tri...@gmail.com <mailto:tim.tri...@gmail.com>> > >> > >> >>> wrote: > >> >>> > >> >>>> Right, I typically do that too, and if you're working on human > >> data it > >> >>>> isn't a big deal. What makes things a lot more of a drag is > >> when you > >> >>>> work > >> >>>> on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38) where > >> >>>> Mus.musculus > >> >>>> is essentially a "build ahead" of Homo.sapiens. > >> >>>> > >> >>>> R> seqinfo(Homo.sapiens) > >> >>>> Seqinfo object with 93 sequences (1 circular) from hg19 genome > >> >>>> > >> >>>> R> seqinfo(Mus.musculus) > >> >>>> Seqinfo object with 66 sequences (1 circular) from mm10 > genome: > >> >>>> > >> >>>> It's not as explicit as directly assigning the seqinfo from a > >> genome > >> >>>> that > >> >>>> corresponds to that of your annotations/results/whatever. I > >> know we > >> >>>> could > >> >>>> all use crossmap or liftOver or whatever, but that's not > >> really the > >> >>>> same, > >> >>>> and it takes time, whereas assigning the proper seqinfo for > >> >>>> relationships > >> >>>> is very fast. > >> >>>> > >> >>>> That's all I was getting at... > >> >>>> > >> >>>> > >> >>>> Statistics is the grammar of science. > >> >>>> Karl Pearson > >> <http://en.wikipedia.org/wiki/The_Grammar_of_Science> > >> >>>> > >> >>>> On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey > >> >>>> <st...@channing.harvard.edu <mailto: > st...@channing.harvard.edu> > >> >>>>> > >> >>>>> wrote: > >> >>>> > >> >>>> > >> >>>>> I typically get this info from Homo.sapiens. The result is > >> parasitic > >> >>>>> on > >> >>>>> the TxDb that is in there. I don't know how easy it is to > swap > >> >>>>> alternate > >> >>>>> TxDb in to get a different build. I think it would make > sense > >> to > >> >>>>> regard > >> >>>>> the OrganismDb instances as foundational for this sort of > >> structural > >> >>>>> data. > >> >>>>> > >> >>>>> On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen < > >> >>>>> kasperdanielhan...@gmail.com > >> <mailto:kasperdanielhan...@gmail.com>> wrote: > >> >>>>> > >> >>>>>> Let me rephrase this slightly. From one POV the purpose of > >> >>>>>> GenomeInfoDb > >> >>>>>> is > >> >>>>>> clean up the seqinfo slot. Currently it does most of the > >> cleaning, > >> >>>>>> but > >> >>>>>> it > >> >>>>>> does not add seqlengths. > >> >>>>>> > >> >>>>>> It is clear that seqlengths depends on the version of the > >> genome, but > >> >>>>>> I > >> >>>>>> will argue so does the seqnames. Of course, for human, > >> chr22 will not > >> >>>>>> change. But what about the names of all the random > >> contigs? Or for > >> >>>>>> other > >> >>>>>> organisms, what about going from a draft genome with 10k > >> contigs to a > >> >>>>>> more > >> >>>>>> completely genome assembled into fewer, larger chromosomes. > >> >>>>>> > >> >>>>>> I acknowledge that this information is present in the > BSgenome > >> >>>>>> packages, > >> >>>>>> but it seems (to me) to be very appropriate to have them > >> around for > >> >>>>>> cleaning up the seqinfo slot. For some situations it is not > >> great to > >> >>>>>> depend on 1 GB> download for something that is a few bytes. > >> >>>>>> > >> >>>>>> Best, > >> >>>>>> Kasper > >> >>>>>> > >> >>>>>> On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr. > >> <tim.tri...@gmail.com <mailto:tim.tri...@gmail.com>> > >> > >> >>>>>> wrote: > >> >>>>>> > >> >>>>>>> It would be nice (for a number of reasons) to have > >> chromosome lengths > >> >>>>>>> readily available in a foundational package like > >> GenomeInfoDb, so > >> >>>>>>> that, > >> >>>>>>> say, > >> >>>>>>> > >> >>>>>>> data(seqinfo.hg19) > >> >>>>>>> seqinfo(myResults) <- seqinfo.hg19[ seqlevels(myResults) ] > >> >>>>>>> > >> >>>>>>> would work without issues. Is there any particular reason > >> this > >> >>>>>> > >> >>>>>> couldn't > >> >>>>>>> > >> >>>>>>> happen for the supported/available BSgenomes? It would > >> seem like a > >> >>>>>> > >> >>>>>> simple > >> >>>>>>> > >> >>>>>>> matter to do > >> >>>>>>> > >> >>>>>>> R> library(BSgenome.Hsapiens.UCSC.hg19) > >> >>>>>>> R> seqinfo.hg19 <- seqinfo(Hsapiens) > >> >>>>>>> R> save(seqinfo.hg19, > >> >>>>>>> file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda") > >> >>>>>>> > >> >>>>>>> and be done with it until (say) the next release or next > >> released > >> >>>>>>> BSgenome. I considered looping through the following > >> BSgenomes > >> >>>>>> > >> >>>>>> myself... > >> >>>>>>> > >> >>>>>>> and if it isn't strongly opposed by (everyone) I may still > >> do exactly > >> >>>>>>> that. Seems useful, no? > >> >>>>>>> > >> >>>>>>> e.g. for the following 42 builds, > >> >>>>>>> > >> >>>>>>> grep("(UCSC|NCBI)", unique(gsub(".masked", "", > >> available.genomes())), > >> >>>>>>> value=TRUE) > >> >>>>>>> [1] "BSgenome.Amellifera.UCSC.apiMel2" > >> >>>>>> > >> >>>>>> "BSgenome.Btaurus.UCSC.bosTau3" > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> [3] "BSgenome.Btaurus.UCSC.bosTau4" > >> >>>>>> > >> >>>>>> "BSgenome.Btaurus.UCSC.bosTau6" > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> [5] "BSgenome.Btaurus.UCSC.bosTau8" > >> >>>>>>> "BSgenome.Celegans.UCSC.ce10" > >> >>>>>>> > >> >>>>>>> [7] "BSgenome.Celegans.UCSC.ce2" > >> "BSgenome.Celegans.UCSC.ce6" > >> >>>>>>> > >> >>>>>>> [9] "BSgenome.Cfamiliaris.UCSC.canFam2" > >> >>>>>>> "BSgenome.Cfamiliaris.UCSC.canFam3" > >> >>>>>>> [11] "BSgenome.Dmelanogaster.UCSC.dm2" > >> >>>>>>> "BSgenome.Dmelanogaster.UCSC.dm3" > >> >>>>>>> [13] "BSgenome.Dmelanogaster.UCSC.dm6" > >> >>>>>> > >> >>>>>> "BSgenome.Drerio.UCSC.danRer5" > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> [15] "BSgenome.Drerio.UCSC.danRer6" > >> >>>>>> > >> >>>>>> "BSgenome.Drerio.UCSC.danRer7" > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> [17] "BSgenome.Ecoli.NCBI.20080805" > >> >>>>>>> "BSgenome.Gaculeatus.UCSC.gasAcu1" > >> >>>>>>> [19] "BSgenome.Ggallus.UCSC.galGal3" > >> >>>>>> > >> >>>>>> "BSgenome.Ggallus.UCSC.galGal4" > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> [21] "BSgenome.Hsapiens.NCBI.GRCh38" > >> >>>>>>> "BSgenome.Hsapiens.UCSC.hg17" > >> >>>>>>> > >> >>>>>>> [23] "BSgenome.Hsapiens.UCSC.hg18" > >> >>>>>>> "BSgenome.Hsapiens.UCSC.hg19" > >> >>>>>>> > >> >>>>>>> [25] "BSgenome.Hsapiens.UCSC.hg38" > >> >>>>>>> "BSgenome.Mfascicularis.NCBI.5.0" > >> >>>>>>> [27] "BSgenome.Mfuro.UCSC.musFur1" > >> >>>>>> > >> >>>>>> "BSgenome.Mmulatta.UCSC.rheMac2" > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> [29] "BSgenome.Mmulatta.UCSC.rheMac3" > >> >>>>>> > >> >>>>>> "BSgenome.Mmusculus.UCSC.mm10" > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> [31] "BSgenome.Mmusculus.UCSC.mm8" > >> >>>>>>> "BSgenome.Mmusculus.UCSC.mm9" > >> >>>>>>> > >> >>>>>>> [33] "BSgenome.Ptroglodytes.UCSC.panTro2" > >> >>>>>>> "BSgenome.Ptroglodytes.UCSC.panTro3" > >> >>>>>>> [35] "BSgenome.Rnorvegicus.UCSC.rn4" > >> >>>>>> > >> >>>>>> "BSgenome.Rnorvegicus.UCSC.rn5" > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> [37] "BSgenome.Rnorvegicus.UCSC.rn6" > >> >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer1" > >> >>>>>>> [39] "BSgenome.Scerevisiae.UCSC.sacCer2" > >> >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer3" > >> >>>>>>> [41] "BSgenome.Sscrofa.UCSC.susScr3" > >> >>>>>> > >> >>>>>> "BSgenome.Tguttata.UCSC.taeGut1" > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> Am I insane for suggesting this? It would make things a > >> little > >> >>>>>>> easier > >> >>>>>> > >> >>>>>> for > >> >>>>>>> > >> >>>>>>> rtracklayer, most SummarizedExperiment and SE-derived > >> objects, blah, > >> >>>>>> > >> >>>>>> blah, > >> >>>>>>> > >> >>>>>>> blah... > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> Best, > >> >>>>>>> > >> >>>>>>> --t > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> Statistics is the grammar of science. > >> >>>>>>> Karl Pearson > >> <http://en.wikipedia.org/wiki/The_Grammar_of_Science> > >> >>>>>> > >> >>>>>> > >> >>>>>> [[alternative HTML version deleted]] > >> >>>>>> > >> >>>>>> _______________________________________________ > >> >>>>>> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> > >> mailing list > >> >>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel > >> >>> > >> >>> > >> >>> [[alternative HTML version deleted]] > >> >>> > >> >>> _______________________________________________ > >> >>> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> > >> mailing list > >> >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel > >> >> > >> >> > >> >> _______________________________________________ > >> >> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> > >> mailing list > >> >> https://stat.ethz.ch/mailman/listinfo/bioc-devel > >> >> > >> > > >> > -- > >> > Hervé Pagès > >> > > >> > Program in Computational Biology > >> > Division of Public Health Sciences > >> > Fred Hutchinson Cancer Research Center > >> > 1100 Fairview Ave. N, M1-B514 > >> > P.O. Box 19024 > >> > Seattle, WA 98109-1024 > >> > > >> > E-mail: hpa...@fredhutch.org <mailto:hpa...@fredhutch.org> > >> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > >> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > >> > > >> > > >> > _______________________________________________ > >> > Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> > >> mailing list > >> > https://stat.ethz.ch/mailman/listinfo/bioc-devel > >> > >> > > > > -- > > Hervé Pagès > > > > Program in Computational Biology > > Division of Public Health Sciences > > Fred Hutchinson Cancer Research Center > > 1100 Fairview Ave. N, M1-B514 > > P.O. Box 19024 > > Seattle, WA 98109-1024 > > > > E-mail: hpa...@fredhutch.org > > Phone: (206) 667-5791 > > Fax: (206) 667-1319 > > _______________________________________________ > Bioc-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel > -- Gabriel Becker, Ph.D Computational Biologist Genentech Research [[alternative HTML version deleted]] _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel