That sounds like it calls for an (class-style) inheritence/genome-union model to me. I should probably stop talking now before the people who would have to implement that start throwing things at me, though.
~G On Fri, Jun 5, 2015 at 12:54 PM, Kasper Daniel Hansen < kasperdanielhan...@gmail.com> wrote: > In WGBS we frequently sequence a human with spikein from the lambda > genome. In this case, most of the chromosomes of the Granges are from > human, except one. This is a usecase where genome(GR) is not constant. I > suggest, partly for compatibility, to keep genome, but perhaps do something > like > standardizeGenome() > or something like this. > > I would indeed love, love, love a function which just cleans it up. > > Kasper > > On Fri, Jun 5, 2015 at 2:51 PM, Gabe Becker <becker.g...@gene.com> wrote: > >> Herve, >> >> This is probably a naive question, but what usecases are there for >> creating >> an object with the wrong seqinfo for its genome? >> >> ~G >> >> On Fri, Jun 5, 2015 at 11:43 AM, Michael Lawrence < >> lawrence.mich...@gene.com >> > wrote: >> >> > On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès <hpa...@fredhutch.org> >> wrote: >> > > I also think that we're heading towards something like that. >> > > >> > > So genome(gr) <- "hg19" would: >> > > >> > > (a) Add any missing information to the seqinfo. >> > > (b) Sort the seqlevels in "canonical" order. >> > > (c) Change the seqlevels style to UCSC style if they are not. >> > > >> > > The 3 tasks are orthogonal. I guess most of the times people want >> > > an easy way to perform them all at once. >> > > >> > > We could easily support (a) and (b). This assumes that the current >> > > seqlevels are already valid hg19 seqlevels: >> > > >> > > si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2", "chr6_cox_hap2")) >> > > gr1 <- GRanges(seqinfo=si1) >> > > hg19_si <- Seqinfo(genome="hg19") >> > > >> > > ## (a): >> > > seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)] >> > > seqinfo(gr1) >> > > # Seqinfo object with 4 sequences (1 circular) from hg19 genome: >> > > # seqnames seqlengths isCircular genome >> > > # chrX 155270560 FALSE hg19 >> > > # chrUn_gl000249 38502 FALSE hg19 >> > > # chr2 243199373 FALSE hg19 >> > > # chr6_cox_hap2 4795371 FALSE hg19 >> > > >> > > ## (b): >> > > seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1)) >> > > seqinfo(gr1) >> > > # Seqinfo object with 4 sequences (1 circular) from hg19 genome: >> > > # seqnames seqlengths isCircular genome >> > > # chr2 243199373 FALSE hg19 >> > > # chrX 155270560 FALSE hg19 >> > > # chr6_cox_hap2 4795371 FALSE hg19 >> > > # chrUn_gl000249 38502 FALSE hg19 >> > > >> > > (c) is harder because seqlevelsStyle() doesn't know how to rename >> > > scaffolds yet: >> > > >> > > si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2", >> > "HSCHR6_MHC_COX_CTG1")) >> > > gr2 <- GRanges(seqinfo=si2) >> > > >> > > seqlevelsStyle(gr2) >> > > # [1] "NCBI" >> > > >> > > seqlevelsStyle(gr2) <- "UCSC" >> > > seqlevels(gr2) >> > > # [1] "chrX" "HSCHRUN_RANDOM_CTG42" "chr2" >> > > # [4] "HSCHR6_MHC_COX_CTG1" >> > > >> > > So we need to work on this. >> > > >> > > I'm not sure about using genome(gr) <- "hg19" for this. Right now >> > > it sets the genome column of the seqinfo with the supplied string >> > > and nothing else. Aren't there valid use cases for this? >> > >> > Not sure. People would almost always want the seqname style and order >> > to be consistent with the given genome. >> > >> > > What about >> > > using seqinfo(gr) <- "hg19" instead? It kind of suggests that the >> > > whole seqinfo component actually gets filled. >> > > >> > >> > Yea, but "genome" is so intuitive compared to "seqinfo". >> > >> > >> > >> > > H. >> > > >> > > On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote: >> > >> >> > >> that's kind of always been my goal... >> > >> >> > >> >> > >> Statistics is the grammar of science. >> > >> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science> >> > >> >> > >> On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence >> > >> <lawrence.mich...@gene.com <mailto:lawrence.mich...@gene.com>> >> wrote: >> > >> >> > >> Maybe this could eventually support setting the seqinfo with: >> > >> >> > >> genome(gr) <- "hg19" >> > >> >> > >> Or is that being too clever? >> > >> >> > >> On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès < >> hpa...@fredhutch.org >> > >> <mailto:hpa...@fredhutch.org>> wrote: >> > >> > Hi, >> > >> > >> > >> > FWIW I started to work on supporting quick generation of a >> > >> standalone >> > >> > Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb. >> > >> > >> > >> > It already supports hg38, hg19, hg18, panTro4, panTro3, >> panTro2, >> > >> > bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1, >> > >> mm10, >> > >> > mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4, >> > >> galGal3, >> > >> > gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, >> > sacCer3, >> > >> > and sacCer2. I'll add more. >> > >> > >> > >> > See ?Seqinfo for some examples. >> > >> > >> > >> > Right now it fetches the information from internet every time >> you >> > >> > call it but maybe we should just store that information in the >> > >> > GenomeInfoDb package as Tim suggested? >> > >> > >> > >> > H. >> > >> > >> > >> > >> > >> > On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote: >> > >> >> >> > >> >> That would be perfect actually. And it would radically >> reduce & >> > >> >> modularize maintenance. Maybe that's the best way to go >> after >> > >> all. Quite >> > >> >> sensible. >> > >> >> >> > >> >> --t >> > >> >> >> > >> >>> On Jun 3, 2015, at 12:46 PM, Vincent Carey >> > >> <st...@channing.harvard.edu <mailto:st...@channing.harvard.edu>> >> > >> >>> wrote: >> > >> >>> >> > >> >>> It really isn't hard to have multiple OrganismDb packages in >> > >> place -- the >> > >> >>> process of making new ones is documented and was given as an >> > >> exercise in >> > >> >>> the EdX course. I don't know if we want to >> institutionalize it >> > >> and >> > >> >>> distribute such -- I think we might, so that there would be >> > >> Hs19, Hs38, >> > >> >>> mm9, etc. packages. They have very little content, they >> just >> > >> coordinate >> > >> >>> interactions with packages that you'll already have. >> > >> >>> >> > >> >>> On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr. >> > >> <tim.tri...@gmail.com <mailto:tim.tri...@gmail.com>> >> > >> >> > >> >>> wrote: >> > >> >>> >> > >> >>>> Right, I typically do that too, and if you're working on >> human >> > >> data it >> > >> >>>> isn't a big deal. What makes things a lot more of a drag >> is >> > >> when you >> > >> >>>> work >> > >> >>>> on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38) >> where >> > >> >>>> Mus.musculus >> > >> >>>> is essentially a "build ahead" of Homo.sapiens. >> > >> >>>> >> > >> >>>> R> seqinfo(Homo.sapiens) >> > >> >>>> Seqinfo object with 93 sequences (1 circular) from hg19 >> genome >> > >> >>>> >> > >> >>>> R> seqinfo(Mus.musculus) >> > >> >>>> Seqinfo object with 66 sequences (1 circular) from mm10 >> > genome: >> > >> >>>> >> > >> >>>> It's not as explicit as directly assigning the seqinfo >> from a >> > >> genome >> > >> >>>> that >> > >> >>>> corresponds to that of your annotations/results/whatever. >> I >> > >> know we >> > >> >>>> could >> > >> >>>> all use crossmap or liftOver or whatever, but that's not >> > >> really the >> > >> >>>> same, >> > >> >>>> and it takes time, whereas assigning the proper seqinfo for >> > >> >>>> relationships >> > >> >>>> is very fast. >> > >> >>>> >> > >> >>>> That's all I was getting at... >> > >> >>>> >> > >> >>>> >> > >> >>>> Statistics is the grammar of science. >> > >> >>>> Karl Pearson >> > >> <http://en.wikipedia.org/wiki/The_Grammar_of_Science> >> > >> >>>> >> > >> >>>> On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey >> > >> >>>> <st...@channing.harvard.edu <mailto: >> > st...@channing.harvard.edu> >> > >> >>>>> >> > >> >>>>> wrote: >> > >> >>>> >> > >> >>>> >> > >> >>>>> I typically get this info from Homo.sapiens. The result >> is >> > >> parasitic >> > >> >>>>> on >> > >> >>>>> the TxDb that is in there. I don't know how easy it is to >> > swap >> > >> >>>>> alternate >> > >> >>>>> TxDb in to get a different build. I think it would make >> > sense >> > >> to >> > >> >>>>> regard >> > >> >>>>> the OrganismDb instances as foundational for this sort of >> > >> structural >> > >> >>>>> data. >> > >> >>>>> >> > >> >>>>> On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen < >> > >> >>>>> kasperdanielhan...@gmail.com >> > >> <mailto:kasperdanielhan...@gmail.com>> wrote: >> > >> >>>>> >> > >> >>>>>> Let me rephrase this slightly. From one POV the purpose >> of >> > >> >>>>>> GenomeInfoDb >> > >> >>>>>> is >> > >> >>>>>> clean up the seqinfo slot. Currently it does most of the >> > >> cleaning, >> > >> >>>>>> but >> > >> >>>>>> it >> > >> >>>>>> does not add seqlengths. >> > >> >>>>>> >> > >> >>>>>> It is clear that seqlengths depends on the version of the >> > >> genome, but >> > >> >>>>>> I >> > >> >>>>>> will argue so does the seqnames. Of course, for human, >> > >> chr22 will not >> > >> >>>>>> change. But what about the names of all the random >> > >> contigs? Or for >> > >> >>>>>> other >> > >> >>>>>> organisms, what about going from a draft genome with 10k >> > >> contigs to a >> > >> >>>>>> more >> > >> >>>>>> completely genome assembled into fewer, larger >> chromosomes. >> > >> >>>>>> >> > >> >>>>>> I acknowledge that this information is present in the >> > BSgenome >> > >> >>>>>> packages, >> > >> >>>>>> but it seems (to me) to be very appropriate to have them >> > >> around for >> > >> >>>>>> cleaning up the seqinfo slot. For some situations it is >> not >> > >> great to >> > >> >>>>>> depend on 1 GB> download for something that is a few >> bytes. >> > >> >>>>>> >> > >> >>>>>> Best, >> > >> >>>>>> Kasper >> > >> >>>>>> >> > >> >>>>>> On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr. >> > >> <tim.tri...@gmail.com <mailto:tim.tri...@gmail.com>> >> > >> >> > >> >>>>>> wrote: >> > >> >>>>>> >> > >> >>>>>>> It would be nice (for a number of reasons) to have >> > >> chromosome lengths >> > >> >>>>>>> readily available in a foundational package like >> > >> GenomeInfoDb, so >> > >> >>>>>>> that, >> > >> >>>>>>> say, >> > >> >>>>>>> >> > >> >>>>>>> data(seqinfo.hg19) >> > >> >>>>>>> seqinfo(myResults) <- seqinfo.hg19[ >> seqlevels(myResults) ] >> > >> >>>>>>> >> > >> >>>>>>> would work without issues. Is there any particular >> reason >> > >> this >> > >> >>>>>> >> > >> >>>>>> couldn't >> > >> >>>>>>> >> > >> >>>>>>> happen for the supported/available BSgenomes? It would >> > >> seem like a >> > >> >>>>>> >> > >> >>>>>> simple >> > >> >>>>>>> >> > >> >>>>>>> matter to do >> > >> >>>>>>> >> > >> >>>>>>> R> library(BSgenome.Hsapiens.UCSC.hg19) >> > >> >>>>>>> R> seqinfo.hg19 <- seqinfo(Hsapiens) >> > >> >>>>>>> R> save(seqinfo.hg19, >> > >> >>>>>>> file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda") >> > >> >>>>>>> >> > >> >>>>>>> and be done with it until (say) the next release or next >> > >> released >> > >> >>>>>>> BSgenome. I considered looping through the following >> > >> BSgenomes >> > >> >>>>>> >> > >> >>>>>> myself... >> > >> >>>>>>> >> > >> >>>>>>> and if it isn't strongly opposed by (everyone) I may >> still >> > >> do exactly >> > >> >>>>>>> that. Seems useful, no? >> > >> >>>>>>> >> > >> >>>>>>> e.g. for the following 42 builds, >> > >> >>>>>>> >> > >> >>>>>>> grep("(UCSC|NCBI)", unique(gsub(".masked", "", >> > >> available.genomes())), >> > >> >>>>>>> value=TRUE) >> > >> >>>>>>> [1] "BSgenome.Amellifera.UCSC.apiMel2" >> > >> >>>>>> >> > >> >>>>>> "BSgenome.Btaurus.UCSC.bosTau3" >> > >> >>>>>>> >> > >> >>>>>>> >> > >> >>>>>>> [3] "BSgenome.Btaurus.UCSC.bosTau4" >> > >> >>>>>> >> > >> >>>>>> "BSgenome.Btaurus.UCSC.bosTau6" >> > >> >>>>>>> >> > >> >>>>>>> >> > >> >>>>>>> [5] "BSgenome.Btaurus.UCSC.bosTau8" >> > >> >>>>>>> "BSgenome.Celegans.UCSC.ce10" >> > >> >>>>>>> >> > >> >>>>>>> [7] "BSgenome.Celegans.UCSC.ce2" >> > >> "BSgenome.Celegans.UCSC.ce6" >> > >> >>>>>>> >> > >> >>>>>>> [9] "BSgenome.Cfamiliaris.UCSC.canFam2" >> > >> >>>>>>> "BSgenome.Cfamiliaris.UCSC.canFam3" >> > >> >>>>>>> [11] "BSgenome.Dmelanogaster.UCSC.dm2" >> > >> >>>>>>> "BSgenome.Dmelanogaster.UCSC.dm3" >> > >> >>>>>>> [13] "BSgenome.Dmelanogaster.UCSC.dm6" >> > >> >>>>>> >> > >> >>>>>> "BSgenome.Drerio.UCSC.danRer5" >> > >> >>>>>>> >> > >> >>>>>>> >> > >> >>>>>>> [15] "BSgenome.Drerio.UCSC.danRer6" >> > >> >>>>>> >> > >> >>>>>> "BSgenome.Drerio.UCSC.danRer7" >> > >> >>>>>>> >> > >> >>>>>>> >> > >> >>>>>>> [17] "BSgenome.Ecoli.NCBI.20080805" >> > >> >>>>>>> "BSgenome.Gaculeatus.UCSC.gasAcu1" >> > >> >>>>>>> [19] "BSgenome.Ggallus.UCSC.galGal3" >> > >> >>>>>> >> > >> >>>>>> "BSgenome.Ggallus.UCSC.galGal4" >> > >> >>>>>>> >> > >> >>>>>>> >> > >> >>>>>>> [21] "BSgenome.Hsapiens.NCBI.GRCh38" >> > >> >>>>>>> "BSgenome.Hsapiens.UCSC.hg17" >> > >> >>>>>>> >> > >> >>>>>>> [23] "BSgenome.Hsapiens.UCSC.hg18" >> > >> >>>>>>> "BSgenome.Hsapiens.UCSC.hg19" >> > >> >>>>>>> >> > >> >>>>>>> [25] "BSgenome.Hsapiens.UCSC.hg38" >> > >> >>>>>>> "BSgenome.Mfascicularis.NCBI.5.0" >> > >> >>>>>>> [27] "BSgenome.Mfuro.UCSC.musFur1" >> > >> >>>>>> >> > >> >>>>>> "BSgenome.Mmulatta.UCSC.rheMac2" >> > >> >>>>>>> >> > >> >>>>>>> >> > >> >>>>>>> [29] "BSgenome.Mmulatta.UCSC.rheMac3" >> > >> >>>>>> >> > >> >>>>>> "BSgenome.Mmusculus.UCSC.mm10" >> > >> >>>>>>> >> > >> >>>>>>> >> > >> >>>>>>> [31] "BSgenome.Mmusculus.UCSC.mm8" >> > >> >>>>>>> "BSgenome.Mmusculus.UCSC.mm9" >> > >> >>>>>>> >> > >> >>>>>>> [33] "BSgenome.Ptroglodytes.UCSC.panTro2" >> > >> >>>>>>> "BSgenome.Ptroglodytes.UCSC.panTro3" >> > >> >>>>>>> [35] "BSgenome.Rnorvegicus.UCSC.rn4" >> > >> >>>>>> >> > >> >>>>>> "BSgenome.Rnorvegicus.UCSC.rn5" >> > >> >>>>>>> >> > >> >>>>>>> >> > >> >>>>>>> [37] "BSgenome.Rnorvegicus.UCSC.rn6" >> > >> >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer1" >> > >> >>>>>>> [39] "BSgenome.Scerevisiae.UCSC.sacCer2" >> > >> >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer3" >> > >> >>>>>>> [41] "BSgenome.Sscrofa.UCSC.susScr3" >> > >> >>>>>> >> > >> >>>>>> "BSgenome.Tguttata.UCSC.taeGut1" >> > >> >>>>>>> >> > >> >>>>>>> >> > >> >>>>>>> >> > >> >>>>>>> >> > >> >>>>>>> >> > >> >>>>>>> Am I insane for suggesting this? It would make things a >> > >> little >> > >> >>>>>>> easier >> > >> >>>>>> >> > >> >>>>>> for >> > >> >>>>>>> >> > >> >>>>>>> rtracklayer, most SummarizedExperiment and SE-derived >> > >> objects, blah, >> > >> >>>>>> >> > >> >>>>>> blah, >> > >> >>>>>>> >> > >> >>>>>>> blah... >> > >> >>>>>>> >> > >> >>>>>>> >> > >> >>>>>>> Best, >> > >> >>>>>>> >> > >> >>>>>>> --t >> > >> >>>>>>> >> > >> >>>>>>> >> > >> >>>>>>> >> > >> >>>>>>> >> > >> >>>>>>> Statistics is the grammar of science. >> > >> >>>>>>> Karl Pearson >> > >> <http://en.wikipedia.org/wiki/The_Grammar_of_Science> >> > >> >>>>>> >> > >> >>>>>> >> > >> >>>>>> [[alternative HTML version deleted]] >> > >> >>>>>> >> > >> >>>>>> _______________________________________________ >> > >> >>>>>> Bioc-devel@r-project.org <mailto: >> Bioc-devel@r-project.org> >> > >> mailing list >> > >> >>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel >> > >> >>> >> > >> >>> >> > >> >>> [[alternative HTML version deleted]] >> > >> >>> >> > >> >>> _______________________________________________ >> > >> >>> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> >> > >> mailing list >> > >> >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel >> > >> >> >> > >> >> >> > >> >> _______________________________________________ >> > >> >> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> >> > >> mailing list >> > >> >> https://stat.ethz.ch/mailman/listinfo/bioc-devel >> > >> >> >> > >> > >> > >> > -- >> > >> > Hervé Pagès >> > >> > >> > >> > Program in Computational Biology >> > >> > Division of Public Health Sciences >> > >> > Fred Hutchinson Cancer Research Center >> > >> > 1100 Fairview Ave. N, M1-B514 >> > >> > P.O. Box 19024 >> > >> > Seattle, WA 98109-1024 >> > >> > >> > >> > E-mail: hpa...@fredhutch.org <mailto:hpa...@fredhutch.org> >> > >> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> >> > >> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >> > >> > >> > >> > >> > >> > _______________________________________________ >> > >> > Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> >> > >> mailing list >> > >> > https://stat.ethz.ch/mailman/listinfo/bioc-devel >> > >> >> > >> >> > > >> > > -- >> > > Hervé Pagès >> > > >> > > Program in Computational Biology >> > > Division of Public Health Sciences >> > > Fred Hutchinson Cancer Research Center >> > > 1100 Fairview Ave. N, M1-B514 >> > > P.O. Box 19024 >> > > Seattle, WA 98109-1024 >> > > >> > > E-mail: hpa...@fredhutch.org >> > > Phone: (206) 667-5791 >> > > Fax: (206) 667-1319 >> > >> > _______________________________________________ >> > Bioc-devel@r-project.org mailing list >> > https://stat.ethz.ch/mailman/listinfo/bioc-devel >> > >> >> >> >> -- >> Gabriel Becker, Ph.D >> Computational Biologist >> Genentech Research >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioc-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/bioc-devel >> > > -- Gabriel Becker, Ph.D Computational Biologist Genentech Research [[alternative HTML version deleted]] _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel