In WGBS we frequently sequence a human with spikein from the lambda genome. In this case, most of the chromosomes of the Granges are from human, except one. This is a usecase where genome(GR) is not constant. I suggest, partly for compatibility, to keep genome, but perhaps do something like standardizeGenome() or something like this.
I would indeed love, love, love a function which just cleans it up. Kasper On Fri, Jun 5, 2015 at 2:51 PM, Gabe Becker <becker.g...@gene.com> wrote: > Herve, > > This is probably a naive question, but what usecases are there for creating > an object with the wrong seqinfo for its genome? > > ~G > > On Fri, Jun 5, 2015 at 11:43 AM, Michael Lawrence < > lawrence.mich...@gene.com > > wrote: > > > On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès <hpa...@fredhutch.org> > wrote: > > > I also think that we're heading towards something like that. > > > > > > So genome(gr) <- "hg19" would: > > > > > > (a) Add any missing information to the seqinfo. > > > (b) Sort the seqlevels in "canonical" order. > > > (c) Change the seqlevels style to UCSC style if they are not. > > > > > > The 3 tasks are orthogonal. I guess most of the times people want > > > an easy way to perform them all at once. > > > > > > We could easily support (a) and (b). This assumes that the current > > > seqlevels are already valid hg19 seqlevels: > > > > > > si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2", "chr6_cox_hap2")) > > > gr1 <- GRanges(seqinfo=si1) > > > hg19_si <- Seqinfo(genome="hg19") > > > > > > ## (a): > > > seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)] > > > seqinfo(gr1) > > > # Seqinfo object with 4 sequences (1 circular) from hg19 genome: > > > # seqnames seqlengths isCircular genome > > > # chrX 155270560 FALSE hg19 > > > # chrUn_gl000249 38502 FALSE hg19 > > > # chr2 243199373 FALSE hg19 > > > # chr6_cox_hap2 4795371 FALSE hg19 > > > > > > ## (b): > > > seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1)) > > > seqinfo(gr1) > > > # Seqinfo object with 4 sequences (1 circular) from hg19 genome: > > > # seqnames seqlengths isCircular genome > > > # chr2 243199373 FALSE hg19 > > > # chrX 155270560 FALSE hg19 > > > # chr6_cox_hap2 4795371 FALSE hg19 > > > # chrUn_gl000249 38502 FALSE hg19 > > > > > > (c) is harder because seqlevelsStyle() doesn't know how to rename > > > scaffolds yet: > > > > > > si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2", > > "HSCHR6_MHC_COX_CTG1")) > > > gr2 <- GRanges(seqinfo=si2) > > > > > > seqlevelsStyle(gr2) > > > # [1] "NCBI" > > > > > > seqlevelsStyle(gr2) <- "UCSC" > > > seqlevels(gr2) > > > # [1] "chrX" "HSCHRUN_RANDOM_CTG42" "chr2" > > > # [4] "HSCHR6_MHC_COX_CTG1" > > > > > > So we need to work on this. > > > > > > I'm not sure about using genome(gr) <- "hg19" for this. Right now > > > it sets the genome column of the seqinfo with the supplied string > > > and nothing else. Aren't there valid use cases for this? > > > > Not sure. People would almost always want the seqname style and order > > to be consistent with the given genome. > > > > > What about > > > using seqinfo(gr) <- "hg19" instead? It kind of suggests that the > > > whole seqinfo component actually gets filled. > > > > > > > Yea, but "genome" is so intuitive compared to "seqinfo". > > > > > > > > > H. > > > > > > On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote: > > >> > > >> that's kind of always been my goal... > > >> > > >> > > >> Statistics is the grammar of science. > > >> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science> > > >> > > >> On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence > > >> <lawrence.mich...@gene.com <mailto:lawrence.mich...@gene.com>> wrote: > > >> > > >> Maybe this could eventually support setting the seqinfo with: > > >> > > >> genome(gr) <- "hg19" > > >> > > >> Or is that being too clever? > > >> > > >> On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès <hpa...@fredhutch.org > > >> <mailto:hpa...@fredhutch.org>> wrote: > > >> > Hi, > > >> > > > >> > FWIW I started to work on supporting quick generation of a > > >> standalone > > >> > Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb. > > >> > > > >> > It already supports hg38, hg19, hg18, panTro4, panTro3, > panTro2, > > >> > bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1, > > >> mm10, > > >> > mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4, > > >> galGal3, > > >> > gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, > > sacCer3, > > >> > and sacCer2. I'll add more. > > >> > > > >> > See ?Seqinfo for some examples. > > >> > > > >> > Right now it fetches the information from internet every time > you > > >> > call it but maybe we should just store that information in the > > >> > GenomeInfoDb package as Tim suggested? > > >> > > > >> > H. > > >> > > > >> > > > >> > On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote: > > >> >> > > >> >> That would be perfect actually. And it would radically > reduce & > > >> >> modularize maintenance. Maybe that's the best way to go after > > >> all. Quite > > >> >> sensible. > > >> >> > > >> >> --t > > >> >> > > >> >>> On Jun 3, 2015, at 12:46 PM, Vincent Carey > > >> <st...@channing.harvard.edu <mailto:st...@channing.harvard.edu>> > > >> >>> wrote: > > >> >>> > > >> >>> It really isn't hard to have multiple OrganismDb packages in > > >> place -- the > > >> >>> process of making new ones is documented and was given as an > > >> exercise in > > >> >>> the EdX course. I don't know if we want to institutionalize > it > > >> and > > >> >>> distribute such -- I think we might, so that there would be > > >> Hs19, Hs38, > > >> >>> mm9, etc. packages. They have very little content, they just > > >> coordinate > > >> >>> interactions with packages that you'll already have. > > >> >>> > > >> >>> On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr. > > >> <tim.tri...@gmail.com <mailto:tim.tri...@gmail.com>> > > >> > > >> >>> wrote: > > >> >>> > > >> >>>> Right, I typically do that too, and if you're working on > human > > >> data it > > >> >>>> isn't a big deal. What makes things a lot more of a drag is > > >> when you > > >> >>>> work > > >> >>>> on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38) where > > >> >>>> Mus.musculus > > >> >>>> is essentially a "build ahead" of Homo.sapiens. > > >> >>>> > > >> >>>> R> seqinfo(Homo.sapiens) > > >> >>>> Seqinfo object with 93 sequences (1 circular) from hg19 > genome > > >> >>>> > > >> >>>> R> seqinfo(Mus.musculus) > > >> >>>> Seqinfo object with 66 sequences (1 circular) from mm10 > > genome: > > >> >>>> > > >> >>>> It's not as explicit as directly assigning the seqinfo from > a > > >> genome > > >> >>>> that > > >> >>>> corresponds to that of your annotations/results/whatever. I > > >> know we > > >> >>>> could > > >> >>>> all use crossmap or liftOver or whatever, but that's not > > >> really the > > >> >>>> same, > > >> >>>> and it takes time, whereas assigning the proper seqinfo for > > >> >>>> relationships > > >> >>>> is very fast. > > >> >>>> > > >> >>>> That's all I was getting at... > > >> >>>> > > >> >>>> > > >> >>>> Statistics is the grammar of science. > > >> >>>> Karl Pearson > > >> <http://en.wikipedia.org/wiki/The_Grammar_of_Science> > > >> >>>> > > >> >>>> On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey > > >> >>>> <st...@channing.harvard.edu <mailto: > > st...@channing.harvard.edu> > > >> >>>>> > > >> >>>>> wrote: > > >> >>>> > > >> >>>> > > >> >>>>> I typically get this info from Homo.sapiens. The result is > > >> parasitic > > >> >>>>> on > > >> >>>>> the TxDb that is in there. I don't know how easy it is to > > swap > > >> >>>>> alternate > > >> >>>>> TxDb in to get a different build. I think it would make > > sense > > >> to > > >> >>>>> regard > > >> >>>>> the OrganismDb instances as foundational for this sort of > > >> structural > > >> >>>>> data. > > >> >>>>> > > >> >>>>> On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen < > > >> >>>>> kasperdanielhan...@gmail.com > > >> <mailto:kasperdanielhan...@gmail.com>> wrote: > > >> >>>>> > > >> >>>>>> Let me rephrase this slightly. From one POV the purpose > of > > >> >>>>>> GenomeInfoDb > > >> >>>>>> is > > >> >>>>>> clean up the seqinfo slot. Currently it does most of the > > >> cleaning, > > >> >>>>>> but > > >> >>>>>> it > > >> >>>>>> does not add seqlengths. > > >> >>>>>> > > >> >>>>>> It is clear that seqlengths depends on the version of the > > >> genome, but > > >> >>>>>> I > > >> >>>>>> will argue so does the seqnames. Of course, for human, > > >> chr22 will not > > >> >>>>>> change. But what about the names of all the random > > >> contigs? Or for > > >> >>>>>> other > > >> >>>>>> organisms, what about going from a draft genome with 10k > > >> contigs to a > > >> >>>>>> more > > >> >>>>>> completely genome assembled into fewer, larger > chromosomes. > > >> >>>>>> > > >> >>>>>> I acknowledge that this information is present in the > > BSgenome > > >> >>>>>> packages, > > >> >>>>>> but it seems (to me) to be very appropriate to have them > > >> around for > > >> >>>>>> cleaning up the seqinfo slot. For some situations it is > not > > >> great to > > >> >>>>>> depend on 1 GB> download for something that is a few > bytes. > > >> >>>>>> > > >> >>>>>> Best, > > >> >>>>>> Kasper > > >> >>>>>> > > >> >>>>>> On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr. > > >> <tim.tri...@gmail.com <mailto:tim.tri...@gmail.com>> > > >> > > >> >>>>>> wrote: > > >> >>>>>> > > >> >>>>>>> It would be nice (for a number of reasons) to have > > >> chromosome lengths > > >> >>>>>>> readily available in a foundational package like > > >> GenomeInfoDb, so > > >> >>>>>>> that, > > >> >>>>>>> say, > > >> >>>>>>> > > >> >>>>>>> data(seqinfo.hg19) > > >> >>>>>>> seqinfo(myResults) <- seqinfo.hg19[ seqlevels(myResults) > ] > > >> >>>>>>> > > >> >>>>>>> would work without issues. Is there any particular > reason > > >> this > > >> >>>>>> > > >> >>>>>> couldn't > > >> >>>>>>> > > >> >>>>>>> happen for the supported/available BSgenomes? It would > > >> seem like a > > >> >>>>>> > > >> >>>>>> simple > > >> >>>>>>> > > >> >>>>>>> matter to do > > >> >>>>>>> > > >> >>>>>>> R> library(BSgenome.Hsapiens.UCSC.hg19) > > >> >>>>>>> R> seqinfo.hg19 <- seqinfo(Hsapiens) > > >> >>>>>>> R> save(seqinfo.hg19, > > >> >>>>>>> file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda") > > >> >>>>>>> > > >> >>>>>>> and be done with it until (say) the next release or next > > >> released > > >> >>>>>>> BSgenome. I considered looping through the following > > >> BSgenomes > > >> >>>>>> > > >> >>>>>> myself... > > >> >>>>>>> > > >> >>>>>>> and if it isn't strongly opposed by (everyone) I may > still > > >> do exactly > > >> >>>>>>> that. Seems useful, no? > > >> >>>>>>> > > >> >>>>>>> e.g. for the following 42 builds, > > >> >>>>>>> > > >> >>>>>>> grep("(UCSC|NCBI)", unique(gsub(".masked", "", > > >> available.genomes())), > > >> >>>>>>> value=TRUE) > > >> >>>>>>> [1] "BSgenome.Amellifera.UCSC.apiMel2" > > >> >>>>>> > > >> >>>>>> "BSgenome.Btaurus.UCSC.bosTau3" > > >> >>>>>>> > > >> >>>>>>> > > >> >>>>>>> [3] "BSgenome.Btaurus.UCSC.bosTau4" > > >> >>>>>> > > >> >>>>>> "BSgenome.Btaurus.UCSC.bosTau6" > > >> >>>>>>> > > >> >>>>>>> > > >> >>>>>>> [5] "BSgenome.Btaurus.UCSC.bosTau8" > > >> >>>>>>> "BSgenome.Celegans.UCSC.ce10" > > >> >>>>>>> > > >> >>>>>>> [7] "BSgenome.Celegans.UCSC.ce2" > > >> "BSgenome.Celegans.UCSC.ce6" > > >> >>>>>>> > > >> >>>>>>> [9] "BSgenome.Cfamiliaris.UCSC.canFam2" > > >> >>>>>>> "BSgenome.Cfamiliaris.UCSC.canFam3" > > >> >>>>>>> [11] "BSgenome.Dmelanogaster.UCSC.dm2" > > >> >>>>>>> "BSgenome.Dmelanogaster.UCSC.dm3" > > >> >>>>>>> [13] "BSgenome.Dmelanogaster.UCSC.dm6" > > >> >>>>>> > > >> >>>>>> "BSgenome.Drerio.UCSC.danRer5" > > >> >>>>>>> > > >> >>>>>>> > > >> >>>>>>> [15] "BSgenome.Drerio.UCSC.danRer6" > > >> >>>>>> > > >> >>>>>> "BSgenome.Drerio.UCSC.danRer7" > > >> >>>>>>> > > >> >>>>>>> > > >> >>>>>>> [17] "BSgenome.Ecoli.NCBI.20080805" > > >> >>>>>>> "BSgenome.Gaculeatus.UCSC.gasAcu1" > > >> >>>>>>> [19] "BSgenome.Ggallus.UCSC.galGal3" > > >> >>>>>> > > >> >>>>>> "BSgenome.Ggallus.UCSC.galGal4" > > >> >>>>>>> > > >> >>>>>>> > > >> >>>>>>> [21] "BSgenome.Hsapiens.NCBI.GRCh38" > > >> >>>>>>> "BSgenome.Hsapiens.UCSC.hg17" > > >> >>>>>>> > > >> >>>>>>> [23] "BSgenome.Hsapiens.UCSC.hg18" > > >> >>>>>>> "BSgenome.Hsapiens.UCSC.hg19" > > >> >>>>>>> > > >> >>>>>>> [25] "BSgenome.Hsapiens.UCSC.hg38" > > >> >>>>>>> "BSgenome.Mfascicularis.NCBI.5.0" > > >> >>>>>>> [27] "BSgenome.Mfuro.UCSC.musFur1" > > >> >>>>>> > > >> >>>>>> "BSgenome.Mmulatta.UCSC.rheMac2" > > >> >>>>>>> > > >> >>>>>>> > > >> >>>>>>> [29] "BSgenome.Mmulatta.UCSC.rheMac3" > > >> >>>>>> > > >> >>>>>> "BSgenome.Mmusculus.UCSC.mm10" > > >> >>>>>>> > > >> >>>>>>> > > >> >>>>>>> [31] "BSgenome.Mmusculus.UCSC.mm8" > > >> >>>>>>> "BSgenome.Mmusculus.UCSC.mm9" > > >> >>>>>>> > > >> >>>>>>> [33] "BSgenome.Ptroglodytes.UCSC.panTro2" > > >> >>>>>>> "BSgenome.Ptroglodytes.UCSC.panTro3" > > >> >>>>>>> [35] "BSgenome.Rnorvegicus.UCSC.rn4" > > >> >>>>>> > > >> >>>>>> "BSgenome.Rnorvegicus.UCSC.rn5" > > >> >>>>>>> > > >> >>>>>>> > > >> >>>>>>> [37] "BSgenome.Rnorvegicus.UCSC.rn6" > > >> >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer1" > > >> >>>>>>> [39] "BSgenome.Scerevisiae.UCSC.sacCer2" > > >> >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer3" > > >> >>>>>>> [41] "BSgenome.Sscrofa.UCSC.susScr3" > > >> >>>>>> > > >> >>>>>> "BSgenome.Tguttata.UCSC.taeGut1" > > >> >>>>>>> > > >> >>>>>>> > > >> >>>>>>> > > >> >>>>>>> > > >> >>>>>>> > > >> >>>>>>> Am I insane for suggesting this? It would make things a > > >> little > > >> >>>>>>> easier > > >> >>>>>> > > >> >>>>>> for > > >> >>>>>>> > > >> >>>>>>> rtracklayer, most SummarizedExperiment and SE-derived > > >> objects, blah, > > >> >>>>>> > > >> >>>>>> blah, > > >> >>>>>>> > > >> >>>>>>> blah... > > >> >>>>>>> > > >> >>>>>>> > > >> >>>>>>> Best, > > >> >>>>>>> > > >> >>>>>>> --t > > >> >>>>>>> > > >> >>>>>>> > > >> >>>>>>> > > >> >>>>>>> > > >> >>>>>>> Statistics is the grammar of science. > > >> >>>>>>> Karl Pearson > > >> <http://en.wikipedia.org/wiki/The_Grammar_of_Science> > > >> >>>>>> > > >> >>>>>> > > >> >>>>>> [[alternative HTML version deleted]] > > >> >>>>>> > > >> >>>>>> _______________________________________________ > > >> >>>>>> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org > > > > >> mailing list > > >> >>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel > > >> >>> > > >> >>> > > >> >>> [[alternative HTML version deleted]] > > >> >>> > > >> >>> _______________________________________________ > > >> >>> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> > > >> mailing list > > >> >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel > > >> >> > > >> >> > > >> >> _______________________________________________ > > >> >> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> > > >> mailing list > > >> >> https://stat.ethz.ch/mailman/listinfo/bioc-devel > > >> >> > > >> > > > >> > -- > > >> > Hervé Pagès > > >> > > > >> > Program in Computational Biology > > >> > Division of Public Health Sciences > > >> > Fred Hutchinson Cancer Research Center > > >> > 1100 Fairview Ave. N, M1-B514 > > >> > P.O. Box 19024 > > >> > Seattle, WA 98109-1024 > > >> > > > >> > E-mail: hpa...@fredhutch.org <mailto:hpa...@fredhutch.org> > > >> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > > >> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > >> > > > >> > > > >> > _______________________________________________ > > >> > Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> > > >> mailing list > > >> > https://stat.ethz.ch/mailman/listinfo/bioc-devel > > >> > > >> > > > > > > -- > > > Hervé Pagès > > > > > > Program in Computational Biology > > > Division of Public Health Sciences > > > Fred Hutchinson Cancer Research Center > > > 1100 Fairview Ave. N, M1-B514 > > > P.O. Box 19024 > > > Seattle, WA 98109-1024 > > > > > > E-mail: hpa...@fredhutch.org > > > Phone: (206) 667-5791 > > > Fax: (206) 667-1319 > > > > _______________________________________________ > > Bioc-devel@r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/bioc-devel > > > > > > -- > Gabriel Becker, Ph.D > Computational Biologist > Genentech Research > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioc-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel > [[alternative HTML version deleted]] _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel