Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

Hervé Pagès Fri, 05 Jun 2015 13:16:01 -0700

Hi Gabe,

I wouldn't say "wrong" but "not normalized". So maybe for example
everything is ok with their seqinfo but the seqlevels are not in
canonical order, and, for whatever reason, they want to keep them
that way (the actual order of the seqlevels impacts the sorting of
the ranges in the GRanges object). If it turns out that the genome
column is empty and they want to set it, then they would be left
with no way to do so.


Another concern I have is that genome<- has been the low-level setter
for the genome column for a while and changing its behavior now might
break some existing code. It's fixable, but we can avoid that.

H.

On 06/05/2015 11:51 AM, Gabe Becker wrote:

Herve,

This is probably a naive question, but what usecases are there for
creating an object with the wrong seqinfo for its genome?

~G

On Fri, Jun 5, 2015 at 11:43 AM, Michael Lawrence
<lawrence.mich...@gene.com <mailto:lawrence.mich...@gene.com>> wrote:

    On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès <hpa...@fredhutch.org
    <mailto:hpa...@fredhutch.org>> wrote:
     > I also think that we're heading towards something like that.
     >
     > So genome(gr) <- "hg19" would:
     >
     >   (a) Add any missing information to the seqinfo.
     >   (b) Sort the seqlevels in "canonical" order.
     >   (c) Change the seqlevels style to UCSC style if they are not.
     >
     > The 3 tasks are orthogonal. I guess most of the times people want
     > an easy way to perform them all at once.
     >
     > We could easily support (a) and (b). This assumes that the current
     > seqlevels are already valid hg19 seqlevels:
     >
     >   si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2",
    "chr6_cox_hap2"))
     >   gr1 <- GRanges(seqinfo=si1)
     >   hg19_si <- Seqinfo(genome="hg19")
     >
     >   ## (a):
     >   seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)]
     >   seqinfo(gr1)
     >   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
     >   #   seqnames       seqlengths isCircular genome
     >   #   chrX            155270560      FALSE   hg19
     >   #   chrUn_gl000249      38502      FALSE   hg19
     >   #   chr2            243199373      FALSE   hg19
     >   #   chr6_cox_hap2     4795371      FALSE   hg19
     >
     >   ## (b):
     >   seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1))
     >   seqinfo(gr1)
     >   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
     >   #   seqnames       seqlengths isCircular genome
     >   #   chr2            243199373      FALSE   hg19
     >   #   chrX            155270560      FALSE   hg19
     >   #   chr6_cox_hap2     4795371      FALSE   hg19
     >   #   chrUn_gl000249      38502      FALSE   hg19
     >
     > (c) is harder because seqlevelsStyle() doesn't know how to rename
     > scaffolds yet:
     >
     >   si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2",
    "HSCHR6_MHC_COX_CTG1"))
     >   gr2 <- GRanges(seqinfo=si2)
     >
     >   seqlevelsStyle(gr2)
     >   # [1] "NCBI"
     >
     >   seqlevelsStyle(gr2) <- "UCSC"
     >   seqlevels(gr2)
     >   # [1] "chrX"                 "HSCHRUN_RANDOM_CTG42" "chr2"
     >   # [4] "HSCHR6_MHC_COX_CTG1"
     >
     > So we need to work on this.
     >
     > I'm not sure about using genome(gr) <- "hg19" for this. Right now
     > it sets the genome column of the seqinfo with the supplied string
     > and nothing else. Aren't there valid use cases for this?

    Not sure. People would almost always want the seqname style and order
    to be consistent with the given genome.

    > What about
    > using seqinfo(gr) <- "hg19" instead? It kind of suggests that the
    > whole seqinfo component actually gets filled.
    >

    Yea, but "genome" is so intuitive compared to "seqinfo".



     > H.
     >
     > On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote:
     >>
     >> that's kind of always been my goal...
     >>
     >>
     >> Statistics is the grammar of science.
     >> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
     >>
     >> On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence
     >> <lawrence.mich...@gene.com <mailto:lawrence.mich...@gene.com>
    <mailto:lawrence.mich...@gene.com
    <mailto:lawrence.mich...@gene.com>>> wrote:
     >>
     >>     Maybe this could eventually support setting the seqinfo with:
     >>
     >>     genome(gr) <- "hg19"
     >>
     >>     Or is that being too clever?
     >>
     >>     On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès
    <hpa...@fredhutch.org <mailto:hpa...@fredhutch.org>
     >>     <mailto:hpa...@fredhutch.org <mailto:hpa...@fredhutch.org>>>
    wrote:
     >>      > Hi,
     >>      >
     >>      > FWIW I started to work on supporting quick generation of a
     >> standalone
     >>      > Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb.
     >>      >
     >>      > It already supports hg38, hg19, hg18, panTro4, panTro3,
    panTro2,
     >>      > bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1,
    musFur1,
     >> mm10,
     >>      > mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4,
     >> galGal3,
     >>      > gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2,
    sacCer3,
     >>      > and sacCer2. I'll add more.
     >>      >
     >>      > See ?Seqinfo for some examples.
     >>      >
     >>      > Right now it fetches the information from internet every
    time you
     >>      > call it but maybe we should just store that information
    in the
     >>      > GenomeInfoDb package as Tim suggested?
     >>      >
     >>      > H.
     >>      >
     >>      >
     >>      > On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote:
     >>      >>
     >>      >> That would be perfect actually.  And it would radically
    reduce &
     >>      >> modularize maintenance.  Maybe that's the best way to go
    after
     >>     all.  Quite
     >>      >> sensible.
     >>      >>
     >>      >> --t
     >>      >>
     >>      >>> On Jun 3, 2015, at 12:46 PM, Vincent Carey
     >>     <st...@channing.harvard.edu
    <mailto:st...@channing.harvard.edu>
    <mailto:st...@channing.harvard.edu <mailto:st...@channing.harvard.edu>>>
     >>      >>> wrote:
     >>      >>>
     >>      >>> It really isn't hard to have multiple OrganismDb
    packages in
     >>     place -- the
     >>      >>> process of making new ones is documented and was given
    as an
     >>     exercise in
     >>      >>> the EdX course.  I don't know if we want to
    institutionalize it
     >> and
     >>      >>> distribute such -- I think we might, so that there would be
     >>     Hs19, Hs38,
     >>      >>> mm9, etc. packages.  They have very little content,
    they just
     >>     coordinate
     >>      >>> interactions with packages that you'll already have.
     >>      >>>
     >>      >>> On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr.
     >>     <tim.tri...@gmail.com <mailto:tim.tri...@gmail.com>
    <mailto:tim.tri...@gmail.com <mailto:tim.tri...@gmail.com>>>
     >>
     >>      >>> wrote:
     >>      >>>
     >>      >>>> Right, I typically do that too, and if you're working
    on human
     >>     data it
     >>      >>>> isn't a big deal.  What makes things a lot more of a
    drag is
     >>     when you
     >>      >>>> work
     >>      >>>> on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38)
    where
     >>      >>>> Mus.musculus
     >>      >>>> is essentially a "build ahead" of Homo.sapiens.
     >>      >>>>
     >>      >>>> R> seqinfo(Homo.sapiens)
     >>      >>>> Seqinfo object with 93 sequences (1 circular) from
    hg19 genome
     >>      >>>>
     >>      >>>> R> seqinfo(Mus.musculus)
     >>      >>>> Seqinfo object with 66 sequences (1 circular) from
    mm10 genome:
     >>      >>>>
     >>      >>>> It's not as explicit as directly assigning the seqinfo
    from a
     >>     genome
     >>      >>>> that
     >>      >>>> corresponds to that of your
    annotations/results/whatever.  I
     >>     know we
     >>      >>>> could
     >>      >>>> all use crossmap or liftOver or whatever, but that's not
     >>     really the
     >>      >>>> same,
     >>      >>>> and it takes time, whereas assigning the proper
    seqinfo for
     >>      >>>> relationships
     >>      >>>> is very fast.
     >>      >>>>
     >>      >>>> That's all I was getting at...
     >>      >>>>
     >>      >>>>
     >>      >>>> Statistics is the grammar of science.
     >>      >>>> Karl Pearson
     >> <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
     >>      >>>>
     >>      >>>> On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey
     >>      >>>> <st...@channing.harvard.edu
    <mailto:st...@channing.harvard.edu>
    <mailto:st...@channing.harvard.edu <mailto:st...@channing.harvard.edu>>
     >>      >>>>>
     >>      >>>>> wrote:
     >>      >>>>
     >>      >>>>
     >>      >>>>> I typically get this info from Homo.sapiens.  The
    result is
     >>     parasitic
     >>      >>>>> on
     >>      >>>>> the TxDb that is in there.  I don't know how easy it
    is to swap
     >>      >>>>> alternate
     >>      >>>>> TxDb in to get a different build.  I think it would
    make sense
     >> to
     >>      >>>>> regard
     >>      >>>>> the OrganismDb instances as foundational for this sort of
     >>     structural
     >>      >>>>> data.
     >>      >>>>>
     >>      >>>>> On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen <
     >>      >>>>> kasperdanielhan...@gmail.com
    <mailto:kasperdanielhan...@gmail.com>
     >>     <mailto:kasperdanielhan...@gmail.com
    <mailto:kasperdanielhan...@gmail.com>>> wrote:
     >>      >>>>>
     >>      >>>>>> Let me rephrase this slightly.  From one POV the
    purpose of
     >>      >>>>>> GenomeInfoDb
     >>      >>>>>> is
     >>      >>>>>> clean up the seqinfo slot.  Currently it does most
    of the
     >>     cleaning,
     >>      >>>>>> but
     >>      >>>>>> it
     >>      >>>>>> does not add seqlengths.
     >>      >>>>>>
     >>      >>>>>> It is clear that seqlengths depends on the version
    of the
     >>     genome, but
     >>      >>>>>> I
     >>      >>>>>> will argue so does the seqnames.  Of course, for human,
     >>     chr22 will not
     >>      >>>>>> change.  But what about the names of all the random
     >>     contigs?  Or for
     >>      >>>>>> other
     >>      >>>>>> organisms, what about going from a draft genome with 10k
     >>     contigs to a
     >>      >>>>>> more
     >>      >>>>>> completely genome assembled into fewer, larger
    chromosomes.
     >>      >>>>>>
     >>      >>>>>> I acknowledge that this information is present in
    the BSgenome
     >>      >>>>>> packages,
     >>      >>>>>> but it seems (to me) to be very appropriate to have them
     >>     around for
     >>      >>>>>> cleaning up the seqinfo slot.  For some situations
    it is not
     >>     great to
     >>      >>>>>> depend on 1 GB> download for something that is a few
    bytes.
     >>      >>>>>>
     >>      >>>>>> Best,
     >>      >>>>>> Kasper
     >>      >>>>>>
     >>      >>>>>> On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr.
     >>     <tim.tri...@gmail.com <mailto:tim.tri...@gmail.com>
    <mailto:tim.tri...@gmail.com <mailto:tim.tri...@gmail.com>>>
     >>
     >>      >>>>>> wrote:
     >>      >>>>>>
     >>      >>>>>>> It would be nice (for a number of reasons) to have
     >>     chromosome lengths
     >>      >>>>>>> readily available in a foundational package like
     >>     GenomeInfoDb, so
     >>      >>>>>>> that,
     >>      >>>>>>> say,
     >>      >>>>>>>
     >>      >>>>>>> data(seqinfo.hg19)
     >>      >>>>>>> seqinfo(myResults) <- seqinfo.hg19[
    seqlevels(myResults) ]
     >>      >>>>>>>
     >>      >>>>>>> would work without issues.  Is there any particular
    reason
     >> this
     >>      >>>>>>
     >>      >>>>>> couldn't
     >>      >>>>>>>
     >>      >>>>>>> happen for the supported/available BSgenomes?  It would
     >>     seem like a
     >>      >>>>>>
     >>      >>>>>> simple
     >>      >>>>>>>
     >>      >>>>>>> matter to do
     >>      >>>>>>>
     >>      >>>>>>> R> library(BSgenome.Hsapiens.UCSC.hg19)
     >>      >>>>>>> R> seqinfo.hg19 <- seqinfo(Hsapiens)
     >>      >>>>>>> R> save(seqinfo.hg19,
     >>      >>>>>>> file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda")
     >>      >>>>>>>
     >>      >>>>>>> and be done with it until (say) the next release or
    next
     >>     released
     >>      >>>>>>> BSgenome.  I considered looping through the following
     >> BSgenomes
     >>      >>>>>>
     >>      >>>>>> myself...
     >>      >>>>>>>
     >>      >>>>>>> and if it isn't strongly opposed by (everyone) I
    may still
     >>     do exactly
     >>      >>>>>>> that.  Seems useful, no?
     >>      >>>>>>>
     >>      >>>>>>> e.g. for the following 42 builds,
     >>      >>>>>>>
     >>      >>>>>>> grep("(UCSC|NCBI)", unique(gsub(".masked", "",
     >>     available.genomes())),
     >>      >>>>>>> value=TRUE)
     >>      >>>>>>> [1] "BSgenome.Amellifera.UCSC.apiMel2"
     >>      >>>>>>
     >>      >>>>>> "BSgenome.Btaurus.UCSC.bosTau3"
     >>      >>>>>>>
     >>      >>>>>>>
     >>      >>>>>>> [3] "BSgenome.Btaurus.UCSC.bosTau4"
     >>      >>>>>>
     >>      >>>>>> "BSgenome.Btaurus.UCSC.bosTau6"
     >>      >>>>>>>
     >>      >>>>>>>
     >>      >>>>>>> [5] "BSgenome.Btaurus.UCSC.bosTau8"
     >>      >>>>>>> "BSgenome.Celegans.UCSC.ce10"
     >>      >>>>>>>
     >>      >>>>>>> [7] "BSgenome.Celegans.UCSC.ce2"
     >>       "BSgenome.Celegans.UCSC.ce6"
     >>      >>>>>>>
     >>      >>>>>>> [9] "BSgenome.Cfamiliaris.UCSC.canFam2"
     >>      >>>>>>> "BSgenome.Cfamiliaris.UCSC.canFam3"
     >>      >>>>>>> [11] "BSgenome.Dmelanogaster.UCSC.dm2"
     >>      >>>>>>> "BSgenome.Dmelanogaster.UCSC.dm3"
     >>      >>>>>>> [13] "BSgenome.Dmelanogaster.UCSC.dm6"
     >>      >>>>>>
     >>      >>>>>> "BSgenome.Drerio.UCSC.danRer5"
     >>      >>>>>>>
     >>      >>>>>>>
     >>      >>>>>>> [15] "BSgenome.Drerio.UCSC.danRer6"
     >>      >>>>>>
     >>      >>>>>> "BSgenome.Drerio.UCSC.danRer7"
     >>      >>>>>>>
     >>      >>>>>>>
     >>      >>>>>>> [17] "BSgenome.Ecoli.NCBI.20080805"
     >>      >>>>>>> "BSgenome.Gaculeatus.UCSC.gasAcu1"
     >>      >>>>>>> [19] "BSgenome.Ggallus.UCSC.galGal3"
     >>      >>>>>>
     >>      >>>>>> "BSgenome.Ggallus.UCSC.galGal4"
     >>      >>>>>>>
     >>      >>>>>>>
     >>      >>>>>>> [21] "BSgenome.Hsapiens.NCBI.GRCh38"
     >>      >>>>>>> "BSgenome.Hsapiens.UCSC.hg17"
     >>      >>>>>>>
     >>      >>>>>>> [23] "BSgenome.Hsapiens.UCSC.hg18"
     >>      >>>>>>> "BSgenome.Hsapiens.UCSC.hg19"
     >>      >>>>>>>
     >>      >>>>>>> [25] "BSgenome.Hsapiens.UCSC.hg38"
     >>      >>>>>>> "BSgenome.Mfascicularis.NCBI.5.0"
     >>      >>>>>>> [27] "BSgenome.Mfuro.UCSC.musFur1"
     >>      >>>>>>
     >>      >>>>>> "BSgenome.Mmulatta.UCSC.rheMac2"
     >>      >>>>>>>
     >>      >>>>>>>
     >>      >>>>>>> [29] "BSgenome.Mmulatta.UCSC.rheMac3"
     >>      >>>>>>
     >>      >>>>>> "BSgenome.Mmusculus.UCSC.mm10"
     >>      >>>>>>>
     >>      >>>>>>>
     >>      >>>>>>> [31] "BSgenome.Mmusculus.UCSC.mm8"
     >>      >>>>>>> "BSgenome.Mmusculus.UCSC.mm9"
     >>      >>>>>>>
     >>      >>>>>>> [33] "BSgenome.Ptroglodytes.UCSC.panTro2"
     >>      >>>>>>> "BSgenome.Ptroglodytes.UCSC.panTro3"
     >>      >>>>>>> [35] "BSgenome.Rnorvegicus.UCSC.rn4"
     >>      >>>>>>
     >>      >>>>>> "BSgenome.Rnorvegicus.UCSC.rn5"
     >>      >>>>>>>
     >>      >>>>>>>
     >>      >>>>>>> [37] "BSgenome.Rnorvegicus.UCSC.rn6"
     >>      >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer1"
     >>      >>>>>>> [39] "BSgenome.Scerevisiae.UCSC.sacCer2"
     >>      >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer3"
     >>      >>>>>>> [41] "BSgenome.Sscrofa.UCSC.susScr3"
     >>      >>>>>>
     >>      >>>>>> "BSgenome.Tguttata.UCSC.taeGut1"
     >>      >>>>>>>
     >>      >>>>>>>
     >>      >>>>>>>
     >>      >>>>>>>
     >>      >>>>>>>
     >>      >>>>>>> Am I insane for suggesting this?  It would make
    things a
     >> little
     >>      >>>>>>> easier
     >>      >>>>>>
     >>      >>>>>> for
     >>      >>>>>>>
     >>      >>>>>>> rtracklayer, most SummarizedExperiment and SE-derived
     >>     objects, blah,
     >>      >>>>>>
     >>      >>>>>> blah,
     >>      >>>>>>>
     >>      >>>>>>> blah...
     >>      >>>>>>>
     >>      >>>>>>>
     >>      >>>>>>> Best,
     >>      >>>>>>>
     >>      >>>>>>> --t
     >>      >>>>>>>
     >>      >>>>>>>
     >>      >>>>>>>
     >>      >>>>>>>
     >>      >>>>>>> Statistics is the grammar of science.
     >>      >>>>>>> Karl Pearson
     >>     <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
     >>      >>>>>>
     >>      >>>>>>
     >>      >>>>>>         [[alternative HTML version deleted]]
     >>      >>>>>>
     >>      >>>>>> _______________________________________________
     >>      >>>>>> Bioc-devel@r-project.org
    <mailto:Bioc-devel@r-project.org> <mailto:Bioc-devel@r-project.org
    <mailto:Bioc-devel@r-project.org>>
     >>     mailing list
     >>      >>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
     >>      >>>
     >>      >>>
     >>      >>>     [[alternative HTML version deleted]]
     >>      >>>
     >>      >>> _______________________________________________
     >>      >>> Bioc-devel@r-project.org
    <mailto:Bioc-devel@r-project.org> <mailto:Bioc-devel@r-project.org
    <mailto:Bioc-devel@r-project.org>>
     >>     mailing list
     >>      >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
     >>      >>
     >>      >>
     >>      >> _______________________________________________
     >>      >> Bioc-devel@r-project.org
    <mailto:Bioc-devel@r-project.org> <mailto:Bioc-devel@r-project.org
    <mailto:Bioc-devel@r-project.org>>
     >>     mailing list
     >>      >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
     >>      >>
     >>      >
     >>      > --
     >>      > Hervé Pagès
     >>      >
     >>      > Program in Computational Biology
     >>      > Division of Public Health Sciences
     >>      > Fred Hutchinson Cancer Research Center
     >>      > 1100 Fairview Ave. N, M1-B514
     >>      > P.O. Box 19024
     >>      > Seattle, WA 98109-1024
     >>      >
     >>      > E-mail: hpa...@fredhutch.org
    <mailto:hpa...@fredhutch.org> <mailto:hpa...@fredhutch.org
    <mailto:hpa...@fredhutch.org>>
     >>      > Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
    <tel:%28206%29%20667-5791>
     >>      > Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
    <tel:%28206%29%20667-1319>
     >>      >
     >>      >
     >>      > _______________________________________________
     >>      > Bioc-devel@r-project.org
    <mailto:Bioc-devel@r-project.org> <mailto:Bioc-devel@r-project.org
    <mailto:Bioc-devel@r-project.org>>
     >>     mailing list
     >>      > https://stat.ethz.ch/mailman/listinfo/bioc-devel
     >>
     >>
     >
     > --
     > Hervé Pagès
     >
     > Program in Computational Biology
     > Division of Public Health Sciences
     > Fred Hutchinson Cancer Research Center
     > 1100 Fairview Ave. N, M1-B514
     > P.O. Box 19024
     > Seattle, WA 98109-1024
     >
     > E-mail: hpa...@fredhutch.org <mailto:hpa...@fredhutch.org>
     > Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
     > Fax: (206) 667-1319 <tel:%28206%29%20667-1319>

    _______________________________________________
    Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> mailing list
    https://stat.ethz.ch/mailman/listinfo/bioc-devel




--
Gabriel Becker, Ph.D
Computational Biologist
Genentech Research


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

Reply via email to