Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

Hervé Pagès Thu, 04 Jun 2015 23:50:13 -0700

I also think that we're heading towards something like that.


So genome(gr) <- "hg19" would:

  (a) Add any missing information to the seqinfo.
  (b) Sort the seqlevels in "canonical" order.
  (c) Change the seqlevels style to UCSC style if they are not.

The 3 tasks are orthogonal. I guess most of the times people want
an easy way to perform them all at once.

We could easily support (a) and (b). This assumes that the current
seqlevels are already valid hg19 seqlevels:

  si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2", "chr6_cox_hap2"))
  gr1 <- GRanges(seqinfo=si1)
  hg19_si <- Seqinfo(genome="hg19")

  ## (a):
  seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)]
  seqinfo(gr1)
  # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
  #   seqnames       seqlengths isCircular genome
  #   chrX            155270560      FALSE   hg19
  #   chrUn_gl000249      38502      FALSE   hg19
  #   chr2            243199373      FALSE   hg19
  #   chr6_cox_hap2     4795371      FALSE   hg19

  ## (b):
  seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1))
  seqinfo(gr1)
  # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
  #   seqnames       seqlengths isCircular genome
  #   chr2            243199373      FALSE   hg19
  #   chrX            155270560      FALSE   hg19
  #   chr6_cox_hap2     4795371      FALSE   hg19
  #   chrUn_gl000249      38502      FALSE   hg19

(c) is harder because seqlevelsStyle() doesn't know how to rename
scaffolds yet:

si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2","HSCHR6_MHC_COX_CTG1"))

  gr2 <- GRanges(seqinfo=si2)

  seqlevelsStyle(gr2)
  # [1] "NCBI"

  seqlevelsStyle(gr2) <- "UCSC"
  seqlevels(gr2)

# [1] "chrX" "HSCHRUN_RANDOM_CTG42" "chr2"

  # [4] "HSCHR6_MHC_COX_CTG1"

So we need to work on this.

I'm not sure about using genome(gr) <- "hg19" for this. Right now
it sets the genome column of the seqinfo with the supplied string
and nothing else. Aren't there valid use cases for this? What about
using seqinfo(gr) <- "hg19" instead? It kind of suggests that the
whole seqinfo component actually gets filled.

H.

On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote:

that's kind of always been my goal...


Statistics is the grammar of science.
Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>

On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence
<lawrence.mich...@gene.com <mailto:lawrence.mich...@gene.com>> wrote:

    Maybe this could eventually support setting the seqinfo with:

    genome(gr) <- "hg19"

    Or is that being too clever?

    On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès <hpa...@fredhutch.org
    <mailto:hpa...@fredhutch.org>> wrote:
     > Hi,
     >
     > FWIW I started to work on supporting quick generation of a standalone
     > Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb.
     >
     > It already supports hg38, hg19, hg18, panTro4, panTro3, panTro2,
     > bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1, mm10,
     > mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4, galGal3,
     > gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, sacCer3,
     > and sacCer2. I'll add more.
     >
     > See ?Seqinfo for some examples.
     >
     > Right now it fetches the information from internet every time you
     > call it but maybe we should just store that information in the
     > GenomeInfoDb package as Tim suggested?
     >
     > H.
     >
     >
     > On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote:
     >>
     >> That would be perfect actually.  And it would radically reduce &
     >> modularize maintenance.  Maybe that's the best way to go after
    all.  Quite
     >> sensible.
     >>
     >> --t
     >>
     >>> On Jun 3, 2015, at 12:46 PM, Vincent Carey
    <st...@channing.harvard.edu <mailto:st...@channing.harvard.edu>>
     >>> wrote:
     >>>
     >>> It really isn't hard to have multiple OrganismDb packages in
    place -- the
     >>> process of making new ones is documented and was given as an
    exercise in
     >>> the EdX course.  I don't know if we want to institutionalize it and
     >>> distribute such -- I think we might, so that there would be
    Hs19, Hs38,
     >>> mm9, etc. packages.  They have very little content, they just
    coordinate
     >>> interactions with packages that you'll already have.
     >>>
     >>> On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr.
    <tim.tri...@gmail.com <mailto:tim.tri...@gmail.com>>
     >>> wrote:
     >>>
     >>>> Right, I typically do that too, and if you're working on human
    data it
     >>>> isn't a big deal.  What makes things a lot more of a drag is
    when you
     >>>> work
     >>>> on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38) where
     >>>> Mus.musculus
     >>>> is essentially a "build ahead" of Homo.sapiens.
     >>>>
     >>>> R> seqinfo(Homo.sapiens)
     >>>> Seqinfo object with 93 sequences (1 circular) from hg19 genome
     >>>>
     >>>> R> seqinfo(Mus.musculus)
     >>>> Seqinfo object with 66 sequences (1 circular) from mm10 genome:
     >>>>
     >>>> It's not as explicit as directly assigning the seqinfo from a
    genome
     >>>> that
     >>>> corresponds to that of your annotations/results/whatever.  I
    know we
     >>>> could
     >>>> all use crossmap or liftOver or whatever, but that's not
    really the
     >>>> same,
     >>>> and it takes time, whereas assigning the proper seqinfo for
     >>>> relationships
     >>>> is very fast.
     >>>>
     >>>> That's all I was getting at...
     >>>>
     >>>>
     >>>> Statistics is the grammar of science.
     >>>> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
     >>>>
     >>>> On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey
     >>>> <st...@channing.harvard.edu <mailto:st...@channing.harvard.edu>
     >>>>>
     >>>>> wrote:
     >>>>
     >>>>
     >>>>> I typically get this info from Homo.sapiens.  The result is
    parasitic
     >>>>> on
     >>>>> the TxDb that is in there.  I don't know how easy it is to swap
     >>>>> alternate
     >>>>> TxDb in to get a different build.  I think it would make sense to
     >>>>> regard
     >>>>> the OrganismDb instances as foundational for this sort of
    structural
     >>>>> data.
     >>>>>
     >>>>> On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen <
     >>>>> kasperdanielhan...@gmail.com
    <mailto:kasperdanielhan...@gmail.com>> wrote:
     >>>>>
     >>>>>> Let me rephrase this slightly.  From one POV the purpose of
     >>>>>> GenomeInfoDb
     >>>>>> is
     >>>>>> clean up the seqinfo slot.  Currently it does most of the
    cleaning,
     >>>>>> but
     >>>>>> it
     >>>>>> does not add seqlengths.
     >>>>>>
     >>>>>> It is clear that seqlengths depends on the version of the
    genome, but
     >>>>>> I
     >>>>>> will argue so does the seqnames.  Of course, for human,
    chr22 will not
     >>>>>> change.  But what about the names of all the random
    contigs?  Or for
     >>>>>> other
     >>>>>> organisms, what about going from a draft genome with 10k
    contigs to a
     >>>>>> more
     >>>>>> completely genome assembled into fewer, larger chromosomes.
     >>>>>>
     >>>>>> I acknowledge that this information is present in the BSgenome
     >>>>>> packages,
     >>>>>> but it seems (to me) to be very appropriate to have them
    around for
     >>>>>> cleaning up the seqinfo slot.  For some situations it is not
    great to
     >>>>>> depend on 1 GB> download for something that is a few bytes.
     >>>>>>
     >>>>>> Best,
     >>>>>> Kasper
     >>>>>>
     >>>>>> On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr.
    <tim.tri...@gmail.com <mailto:tim.tri...@gmail.com>>
     >>>>>> wrote:
     >>>>>>
     >>>>>>> It would be nice (for a number of reasons) to have
    chromosome lengths
     >>>>>>> readily available in a foundational package like
    GenomeInfoDb, so
     >>>>>>> that,
     >>>>>>> say,
     >>>>>>>
     >>>>>>> data(seqinfo.hg19)
     >>>>>>> seqinfo(myResults) <- seqinfo.hg19[ seqlevels(myResults) ]
     >>>>>>>
     >>>>>>> would work without issues.  Is there any particular reason this
     >>>>>>
     >>>>>> couldn't
     >>>>>>>
     >>>>>>> happen for the supported/available BSgenomes?  It would
    seem like a
     >>>>>>
     >>>>>> simple
     >>>>>>>
     >>>>>>> matter to do
     >>>>>>>
     >>>>>>> R> library(BSgenome.Hsapiens.UCSC.hg19)
     >>>>>>> R> seqinfo.hg19 <- seqinfo(Hsapiens)
     >>>>>>> R> save(seqinfo.hg19,
     >>>>>>> file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda")
     >>>>>>>
     >>>>>>> and be done with it until (say) the next release or next
    released
     >>>>>>> BSgenome.  I considered looping through the following BSgenomes
     >>>>>>
     >>>>>> myself...
     >>>>>>>
     >>>>>>> and if it isn't strongly opposed by (everyone) I may still
    do exactly
     >>>>>>> that.  Seems useful, no?
     >>>>>>>
     >>>>>>> e.g. for the following 42 builds,
     >>>>>>>
     >>>>>>> grep("(UCSC|NCBI)", unique(gsub(".masked", "",
    available.genomes())),
     >>>>>>> value=TRUE)
     >>>>>>> [1] "BSgenome.Amellifera.UCSC.apiMel2"
     >>>>>>
     >>>>>> "BSgenome.Btaurus.UCSC.bosTau3"
     >>>>>>>
     >>>>>>>
     >>>>>>> [3] "BSgenome.Btaurus.UCSC.bosTau4"
     >>>>>>
     >>>>>> "BSgenome.Btaurus.UCSC.bosTau6"
     >>>>>>>
     >>>>>>>
     >>>>>>> [5] "BSgenome.Btaurus.UCSC.bosTau8"
     >>>>>>> "BSgenome.Celegans.UCSC.ce10"
     >>>>>>>
     >>>>>>> [7] "BSgenome.Celegans.UCSC.ce2"
      "BSgenome.Celegans.UCSC.ce6"
     >>>>>>>
     >>>>>>> [9] "BSgenome.Cfamiliaris.UCSC.canFam2"
     >>>>>>> "BSgenome.Cfamiliaris.UCSC.canFam3"
     >>>>>>> [11] "BSgenome.Dmelanogaster.UCSC.dm2"
     >>>>>>> "BSgenome.Dmelanogaster.UCSC.dm3"
     >>>>>>> [13] "BSgenome.Dmelanogaster.UCSC.dm6"
     >>>>>>
     >>>>>> "BSgenome.Drerio.UCSC.danRer5"
     >>>>>>>
     >>>>>>>
     >>>>>>> [15] "BSgenome.Drerio.UCSC.danRer6"
     >>>>>>
     >>>>>> "BSgenome.Drerio.UCSC.danRer7"
     >>>>>>>
     >>>>>>>
     >>>>>>> [17] "BSgenome.Ecoli.NCBI.20080805"
     >>>>>>> "BSgenome.Gaculeatus.UCSC.gasAcu1"
     >>>>>>> [19] "BSgenome.Ggallus.UCSC.galGal3"
     >>>>>>
     >>>>>> "BSgenome.Ggallus.UCSC.galGal4"
     >>>>>>>
     >>>>>>>
     >>>>>>> [21] "BSgenome.Hsapiens.NCBI.GRCh38"
     >>>>>>> "BSgenome.Hsapiens.UCSC.hg17"
     >>>>>>>
     >>>>>>> [23] "BSgenome.Hsapiens.UCSC.hg18"
     >>>>>>> "BSgenome.Hsapiens.UCSC.hg19"
     >>>>>>>
     >>>>>>> [25] "BSgenome.Hsapiens.UCSC.hg38"
     >>>>>>> "BSgenome.Mfascicularis.NCBI.5.0"
     >>>>>>> [27] "BSgenome.Mfuro.UCSC.musFur1"
     >>>>>>
     >>>>>> "BSgenome.Mmulatta.UCSC.rheMac2"
     >>>>>>>
     >>>>>>>
     >>>>>>> [29] "BSgenome.Mmulatta.UCSC.rheMac3"
     >>>>>>
     >>>>>> "BSgenome.Mmusculus.UCSC.mm10"
     >>>>>>>
     >>>>>>>
     >>>>>>> [31] "BSgenome.Mmusculus.UCSC.mm8"
     >>>>>>> "BSgenome.Mmusculus.UCSC.mm9"
     >>>>>>>
     >>>>>>> [33] "BSgenome.Ptroglodytes.UCSC.panTro2"
     >>>>>>> "BSgenome.Ptroglodytes.UCSC.panTro3"
     >>>>>>> [35] "BSgenome.Rnorvegicus.UCSC.rn4"
     >>>>>>
     >>>>>> "BSgenome.Rnorvegicus.UCSC.rn5"
     >>>>>>>
     >>>>>>>
     >>>>>>> [37] "BSgenome.Rnorvegicus.UCSC.rn6"
     >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer1"
     >>>>>>> [39] "BSgenome.Scerevisiae.UCSC.sacCer2"
     >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer3"
     >>>>>>> [41] "BSgenome.Sscrofa.UCSC.susScr3"
     >>>>>>
     >>>>>> "BSgenome.Tguttata.UCSC.taeGut1"
     >>>>>>>
     >>>>>>>
     >>>>>>>
     >>>>>>>
     >>>>>>>
     >>>>>>> Am I insane for suggesting this?  It would make things a little
     >>>>>>> easier
     >>>>>>
     >>>>>> for
     >>>>>>>
     >>>>>>> rtracklayer, most SummarizedExperiment and SE-derived
    objects, blah,
     >>>>>>
     >>>>>> blah,
     >>>>>>>
     >>>>>>> blah...
     >>>>>>>
     >>>>>>>
     >>>>>>> Best,
     >>>>>>>
     >>>>>>> --t
     >>>>>>>
     >>>>>>>
     >>>>>>>
     >>>>>>>
     >>>>>>> Statistics is the grammar of science.
     >>>>>>> Karl Pearson
    <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
     >>>>>>
     >>>>>>
     >>>>>>         [[alternative HTML version deleted]]
     >>>>>>
     >>>>>> _______________________________________________
     >>>>>> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
    mailing list
     >>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
     >>>
     >>>
     >>>     [[alternative HTML version deleted]]
     >>>
     >>> _______________________________________________
     >>> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
    mailing list
     >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
     >>
     >>
     >> _______________________________________________
     >> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
    mailing list
     >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
     >>
     >
     > --
     > Hervé Pagès
     >
     > Program in Computational Biology
     > Division of Public Health Sciences
     > Fred Hutchinson Cancer Research Center
     > 1100 Fairview Ave. N, M1-B514
     > P.O. Box 19024
     > Seattle, WA 98109-1024
     >
     > E-mail: hpa...@fredhutch.org <mailto:hpa...@fredhutch.org>
     > Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
     > Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
     >
     >
     > _______________________________________________
     > Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
    mailing list
     > https://stat.ethz.ch/mailman/listinfo/bioc-devel


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

Reply via email to