Statistics is the grammar of science.
Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
On Fri, Jun 5, 2015 at 1:36 PM, Hervé Pagès <hpa...@fredhutch.org
<mailto:hpa...@fredhutch.org>> wrote:
On 06/05/2015 01:19 PM, Michael Lawrence wrote:
To support the multi-genome case, one could set the genome as a
vector, one value for each seqname, and it would fix the
style/seqlength per seqname. It could sort by the combination of
seqname and species. Presumably it would do nothing for unknown
genomes.
But I agree that a standardizeSeqinfo() that amounts to
"genome(x) <-
genome(x)" would make sense.
I don't think people sort too often by seqnames (except to the
"natural" ordering),
That's what sort(), order(), rank() do by default: they sort by seqnames
first, then by start, then by end, and finally by strand.
but I could be wrong. I do sympathize though with
the need for a low-level accessor. At least one would want a
parameter
for disabling the standardization.
Ok. So the candidates are:
(a) standardizeSeqinfo(gr) <- "hg19"
(b) gr <- standardizeSeqinfo(gr, "hg19")
(c) standardizeGenome(gr) <- "hg19"
(d) gr <- standardizeGenome(gr, "hg19")
(e) seqinfo(gr) <- "hg19"
Is there a risk of confusion with keepStandardChromosomes where
"standard" means a very different thing? I'll add 2 more:
(f) normalizeSeqinfo(gr) <- "hg19"
(g) gr <- normalizeSeqinfo(gr, "hg19")
Anyway, we're not here yet. As pointed in an earlier post, there are
still some missing pieces to complete the puzzle.
Thanks,
H.
On Fri, Jun 5, 2015 at 12:54 PM, Kasper Daniel Hansen
<kasperdanielhan...@gmail.com
<mailto:kasperdanielhan...@gmail.com>> wrote:
In WGBS we frequently sequence a human with spikein from the
lambda genome.
In this case, most of the chromosomes of the Granges are
from human, except
one. This is a usecase where genome(GR) is not constant. I
suggest, partly
for compatibility, to keep genome, but perhaps do something like
standardizeGenome()
or something like this.
I would indeed love, love, love a function which just cleans
it up.
Kasper
On Fri, Jun 5, 2015 at 2:51 PM, Gabe Becker
<becker.g...@gene.com <mailto:becker.g...@gene.com>> wrote:
Herve,
This is probably a naive question, but what usecases are
there for
creating
an object with the wrong seqinfo for its genome?
~G
On Fri, Jun 5, 2015 at 11:43 AM, Michael Lawrence
<lawrence.mich...@gene.com
<mailto:lawrence.mich...@gene.com>
wrote:
On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès
<hpa...@fredhutch.org <mailto:hpa...@fredhutch.org>>
wrote:
I also think that we're heading towards
something like that.
So genome(gr) <- "hg19" would:
(a) Add any missing information to the seqinfo.
(b) Sort the seqlevels in "canonical" order.
(c) Change the seqlevels style to UCSC style
if they are not.
The 3 tasks are orthogonal. I guess most of the
times people want
an easy way to perform them all at once.
We could easily support (a) and (b). This
assumes that the current
seqlevels are already valid hg19 seqlevels:
si1 <- Seqinfo(c("chrX", "chrUn_gl000249",
"chr2", "chr6_cox_hap2"))
gr1 <- GRanges(seqinfo=si1)
hg19_si <- Seqinfo(genome="hg19")
## (a):
seqinfo(gr1) <- merge(seqinfo(gr1),
hg19_si)[seqlevels(gr1)]
seqinfo(gr1)
# Seqinfo object with 4 sequences (1
circular) from hg19 genome:
# seqnames seqlengths isCircular genome
# chrX 155270560 FALSE hg19
# chrUn_gl000249 38502 FALSE hg19
# chr2 243199373 FALSE hg19
# chr6_cox_hap2 4795371 FALSE hg19
## (b):
seqlevels(gr1) <-
intersect(seqlevels(hg19_si), seqlevels(gr1))
seqinfo(gr1)
# Seqinfo object with 4 sequences (1
circular) from hg19 genome:
# seqnames seqlengths isCircular genome
# chr2 243199373 FALSE hg19
# chrX 155270560 FALSE hg19
# chr6_cox_hap2 4795371 FALSE hg19
# chrUn_gl000249 38502 FALSE hg19
(c) is harder because seqlevelsStyle() doesn't
know how to rename
scaffolds yet:
si2 <- Seqinfo(c("X",
"HSCHRUN_RANDOM_CTG42", "2",
"HSCHR6_MHC_COX_CTG1"))
gr2 <- GRanges(seqinfo=si2)
seqlevelsStyle(gr2)
# [1] "NCBI"
seqlevelsStyle(gr2) <- "UCSC"
seqlevels(gr2)
# [1] "chrX"
"HSCHRUN_RANDOM_CTG42" "chr2"
# [4] "HSCHR6_MHC_COX_CTG1"
So we need to work on this.
I'm not sure about using genome(gr) <- "hg19"
for this. Right now
it sets the genome column of the seqinfo with
the supplied string
and nothing else. Aren't there valid use cases
for this?
Not sure. People would almost always want the
seqname style and order
to be consistent with the given genome.
What about
using seqinfo(gr) <- "hg19" instead? It kind of
suggests that the
whole seqinfo component actually gets filled.
Yea, but "genome" is so intuitive compared to "seqinfo".
H.
On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote:
that's kind of always been my goal...
Statistics is the grammar of science.
Karl Pearson
<http://en.wikipedia.org/wiki/The_Grammar_of_Science>
On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence
<lawrence.mich...@gene.com
<mailto:lawrence.mich...@gene.com>
<mailto:lawrence.mich...@gene.com
<mailto:lawrence.mich...@gene.com>>> wrote:
Maybe this could eventually support
setting the seqinfo with:
genome(gr) <- "hg19"
Or is that being too clever?
On Thu, Jun 4, 2015 at 4:28 PM, Hervé
Pagès <hpa...@fredhutch.org
<mailto:hpa...@fredhutch.org>
<mailto:hpa...@fredhutch.org
<mailto:hpa...@fredhutch.org>>> wrote:
> Hi,
>
> FWIW I started to work on
supporting quick generation of a
standalone
> Seqinfo object via
Seqinfo(genome="hg38") in GenomeInfoDb.
>
> It already supports hg38, hg19,
hg18, panTro4, panTro3,
panTro2,
> bosTau8, bosTau7, bosTau6, canFam3,
canFam2, canFam1, musFur1,
mm10,
> mm9, mm8, susScr3, susScr2, rn6,
rheMac3, rheMac2, galGal4,
galGal3,
> gasAcu1, danRer7, apiMel2, dm6,
dm3, ce10, ce6, ce4, ce2,
sacCer3,
> and sacCer2. I'll add more.
>
> See ?Seqinfo for some examples.
>
> Right now it fetches the
information from internet every time
you
> call it but maybe we should just
store that information in the
> GenomeInfoDb package as Tim suggested?
>
> H.
>
>
> On 06/03/2015 12:54 PM, Tim Triche,
Jr. wrote:
>>
>> That would be perfect actually.
And it would radically
reduce &
>> modularize maintenance. Maybe
that's the best way to go
after
all. Quite
>> sensible.
>>
>> --t
>>
>>> On Jun 3, 2015, at 12:46 PM,
Vincent Carey
<st...@channing.harvard.edu
<mailto:st...@channing.harvard.edu>
<mailto:st...@channing.harvard.edu
<mailto:st...@channing.harvard.edu>>>
>>> wrote:
>>>
>>> It really isn't hard to have
multiple OrganismDb packages in
place -- the
>>> process of making new ones is
documented and was given as an
exercise in
>>> the EdX course. I don't know if
we want to institutionalize
it
and
>>> distribute such -- I think we
might, so that there would be
Hs19, Hs38,
>>> mm9, etc. packages. They have
very little content, they
just
coordinate
>>> interactions with packages that
you'll already have.
>>>
>>> On Wed, Jun 3, 2015 at 3:26 PM,
Tim Triche, Jr.
<tim.tri...@gmail.com
<mailto:tim.tri...@gmail.com>
<mailto:tim.tri...@gmail.com
<mailto:tim.tri...@gmail.com>>>
>>> wrote:
>>>
>>>> Right, I typically do that too,
and if you're working on
human
data it
>>>> isn't a big deal. What makes
things a lot more of a drag
is
when you
>>>> work
>>>> on e.g. mouse data (mm9 vs mm10,
aka GRCm37 vs GRCm38)
where
>>>> Mus.musculus
>>>> is essentially a "build ahead"
of Homo.sapiens.
>>>>
>>>> R> seqinfo(Homo.sapiens)
>>>> Seqinfo object with 93 sequences
(1 circular) from hg19
genome
>>>>
>>>> R> seqinfo(Mus.musculus)
>>>> Seqinfo object with 66 sequences
(1 circular) from mm10
genome:
>>>>
>>>> It's not as explicit as directly
assigning the seqinfo from
a
genome
>>>> that
>>>> corresponds to that of your
annotations/results/whatever.
I
know we
>>>> could
>>>> all use crossmap or liftOver or
whatever, but that's not
really the
>>>> same,
>>>> and it takes time, whereas
assigning the proper seqinfo for
>>>> relationships
>>>> is very fast.
>>>>
>>>> That's all I was getting at...
>>>>
>>>>
>>>> Statistics is the grammar of
science.
>>>> Karl Pearson
<http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>>>>
>>>> On Wed, Jun 3, 2015 at 12:17 PM,
Vincent Carey
>>>> <st...@channing.harvard.edu
<mailto:st...@channing.harvard.edu> <mailto:
st...@channing.harvard.edu
<mailto:st...@channing.harvard.edu>>
>>>>>
>>>>> wrote:
>>>>
>>>>
>>>>> I typically get this info from
Homo.sapiens. The result
is
parasitic
>>>>> on
>>>>> the TxDb that is in there. I
don't know how easy it is to
swap
>>>>> alternate
>>>>> TxDb in to get a different
build. I think it would make
sense
to
>>>>> regard
>>>>> the OrganismDb instances as
foundational for this sort of
structural
>>>>> data.
>>>>>
>>>>> On Wed, Jun 3, 2015 at 3:12 PM,
Kasper Daniel Hansen <
>>>>> kasperdanielhan...@gmail.com
<mailto:kasperdanielhan...@gmail.com>
<mailto:kasperdanielhan...@gmail.com
<mailto:kasperdanielhan...@gmail.com>>> wrote:
>>>>>
>>>>>> Let me rephrase this
slightly. From one POV the purpose
of
>>>>>> GenomeInfoDb
>>>>>> is
>>>>>> clean up the seqinfo slot.
Currently it does most of the
cleaning,
>>>>>> but
>>>>>> it
>>>>>> does not add seqlengths.
>>>>>>
>>>>>> It is clear that seqlengths
depends on the version of the
genome, but
>>>>>> I
>>>>>> will argue so does the
seqnames. Of course, for human,
chr22 will not
>>>>>> change. But what about the
names of all the random
contigs? Or for
>>>>>> other
>>>>>> organisms, what about going
from a draft genome with 10k
contigs to a
>>>>>> more
>>>>>> completely genome assembled
into fewer, larger
chromosomes.
>>>>>>
>>>>>> I acknowledge that this
information is present in the
BSgenome
>>>>>> packages,
>>>>>> but it seems (to me) to be
very appropriate to have them
around for
>>>>>> cleaning up the seqinfo slot.
For some situations it is
not
great to
>>>>>> depend on 1 GB> download for
something that is a few
bytes.
>>>>>>
>>>>>> Best,
>>>>>> Kasper
>>>>>>
>>>>>> On Wed, Jun 3, 2015 at 3:00
PM, Tim Triche, Jr.
<tim.tri...@gmail.com
<mailto:tim.tri...@gmail.com>
<mailto:tim.tri...@gmail.com
<mailto:tim.tri...@gmail.com>>>
>>>>>> wrote:
>>>>>>
>>>>>>> It would be nice (for a
number of reasons) to have
chromosome lengths
>>>>>>> readily available in a
foundational package like
GenomeInfoDb, so
>>>>>>> that,
>>>>>>> say,
>>>>>>>
>>>>>>> data(seqinfo.hg19)
>>>>>>> seqinfo(myResults) <-
seqinfo.hg19[ seqlevels(myResults)
]
>>>>>>>
>>>>>>> would work without issues.
Is there any particular
reason
this
>>>>>>
>>>>>> couldn't
>>>>>>>
>>>>>>> happen for the
supported/available BSgenomes? It would
seem like a
>>>>>>
>>>>>> simple
>>>>>>>
>>>>>>> matter to do
>>>>>>>
>>>>>>> R>
library(BSgenome.Hsapiens.UCSC.hg19)
>>>>>>> R> seqinfo.hg19 <-
seqinfo(Hsapiens)
>>>>>>> R> save(seqinfo.hg19,
>>>>>>>
file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda")
>>>>>>>
>>>>>>> and be done with it until
(say) the next release or next
released
>>>>>>> BSgenome. I considered
looping through the following
BSgenomes
>>>>>>
>>>>>> myself...
>>>>>>>
>>>>>>> and if it isn't strongly
opposed by (everyone) I may
still
do exactly
>>>>>>> that. Seems useful, no?
>>>>>>>
>>>>>>> e.g. for the following 42 builds,
>>>>>>>
>>>>>>> grep("(UCSC|NCBI)",
unique(gsub(".masked", "",
available.genomes())),
>>>>>>> value=TRUE)
>>>>>>> [1]
"BSgenome.Amellifera.UCSC.apiMel2"
>>>>>>
>>>>>> "BSgenome.Btaurus.UCSC.bosTau3"
>>>>>>>
>>>>>>>
>>>>>>> [3]
"BSgenome.Btaurus.UCSC.bosTau4"
>>>>>>
>>>>>> "BSgenome.Btaurus.UCSC.bosTau6"
>>>>>>>
>>>>>>>
>>>>>>> [5]
"BSgenome.Btaurus.UCSC.bosTau8"
>>>>>>> "BSgenome.Celegans.UCSC.ce10"
>>>>>>>
>>>>>>> [7] "BSgenome.Celegans.UCSC.ce2"
"BSgenome.Celegans.UCSC.ce6"
>>>>>>>
>>>>>>> [9]
"BSgenome.Cfamiliaris.UCSC.canFam2"
>>>>>>>
"BSgenome.Cfamiliaris.UCSC.canFam3"
>>>>>>> [11]
"BSgenome.Dmelanogaster.UCSC.dm2"
>>>>>>> "BSgenome.Dmelanogaster.UCSC.dm3"
>>>>>>> [13]
"BSgenome.Dmelanogaster.UCSC.dm6"
>>>>>>
>>>>>> "BSgenome.Drerio.UCSC.danRer5"
>>>>>>>
>>>>>>>
>>>>>>> [15]
"BSgenome.Drerio.UCSC.danRer6"
>>>>>>
>>>>>> "BSgenome.Drerio.UCSC.danRer7"
>>>>>>>
>>>>>>>
>>>>>>> [17]
"BSgenome.Ecoli.NCBI.20080805"
>>>>>>>
"BSgenome.Gaculeatus.UCSC.gasAcu1"
>>>>>>> [19]
"BSgenome.Ggallus.UCSC.galGal3"
>>>>>>
>>>>>> "BSgenome.Ggallus.UCSC.galGal4"
>>>>>>>
>>>>>>>
>>>>>>> [21]
"BSgenome.Hsapiens.NCBI.GRCh38"
>>>>>>> "BSgenome.Hsapiens.UCSC.hg17"
>>>>>>>
>>>>>>> [23]
"BSgenome.Hsapiens.UCSC.hg18"
>>>>>>> "BSgenome.Hsapiens.UCSC.hg19"
>>>>>>>
>>>>>>> [25]
"BSgenome.Hsapiens.UCSC.hg38"
>>>>>>> "BSgenome.Mfascicularis.NCBI.5.0"
>>>>>>> [27]
"BSgenome.Mfuro.UCSC.musFur1"
>>>>>>
>>>>>> "BSgenome.Mmulatta.UCSC.rheMac2"
>>>>>>>
>>>>>>>
>>>>>>> [29]
"BSgenome.Mmulatta.UCSC.rheMac3"
>>>>>>
>>>>>> "BSgenome.Mmusculus.UCSC.mm10"
>>>>>>>
>>>>>>>
>>>>>>> [31]
"BSgenome.Mmusculus.UCSC.mm8"
>>>>>>> "BSgenome.Mmusculus.UCSC.mm9"
>>>>>>>
>>>>>>> [33]
"BSgenome.Ptroglodytes.UCSC.panTro2"
>>>>>>>
"BSgenome.Ptroglodytes.UCSC.panTro3"
>>>>>>> [35]
"BSgenome.Rnorvegicus.UCSC.rn4"
>>>>>>
>>>>>> "BSgenome.Rnorvegicus.UCSC.rn5"
>>>>>>>
>>>>>>>
>>>>>>> [37]
"BSgenome.Rnorvegicus.UCSC.rn6"
>>>>>>>
"BSgenome.Scerevisiae.UCSC.sacCer1"
>>>>>>> [39]
"BSgenome.Scerevisiae.UCSC.sacCer2"
>>>>>>>
"BSgenome.Scerevisiae.UCSC.sacCer3"
>>>>>>> [41]
"BSgenome.Sscrofa.UCSC.susScr3"
>>>>>>
>>>>>> "BSgenome.Tguttata.UCSC.taeGut1"
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Am I insane for suggesting
this? It would make things a
little
>>>>>>> easier
>>>>>>
>>>>>> for
>>>>>>>
>>>>>>> rtracklayer, most
SummarizedExperiment and SE-derived
objects, blah,
>>>>>>
>>>>>> blah,
>>>>>>>
>>>>>>> blah...
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> --t
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Statistics is the grammar of
science.
>>>>>>> Karl Pearson
<http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>>>>>>
>>>>>>
>>>>>> [[alternative HTML
version deleted]]
>>>>>>
>>>>>>
_______________________________________________
>>>>>> Bioc-devel@r-project.org
<mailto:Bioc-devel@r-project.org>
<mailto:Bioc-devel@r-project.org
<mailto:Bioc-devel@r-project.org>>
mailing list
>>>>>>
https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>>> [[alternative HTML version
deleted]]
>>>
>>>
_______________________________________________
>>> Bioc-devel@r-project.org
<mailto:Bioc-devel@r-project.org>
<mailto:Bioc-devel@r-project.org
<mailto:Bioc-devel@r-project.org>>
mailing list
>>>
https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>>
_______________________________________________
>> Bioc-devel@r-project.org
<mailto:Bioc-devel@r-project.org>
<mailto:Bioc-devel@r-project.org
<mailto:Bioc-devel@r-project.org>>
mailing list
>>
https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpa...@fredhutch.org
<mailto:hpa...@fredhutch.org>
<mailto:hpa...@fredhutch.org
<mailto:hpa...@fredhutch.org>>
> Phone: (206) 667-5791
<tel:%28206%29%20667-5791>
<tel:%28206%29%20667-5791>
> Fax: (206) 667-1319
<tel:%28206%29%20667-1319>
<tel:%28206%29%20667-1319>
>
>
>
_______________________________________________
> Bioc-devel@r-project.org
<mailto:Bioc-devel@r-project.org>
<mailto:Bioc-devel@r-project.org
<mailto:Bioc-devel@r-project.org>>
mailing list
>
https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpa...@fredhutch.org
<mailto:hpa...@fredhutch.org>
Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
_______________________________________________
Bioc-devel@r-project.org
<mailto:Bioc-devel@r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Gabriel Becker, Ph.D
Computational Biologist
Genentech Research
[[alternative HTML version deleted]]
_______________________________________________
Bioc-devel@r-project.org
<mailto:Bioc-devel@r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________
Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpa...@fredhutch.org <mailto:hpa...@fredhutch.org>
Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
_______________________________________________
Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel