On Wed, Oct 22, 2014 at 6:59 PM, Leonardo Collado Torres <lcoll...@jhu.edu> wrote: > Hi Sonali, > > The recent changes in GenomeInfoDb (1.3.3) look great! > > > > On Wed, Oct 22, 2014 at 1:53 PM, Sonali Arora <sar...@fhcrc.org> wrote: >> >> Hi Leo, >> >> To summarize, These are the concerns raised in your previous email : >> >> a) mapSeqlevels() should return current style along with other mapped styles? >> look at extendedMapSeqlevels() >> >> I am looking into this. I have a few thoughts/ questions for you. >> >> The main idea of mapSeqlevels() is to map the current seqlevelStyle to other >> seqlevelStyles. >> > > Totally! > >> "I was also expecting mapSeqlevels() to return the same input if the >> specified 'style' was the same as the one currently being used." >> >> >> Returning the existing seqlevelStyle along with the mapped Seqlevel style >> sounds a >> little redundant. (given that the user is already supplying it while calling >> mapSeqlevels() >> so he already knows it!) >> >> "For >> example, if I was working with Homo sapiens NCBI style and attempted >> to map to NCBI, I was expecting the same output as the input." >> >> >> why would you want to map it back to NCBI ? Please explain your use case more >> concretely. > > I agree with you that my example doesn't make sense if a user is > manually calling mapSeqlevels(). > > In `derfinder`, several functions split the work by sequence name > (chromosomes). Others take two GRanges objects and match them. Others > create GRanges objects. Others load data from BAM or BigWig files. > > In my use case, I wanted to make sure things would work even if the > user didn't make sure that (in the 2 GRanges case) that the objects > had the same sequence naming style. That is why the user doesn't > necessarily specify the `style` and I was using GenomeInfoDb to guess > it via `seqlevelsStyle()` as well as mapping the sequence names to the > default `style` to make sure the playing field was the same before > finding overlaps, etc via `mapSeqlevels()`. > > This means that sometimes I am working with a single sequence name. > Hence the `seqlevelsStyle('2')` example which is a different scenario > from working with multiple sequence names. > >> seqlevelsStyle('2') > warning! Multiple seqnameStyles found. > [1] "NCBI" "TAIR10" "MSU6" "JGI2" "AGPvF" >> seqlevelsStyle(as.character(1:13)) > [1] "NCBI" > > I'm aware that I could simply dodge using mapSeqlevels() when the > default `style` is the same as the `style` currently used in the input > as shown below: > >> foo <- function(seqnames, style = 'UCSC') { > + if(seqlevelsStyle(seqnames) != style) { > + res <- mapSeqlevels(seqnames, style) > + } else { > + res <- seqnames > + } > + return(res) > + } >> > ## Works with multiple seqnames >> foo(as.character(1:22)) > 1 2 3 4 5 6 7 8 > "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" > 9 10 11 12 13 14 15 16 > "chr9" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" > 17 18 19 20 21 22 > "chr17" "chr18" "chr19" "chr20" "chr21" "chr22" > > ## You also have to guess the species in using a small number of sequence > names > ## Or use only the first result: seqlevelsStyle(seqnames)[1] >> foo('2') > warning! Multiple seqnameStyles found. > [1] "chr2" > Warning message: > In if (seqlevelsStyle(seqnames) != style) { : > the condition has length > 1 and only the first element will be used >> > > The second part (if you can call it like that) of my use case is that > I wanted to use the same code whether or not the sequence names / > organism supplied are not supported by GenomeInfoDb. That is, I wanted > to be flexible enough to use all the power from `GenomeInfoDb` when > there's enough information, and simply ignore the step of renaming > sequence styles if the information is missing. > `extendedMapSeqlevels()` can do this. > > Alternatively, I could also use some tryCatch() calls and if I get an > error (as illustrated below) then simply avoid using `mapSeqlevels()`. > > ## Could also be '39' instead of 'seq1' if for some reason someone is > working with a genome that has 39 chromosomes. > ## Currently the supported organism in GenomeInfoDb with the highest > number is Canis familiaris with 38 autosomal chrs. >> seqlevelsStyle('seq1') > Error in seqlevelsStyle("seq1") : > The style does not have a compatible entry for the species > supported by Seqname. Please see genomeStyles() for > supported species/style > > ## Modified foo() which can work with unsupported genomes/styles > >> zoo <- function(seqnames, style = 'UCSC') { > + currentStyle <- tryCatch(seqlevelsStyle(seqnames), error = > function(e) { 'error' }) > + if(currentStyle[1] == 'error') { > + return(seqnames) > + } else { > + ## Or use currentStyle[1] if you don't want the warning in zoo('2') > + if(currentStyle != style) { > + res <- mapSeqlevels(seqnames, style) > + } else { > + res<- seqnames > + } > + } > + return(seqnames) > + } >> zoo('seq1') > [1] "seq1" > > `extendedMapSeqlevels()` does more than this by guessing the current > style only for valid species. If the species is not supplied, it > guesses it as well. > > `zoo()` above can be modified to take a `species` argument and/or > guess it by using `.guessSpeciesStyle()`. > >> >> I'd be happy to make changes to mapSeqlevels() and look at your function >> more deeply, >> once I understand the use case better. Thanks for clarifying this. > > No problem =) Thanks for digesting my email! > >> >> b) Do you think it would be helpful to you and other developers of I export >> .guessSpeciesStyle() and .supportedSeqnameMappings() from GenomeInfoDb ? > > Say if `mapSeqlevels()` had a `species` argument and > `guessSpeciesStyle()` was exported I could simplify > `extendedMapSeqlevels()` [maybe a different name for it would be > better] for my use case. > > I think that guessing the correct `style` without the `species` > information is going to get harder as the number of supported > organisms increases in `GenomeInfoDb`. Hence why I like > `guessSpeciesStyle()` more than `seqlevelsStyle()`.
I'm guessing that the main use of `seqlevelsStyle(x)` is when `x` is a GRanges or GRangesList object that has the organism information. Then the `species` can be inferred without error and making it much more straight forward to guess the naming style used. > > Basically, I would use `guessSpeciesStyle()` to restrict the valid > `style` guesses to those corresponding to the `species` of interest, > or guess everything if nothing is supplied. Then with `mapSeqlevels()` > I could specify which mapping to use. Or alternatively use > `supportedSeqnameMappings()` if you think that `mapSeqlevels()` > shouldn't have a species argument. > > Anyhow, I don't think that it's critical that you export these > functions. I don't think that there is a problem with having the > single NOTE in R CMD check. > >> >> c) .guessSpeciesStyle('2') should return all possible hits for "seqnames" >> >> This was a great find! I have fixed it in version 1.3.2 - Thanks for >> figuring this out ! >> Does the following style of output make sense to you ? If multiple hits are >> found >> there are returned in "species" and "style" respectively. >> >> > .guessSpeciesStyle(c(paste0("chr",1:10))) >> $species >> [1] "Homo sapiens" "Mus musculus" >> >> $style >> [1] "UCSC" "UCSC" >> >> > .guessSpeciesStyle(c(paste0("chr",1:22))) >> $species >> [1] "Homo sapiens" >> >> $style >> [1] "UCSC" >> >> > .guessSpeciesStyle("chr2") >> $species >> [1] "Homo sapiens" "Mus musculus" >> >> $style >> [1] "UCSC" "UCSC" >> >> > .guessSpeciesStyle("2") >> $species >> [1] "Arabidopsis thaliana" "Arabidopsis thaliana" "Cyanidioschyzon >> merolae" >> [4] "Homo sapiens" "Mus musculus" "Oryza sativa" >> [7] "Oryza sativa" "Populus trichocarpa" "Zea mays" >> [10] "Zea mays" >> >> $style >> [1] "NCBI" "TAIR10" "NCBI" "NCBI" "NCBI" "NCBI" "MSU6" "JGI2" >> "NCBI" "AGPvF" >> > > Awesome! I think that these changes are great! I'll shortly modify my > code to use this version. > >> SeqlevelsStyle also returns multiple styles now (if it cant guess the >> correct one!) >> > > =) > >> > seqnames <- c(paste0("chr",1:22)) >> > seqlevelsStyle(seqnames) >> [1] "UCSC" >> >> > seqnames <- "2" >> > seqlevelsStyle(seqnames) >> warning! Multiple seqnameStyles found. >> [1] "NCBI" "TAIR10" "MSU6" "JGI2" "AGPvF" >> >> Thanks again for your feedback, >> Sonali. >> >> > > Let me know if you need anything else! > > Cheers, > Leo > >> >> >> >> >> On 10/21/2014 8:47 PM, Maintainer wrote: >> >> Hello, >> >> In my `derfinder` package I have been relying on `GenomeInfoDb` to >> correctly name the sequence names. Given >> https://support.bioconductor.org/p/62136/ I now realize that I was >> expecting too much from `GenomeInfoDb`. >> >> For example, in some cases I was expecting it to guess that a single >> '2' meant that the naming style was NCBI. This is true for Homo >> sapiens but now I realize that it's not necessary true for Arabidopsis >> thaliana. >> >> GenomeInfoDb:::.guessSpeciesStyle('2') >> >> [1] "Arabidopsis thaliana" "NCBI" >> >> ## Could have been Homo sapiens (NCBI) or Arabidopbis thaliana TAIR10 or NCBI >> >> GenomeInfoDb:::.supportedSeqnameMappings()[['Homo_sapiens']][2, ] >> >> NCBI UCSC dbSNP >> 2 2 chr2 ch2 >> >> GenomeInfoDb:::.supportedSeqnameMappings()[['Arabidopsis_thaliana']][2, ] >> >> NCBI TAIR10 >> 2 2 2 >> >> I was also expecting mapSeqlevels() to return the same input if the >> specified 'style' was the same as the one currently being used. For >> example, if I was working with Homo sapiens NCBI style and attempted >> to map to NCBI, I was expecting the same output as the input. >> >> ## More than 1 output >> >> mapSeqlevels('2', 'NCBI') >> >> 2 >> [1,] "2" >> [2,] "LGII" >> ## Turns out there is another organism that would make this mapping valid >> >> GenomeInfoDb:::.supportedSeqnameMappings()[['Populus_trichocarpa']][2, ] >> >> NCBI JGI2.F >> 2 LGII 2 >> >> >> I also noticed that `GenomeInfoDb` couldn't work with some fictional >> examples. For example: >> >> ## Expected error, but this meant that `derfinder` couldn't use toy >> examples from other >> ## packages. >> >> seqlevelsStyle('seq1') >> >> Error in seqlevelsStyle("seq1") : >> The style does not have a compatible entry for the species supported >> by Seqname. Please see genomeStyles() for supported >> species/style >> >> ## Using toy data from Rsamtools fails as expected >> >> fl <- system.file("extdata", "ex1.bam", package="Rsamtools", >> >> + mustWork=TRUE) >> >> library(GenomicAlignments) >> names(scanBamHeader(BamFile(fl))$targets) >> >> [1] "seq1" "seq2" >> >> seqlevelsStyle(names(scanBamHeader(BamFile(fl))$targets)) >> >> Error in seqlevelsStyle(names(scanBamHeader(BamFile(fl))$targets)) : >> The style does not have a compatible entry for the species supported >> by Seqname. Please see genomeStyles() for supported >> species/style >> >> >> So, I wanted a way to use `GenomeInfoDb` to make the correct naming >> style maps for supported genomes, and simply return the original >> sequence names if using an unsupported species/mapping or if the >> original and target styles are the same. >> >> >> That's what I basically did when I wrote 'extendedMapSeqlevels' (aye, >> long name) which you can see at >> https://gist.github.com/3e6f0e2e2bb67ee910da If you think that others >> might be interested in such a function, then by all means please add >> it to `GenomeInfoDb` and I'll import it. You'll notice how it tries to >> guess the species and/or current naming style when that information is >> not provided. When it's used with verbose = TRUE it will show a >> message explaining why the mapping failed (if it did), it'll show the >> link to the `GenomeInfoDb` vignette for adding an organism, and then >> return the original sequence names. >> >> I have a slightly modified version in derfinder >> https://github.com/lcolladotor/derfinder/blob/master/R/extendedMapSeqlevels.R >> or >> https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/derfinder/R/extendedMapSeqlevels.R >> where I simply assume some other default values and guarantee >> continuity with how I was doing things previously. >> >> Naturally, R CMD check now shows a NOTE because I'm using the ::: >> syntax for un-exported functions from `GenomeInfoDb`. >> >> Cheers, >> Leo >> >> >> Leonardo Collado Torres, PhD Candidate >> Department of Biostatistics >> Johns Hopkins University >> Bloomberg School of Public Health >> Website: http://www.biostat.jhsph.edu/~lcollado/ >> Blog: http://lcolladotor.github.io/ >> >> ________________________________________________________________________ >> devteam-bioc mailing list >> To unsubscribe from this mailing list send a blank email to >> devteam-bioc-le...@lists.fhcrc.org >> You can also unsubscribe or change your personal options at >> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc >> >> >> -- >> Thanks and Regards, >> Sonali >> _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel