Hello, In my `derfinder` package I have been relying on `GenomeInfoDb` to correctly name the sequence names. Given https://support.bioconductor.org/p/62136/ I now realize that I was expecting too much from `GenomeInfoDb`.
For example, in some cases I was expecting it to guess that a single '2' meant that the naming style was NCBI. This is true for Homo sapiens but now I realize that it's not necessary true for Arabidopsis thaliana. > GenomeInfoDb:::.guessSpeciesStyle('2') [1] "Arabidopsis thaliana" "NCBI" ## Could have been Homo sapiens (NCBI) or Arabidopbis thaliana TAIR10 or NCBI > GenomeInfoDb:::.supportedSeqnameMappings()[['Homo_sapiens']][2, ] NCBI UCSC dbSNP 2 2 chr2 ch2 > GenomeInfoDb:::.supportedSeqnameMappings()[['Arabidopsis_thaliana']][2, ] NCBI TAIR10 2 2 2 I was also expecting mapSeqlevels() to return the same input if the specified 'style' was the same as the one currently being used. For example, if I was working with Homo sapiens NCBI style and attempted to map to NCBI, I was expecting the same output as the input. ## More than 1 output > mapSeqlevels('2', 'NCBI') 2 [1,] "2" [2,] "LGII" ## Turns out there is another organism that would make this mapping valid > GenomeInfoDb:::.supportedSeqnameMappings()[['Populus_trichocarpa']][2, ] NCBI JGI2.F 2 LGII 2 I also noticed that `GenomeInfoDb` couldn't work with some fictional examples. For example: ## Expected error, but this meant that `derfinder` couldn't use toy examples from other ## packages. > seqlevelsStyle('seq1') Error in seqlevelsStyle("seq1") : The style does not have a compatible entry for the species supported by Seqname. Please see genomeStyles() for supported species/style ## Using toy data from Rsamtools fails as expected > fl <- system.file("extdata", "ex1.bam", package="Rsamtools", + mustWork=TRUE) > library(GenomicAlignments) > names(scanBamHeader(BamFile(fl))$targets) [1] "seq1" "seq2" > seqlevelsStyle(names(scanBamHeader(BamFile(fl))$targets)) Error in seqlevelsStyle(names(scanBamHeader(BamFile(fl))$targets)) : The style does not have a compatible entry for the species supported by Seqname. Please see genomeStyles() for supported species/style So, I wanted a way to use `GenomeInfoDb` to make the correct naming style maps for supported genomes, and simply return the original sequence names if using an unsupported species/mapping or if the original and target styles are the same. That's what I basically did when I wrote 'extendedMapSeqlevels' (aye, long name) which you can see at https://gist.github.com/3e6f0e2e2bb67ee910da If you think that others might be interested in such a function, then by all means please add it to `GenomeInfoDb` and I'll import it. You'll notice how it tries to guess the species and/or current naming style when that information is not provided. When it's used with verbose = TRUE it will show a message explaining why the mapping failed (if it did), it'll show the link to the `GenomeInfoDb` vignette for adding an organism, and then return the original sequence names. I have a slightly modified version in derfinder https://github.com/lcolladotor/derfinder/blob/master/R/extendedMapSeqlevels.R or https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/derfinder/R/extendedMapSeqlevels.R where I simply assume some other default values and guarantee continuity with how I was doing things previously. Naturally, R CMD check now shows a NOTE because I'm using the ::: syntax for un-exported functions from `GenomeInfoDb`. Cheers, Leo Leonardo Collado Torres, PhD Candidate Department of Biostatistics Johns Hopkins University Bloomberg School of Public Health Website: http://www.biostat.jhsph.edu/~lcollado/ Blog: http://lcolladotor.github.io/ _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel