update: this approach bases on Martin's suggestion and I think is ideal for EnsDb or any Ensembl based annotation packages. It's not required to install an explicit genome package, but load (and eventually cache) dynamically the correct genomic sequence from Ensembl via AnnotationHub. That way chromosome names, variants etc are perfectly aligned. Also, this guarantees that the genome version and gene definitions, both from Ensembl match.
Example code (is now also included in the ensembldb vignette): library(AnnotationHub) library(EnsDb.Hsapiens.v75) library(Rsamtools) ah <- AnnotationHub() edb <- EnsDb.Hsapiens.v75 ## get the Ensembl version eVersion <- metadata(edb)[metadata(edb)$name=="ensembl_version", "value"] ## query all available files for the Ensembl version eData <- query(ah, c(organism(edb), paste0("release-", eVersion))) eData ## retrieve the *dna.toplevel.fa file; this might take some time. Dna <- ah[["AH20439"]] ## generate an index if none is available if(is.na(index(Dna))){ indexFa(Dna) Dna <- FaFile(path(Dna)) } ## get start/end coordinates of all genes genes <- genes(edb) ## subset to all genes that are encoded on chromosomes for which ## we do have DNA sequence available. genes <- genes[seqnames(genes) %in% seqnames(seqinfo(Dna))] ## get the gene sequences, i.e. the sequence including the sequence of ## all of the gene's exons and introns geneSeqs <- getSeq(Dna, genes) cheers, jo On 09 Jun 2015, at 11:38, Rainer Johannes <johannes.rai...@eurac.edu<mailto:johannes.rai...@eurac.edu>> wrote: dear Ludwig, On 09 Jun 2015, at 10:46, Ludwig Geistlinger <ludwig.geistlin...@bio.ifi.lmu.de<mailto:ludwig.geistlin...@bio.ifi.lmu.de>> wrote: Dear Johannes, Thx for providing the great EnsDb packages! One question: As of now, I am able to choose between TxDb and EnsDb for genomic coordinates of genomic features such as genes, transcripts, and exons. For the sequences themselves I need the corresponding BSgenome package. While it is easy to automatically map from a specific TxDb package (eg TxDb.Hsapiens.UCSC.<hg38>.knownGene) to the corresponding BSgenome package (here: BSgenome.Hsapiens.UCSC.<hg38>), I wonder how to do that for an EnsDb package as the package name (eg EnsDb.Hsapiens.v79) contains no information about the genome build. A cumbersome option would be to extract the genome_build from the metadata of the EnsDb package (which would give me for EnsDb.Hsapiens.v79: 'GRCh38') and then ask all existing BSgenome.Hsapiens packages for their metadata release name (eg 'GRCh38' for BSgenome.Hsapiens.UCSC.hg38). This however needs all BSgenome.Hsapiens packages installed and takes thus too much time and space for a programmatic access. Can you suggest a better way to map from coordinates to sequence (within the BioC annotation functionality)? agree, there's no easy mapping (yet). I'll implement a method "suggestGenomePackage" in the ensembldb package. In the long run I hope that also NCBI BSgenome packages (like the BSgenome.Hsapiens.NCBI.GRCh38) will become available for all species... that would make the mapping much easier... cheers, jo Thanks & Best, Ludwig dear Robert and Ludwig, the EnsDb packages provide all the gene/transcript etc annotations for all genes defined in the Ensembl database (for a given species and Ensembl release). Except the column/attribute "entrezid" that is stored in the internal database there is however no link to NCBI or UCSC annotations. So, basically, if you want to use "pure" Ensembl based annotations: use EnsDb, if you want to have the UCSC annotations: use the TxDb packages. In case you need EnsDbs of other species or Ensembl versions, the ensembldb package provides functionality to generate such packages either using the Ensembl Perl API or using GTF files provided by Ensembl. If you have problems building the packages, just drop me a line and I'll do that. cheers, jo On 03 Jun 2015, at 15:56, Robert M. Flight <rfligh...@gmail.com<mailto:rfligh...@gmail.com>> wrote: Ludwig, If you do this search on the UCSC genome browser (which this annotation package is built from), you will see that the longest variant is what is shown http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammal&org=Human&db=hg38&position=brca1&hgt.positionInput=brca1&hgt.suggestTrack=knownGene&Submit=submit&hgsid=429339723_8sd4QD2jSAnAsa6cVCevtoOy4GAz&pix=1885 If instead of "genes" you do "transcripts", you will see 20 different transcripts for this gene, including the one listed by NCBI. I havent tried it yet (haven't upgraded R or bioconductor to latest version), but there is now an Ensembl based annotation package as well, that may work better?? http://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html -Robert On Wed, Jun 3, 2015 at 7:04 AM Ludwig Geistlinger < ludwig.geistlin...@bio.ifi.lmu.de<mailto:ludwig.geistlin...@bio.ifi.lmu.de>> wrote: Dear Bioc annotation team, Querying TxDb.Hsapiens.UCSC.hg38.knownGene for gene coordinates, e.g. for BRCA1; ENSG00000012048; entrez:672 via genes(TxDb.Hsapiens.UCSC.hg38.knownGene, vals=list(gene_id="672")) gives me: GRanges object with 1 range and 1 metadata column: seqnames ranges strand | gene_id <Rle> <IRanges> <Rle> | <character> 672 chr17 [43044295, 43170403] - | 672 ------- seqinfo: 455 sequences (1 circular) from hg38 genome However, querying Ensembl and NCBI Gene http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000012048 http://www.ncbi.nlm.nih.gov/gene/672 the gene is located at (note the difference in the end position) Chromosome 17: 43,044,295-43,125,483 reverse strand How is the inconsistency explained and how to extract an ENSEMBL/NCBI conform annotation from the TxDb object? (I am aware of biomaRt, but I want to explicitely use the Bioc annotation functionality). Thanks! Ludwig -- Dipl.-Bioinf. Ludwig Geistlinger Lehr- und Forschungseinheit für Bioinformatik Institut für Informatik Ludwig-Maximilians-Universität München Amalienstrasse 17, 2. Stock, Büro A201 80333 München Tel.: 089-2180-4067 eMail: ludwig.geistlin...@bio.ifi.lmu.de<mailto:ludwig.geistlin...@bio.ifi.lmu.de> _______________________________________________ Bioc-devel@r-project.org<mailto:Bioc-devel@r-project.org> mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel [[alternative HTML version deleted]] _______________________________________________ Bioc-devel@r-project.org<mailto:Bioc-devel@r-project.org> mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel [[alternative HTML version deleted]] _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel