Hi Hervé,

I agree these transcript categories are not very specific, but they would still be useful, at least for knowing how many coding/non-coding genes are represented in a txdb instance coming from UCSC. If supporting them is not much work for a magician like yourself, it may be worth the effort.

Thomas
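For illustration, here is a minimal sketch of the single join Hervé describes in the quoted message below (knownGene to kgTxInfo via the "name" column), followed by the per-category count mentioned above. It uses two toy data frames rather than the real UCSC tables: the third transcript ID and all category assignments are made up, and only the column names "name" and "category" and the three category values come from this thread. It is not meant to show how GenomicFeatures does or would implement this.

```r
# Toy stand-ins for the two UCSC tables. The rows are made up for
# illustration; "uc999zzz.1" is a hypothetical ID and the category
# assignments are NOT the real ones for these transcripts.
knownGene <- data.frame(
  name  = c("uc001aaa.3", "uc010nxr.1", "uc999zzz.1"),
  chrom = c("chr1", "chr1", "chr2"),
  stringsAsFactors = FALSE
)
kgTxInfo <- data.frame(
  name     = c("uc001aaa.3", "uc010nxr.1", "uc999zzz.1"),
  category = c("noncoding", "noncoding", "coding"),
  stringsAsFactors = FALSE
)

# Single join on the shared "name" column. all.x = TRUE keeps transcripts
# that have no kgTxInfo row; their category becomes NA.
tx <- merge(knownGene, kgTxInfo, by = "name", all.x = TRUE)

# Count transcripts per category, keeping any NA visible.
tx_counts <- table(tx$category, useNA = "ifany")
print(tx_counts)
```

The left join is a deliberate choice here: a transcript without a category stays in the result with NA instead of silently disappearing from the counts.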
On Tue, Dec 1, 2015 at 1:17 PM Hervé Pagès <hpa...@fredhutch.org> wrote:

> Hi Malcolm,
>
> On 12/01/2015 12:29 PM, Cook, Malcolm wrote:
> >> -----Original Message-----
> > > From: Bioc-devel [mailto:bioc-devel-boun...@r-project.org] On Behalf Of
> > > Hervé Pagès
> > > Sent: Monday, October 26, 2015 12:39 PM
> > > To: Thomas Girke <thomas.gi...@ucr.edu>; Arora, Sonali
> > > <sar...@fredhutch.org>; bioc-devel@r-project.org
> > > Subject: Re: [Bioc-devel] systemPipeR error - Error in NSBS(i, x, exact = exact,
> > > upperBoundIsStrict = !allow.append) :
> > >
> > > Hi Thomas,
> > >
> > > On 10/25/2015 01:06 PM, Thomas Girke wrote:
> > > > I fixed this in systemPipeR versions 1.4.3/1.5.3. The reason for this error
> > > > was that the tx_type column contains only NA values when a txdb is generated
> > > > with makeTxDbFromUCSC(). Returning here something more meaningful may be
> > > > useful, such as the transcript type information available when a txdb is
> > > > generated from a GFF.
> > >
> > > We've considered this and might do it at some point. The difficulty
> > > though is that UCSC does not provide this information as part of
> > > the track itself so we'll have to go grab it from some other table
> > > in their huge db through many joins. In the mean time, I'll try to
> > > clarify this in the documentation.
> > >
> >
> > Or, you could pull it from BioMart
>
> That means you need to be able to map UCSC transcripts IDs (uc001aaa.3,
> uc010nxr.1, etc...) to Ensembl transcript IDs (ENST00000456328,
> ENST00000515242, etc...) with the usual complication that only a
> fraction of UCSC IDs are unambiguously mappable. I don't know if that's
> easier/better than grabbing the tx_type directly from UCSC.
>
> OTOH the closest thing to Ensembl transcript_biotype I see in the UCSC
> db is the "category" field in the kgTxInfo table. For example, for hg19,
> the information is here:
>
>   https://genome.ucsc.edu/cgi-bin/hgTables?hgta_doSchemaDb=hg19&hgta_doSchemaTable=kgTxInfo
>
> It's only one join away from the knownGene table (via the "name"
> column) so should not be too hard to get but it's not as rich as the
> Ensembl transcript_biotype (only 3 possible values at the moment:
> coding, nearCoding, noncoding).
>
> What do people think?
>
> H.
>
> >
> > ~Malcolm
> >
> > > H.
> > >
> > > > Thanks,
> > > >
> > > > Thomas
> > > >
> > > > On Fri, Oct 23, 2015 at 12:49:09AM +0000, Thomas Girke wrote:
> > > >> Thanks. Good to know. I have never tried this with an txdb instance
> > > >> from makeTxDbFromUCSC(). Will fix this over the weekend.
> > > >> Thomas
> > > >>
> > > >> On Thu, Oct 22, 2015 at 5:39 PM Arora, Sonali <sar...@fredhutch.org> wrote:
> > > >>
> > > >> Hi Thomas,
> > > >>
> > > >> I get the following error when I try to obtain the feature types using
> > > >> the function genFeatures()
> > > >>
> > > >>> library(systemPipeR)
> > > >>> library(GenomicFeatures)
> > > >> Loading required package: AnnotationDbi
> > > >>> txdb <- makeTxDbFromUCSC(genome = "hg19", tablename = "refGene")
> > > >> Download the refGene table ... OK
> > > >> Download the refLink table ... OK
> > > >> Extract the 'transcripts' data frame ... OK
> > > >> Extract the 'splicings' data frame ... OK
> > > >> Download and preprocess the 'chrominfo' data frame ... OK
> > > >> Prepare the 'metadata' data frame ... OK
> > > >> Make the TxDb object ... OK
> > > >> Warning message:
> > > >> In .extractCdsLocsFromUCSCTxTable(ucsc_txtable, exon_locs) :
> > > >> UCSC data anomaly in 359 transcript(s): the cds cumulative length is
> > > >> not a multiple of 3 for transcripts 'NM_001037501' 'NM_001277444'
> > > >> 'NM_001037675' 'NM_001271872' 'NM_001170637' 'NM_001300952'
> > > >> 'NM_015326' 'NM_017940' 'NM_001271870' 'NM_001143962' 'NM_001305275'
> > > >> 'NM_001146344' 'NM_001300891' 'NM_001010890' 'NM_001300891'
> > > >> 'NM_001289974' 'NM_001291281' 'NM_001301371' 'NM_016178'
> > > >> 'NM_001134939' 'NM_001080427' 'NM_001145710' 'NM_001291328'
> > > >> 'NM_001271466' 'NM_001017915' 'NM_005541' 'NM_000348' 'NM_001145051'
> > > >> 'NM_001135649' 'NM_001128929' 'NM_001080423' 'NM_001144382'
> > > >> 'NM_001291661' 'NM_002958' 'NM_001005861' 'NM_004636' 'NM_001005914'
> > > >> 'NM_001290060' 'NM_001290061' 'NM_001289930' 'NM_003715'
> > > >> 'NM_001290049' 'NM_001286054' 'NM_001286053' 'NM_001286052'
> > > >> 'NM_182524' 'NM_001075' 'NM_00 [... truncated]
> > > >>> feat <- genFeatures(txdb, featuretype="all", reduce_ranges=TRUE, upstream=1000,
> > > >> + downstream=0, verbose=TRUE)
> > > >> Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
> > > >> subscript contains NAs
> > > >>
> > > >> probably because -
> > > >>
> > > >> Browse[2]> tx
> > > >> GRanges object with 54439 ranges and 3 metadata columns:
> > > >>               seqnames           ranges strand |      tx_name
> > > >>                  <Rle>        <IRanges>  <Rle> |  <character>
> > > >>     [1]           chr1   [11874, 14409]      + |    NR_046018
> > > >>     [2]           chr1   [30366, 30503]      + |    NR_036051
> > > >>     [3]           chr1   [30366, 30503]      + |    NR_036266
> > > >>     [4]           chr1   [30366, 30503]      + |    NR_036267
> > > >>     [5]           chr1   [30366, 30503]      + |    NR_036268
> > > >>     ...            ...              ...    ... ...        ...
> > > >> [54435] chrUn_gl000228 [112605, 114676]      + | NM_001306068
> > > >> [54436] chrUn_gl000228  [ 29339, 32226]      - | NM_001005217
> > > >> [54437] chrUn_gl000228  [ 29339, 32226]      - | NM_001286820
> > > >> [54438] chrUn_gl000241  [ 14739, 36767]      - |    NR_132315
> > > >> [54439] chrUn_gl000241  [ 16025, 36957]      - |    NR_132320
> > > >>                 gene_id     tx_type
> > > >>         <CharacterList> <character>
> > > >>     [1]       100287102        <NA>
> > > >>     [2]       100302278        <NA>
> > > >>     [3]       100422831        <NA>
> > > >>     [4]       100422834        <NA>
> > > >>     [5]       100422919        <NA>
> > > >>     ...             ...         ...
> > > >> [54435]       100288687        <NA>
> > > >> [54436]          448831        <NA>
> > > >> [54437]          448831        <NA>
> > > >> [54438]       100289097        <NA>
> > > >> [54439]       102723780        <NA>
> > > >> -------
> > > >> seqinfo: 93 sequences (1 circular) from hg19 genome
> > > >> Browse[2]> unique(mcols(tx)$tx_type)
> > > >> [1] NA
> > > >> debug: tmp <- tx[mcols(tx)$tx_type == tx_type[i]]
> > > >> Browse[2]>
> > > >> Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
> > > >> subscript contains NAs
> > > >>
> > > >> Here is my sessionInfo
> > > >>
> > > >>> sessionInfo()
> > > >> R Under development (unstable) (2015-10-15 r69519)
> > > >> Platform: x86_64-pc-linux-gnu (64-bit)
> > > >> Running under: Ubuntu 14.04.2 LTS
> > > >>
> > > >> locale:
> > > >>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> > > >>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> > > >>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
> > > >>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> > > >>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> > > >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> > > >>
> > > >> attached base packages:
> > > >> [1] parallel stats4 stats graphics grDevices utils datasets
> > > >> [8] methods base
> > > >>
> > > >> other attached packages:
> > > >>  [1] GenomicFeatures_1.23.3     AnnotationDbi_1.33.0
> > > >>  [3] systemPipeR_1.5.1          RSQLite_1.0.0
> > > >>  [5] DBI_0.3.1                  ShortRead_1.25.10
> > > >>  [7] GenomicAlignments_1.7.1    SummarizedExperiment_1.1.0
> > > >>  [9] Biobase_2.31.0             BiocParallel_1.5.0
> > > >> [11] Rsamtools_1.23.0           Biostrings_2.39.0
> > > >> [13] XVector_0.11.0             GenomicRanges_1.21.32
> > > >> [15] GenomeInfoDb_1.7.1         IRanges_2.5.3
> > > >> [17] S4Vectors_0.9.5            BiocGenerics_0.17.0
> > > >>
> > > >> loaded via a namespace (and not attached):
> > > >>  [1] Rcpp_0.12.1            lattice_0.20-33     GO.db_3.2.2
> > > >>  [4] digest_0.6.8           plyr_1.8.3          futile.options_1.0.0
> > > >>  [7] BatchJobs_1.6          ggplot2_1.0.1       zlibbioc_1.17.0
> > > >> [10] annotate_1.49.0        Matrix_1.2-2        checkmate_1.6.2
> > > >> [13] proto_0.3-10           GOstats_2.37.0      splines_3.3.0
> > > >> [16] stringr_1.0.0          pheatmap_1.0.7      RCurl_1.95-4.7
> > > >> [19] biomaRt_2.27.0         munsell_0.4.2       sendmailR_1.2-1
> > > >> [22] rtracklayer_1.31.1     base64enc_0.1-3     BBmisc_1.9
> > > >> [25] fail_1.3               edgeR_3.13.0        XML_3.98-1.3
> > > >> [28] AnnotationForge_1.13.0 MASS_7.3-44         bitops_1.0-6
> > > >> [31] grid_3.3.0             RBGL_1.47.0         xtable_1.7-4
> > > >> [34] GSEABase_1.33.0        gtable_0.1.2        magrittr_1.5
> > > >> [37] scales_0.3.0           graph_1.49.1        stringi_1.0-1
> > > >> [40] hwriter_1.3.2          reshape2_1.4.1      genefilter_1.53.0
> > > >> [43] limma_3.27.0           latticeExtra_0.6-26 futile.logger_1.4.1
> > > >> [46] brew_1.0-6             rjson_0.2.15        lambda.r_1.1.7
> > > >> [49] RColorBrewer_1.1-2     tools_3.3.0         Category_2.37.0
> > > >> [52] survival_2.38-3        colorspace_1.2-6
> > > >>
> > > >> --
> > > >> Thanks and Regards,
> > > >> Sonali
> > > >>
> > > >
> > > > _______________________________________________
> > > > Bioc-devel@r-project.org mailing list
> > > > https://stat.ethz.ch/mailman/listinfo/bioc-devel
> > > >
> > >
> > > --
> > > Hervé Pagès
> > >
> > > Program in Computational Biology
> > > Division of Public Health Sciences
> > > Fred Hutchinson Cancer Research Center
> > > 1100 Fairview Ave. N, M1-B514
> > > P.O. Box 19024
> > > Seattle, WA 98109-1024
> > >
> > > E-mail: hpa...@fredhutch.org
> > > Phone: (206) 667-5791
> > > Fax: (206) 667-1319
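
On the original error in the quoted report: with an all-NA tx_type column, the comparison `mcols(tx)$tx_type == tx_type[i]` evaluates to NA for every element, so the logical subscript is full of NAs, which is what the "subscript contains NAs" message complains about. Below is a minimal base-R sketch of one NA-safe way to write that kind of subsetting. The helper name `subset_by_type` and the type values are hypothetical, plain vectors stand in for the GRanges object, and this is not necessarily how systemPipeR 1.4.3/1.5.3 fixed it.

```r
# Hypothetical helper: keep the elements of `values` whose type equals `type`.
# which() drops NA comparison results, so the subscript never contains NAs.
subset_by_type <- function(values, type_col, type) {
  values[which(type_col == type)]
}

# Toy stand-ins for the transcript names and for mcols(tx)$tx_type.
tx_name     <- c("NR_046018", "NR_036051", "NM_001005217")
tx_type_na  <- rep(NA_character_, 3)   # all NA, as in the report above
tx_type_mix <- c("ncRNA", NA, "mRNA")  # made-up mixed case

# Second guard: only loop over the non-NA types.
types_na <- unique(tx_type_na)
types_na <- types_na[!is.na(types_na)] # empty when every value is NA

all_na_result <- subset_by_type(tx_name, tx_type_na, "mRNA")
mixed_result  <- subset_by_type(tx_name, tx_type_mix, "mRNA")
print(all_na_result)
print(mixed_result)
```

Either guard alone would avoid the NA subscript; using both also makes the all-NA case an explicit no-op rather than a loop over a single NA "type".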