Hi Thomas,

On 12/01/2015 03:50 PM, Thomas Girke wrote:
Hi Hervé,
I agree these transcript categories are not very specific, but perhaps
still useful to at least know how many coding/non-coding genes are
represented in a txdb instance coming from UCSC. If it is not much work
for a magician like yourself to support those then it may be worth the
effort?

We're adding this to our backlog. Will take a couple of weeks before
we get this done.

Cheers,
H.

Thomas

On Tue, Dec 1, 2015 at 1:17 PM Hervé Pagès <hpa...@fredhutch.org> wrote:

    Hi Malcolm,

    On 12/01/2015 12:29 PM, Cook, Malcolm wrote:
     >> -----Original Message-----
     >   > From: Bioc-devel [mailto:bioc-devel-boun...@r-project.org] On Behalf Of
     >   > Hervé Pagès
     >   > Sent: Monday, October 26, 2015 12:39 PM
     >   > To: Thomas Girke <thomas.gi...@ucr.edu>; Arora, Sonali
     >   > <sar...@fredhutch.org>; bioc-devel@r-project.org
     >   > Subject: Re: [Bioc-devel] systemPipeR error - Error in NSBS(i, x, exact = exact,
     >   > upperBoundIsStrict = !allow.append) :
     >   >
     >   > Hi Thomas,
     >   >
     >   > On 10/25/2015 01:06 PM, Thomas Girke wrote:
     >   > > I fixed this in systemPipeR versions 1.4.3/1.5.3. The reason for this error
     >   > > was that the tx_type column contains only NA values when a txdb is
     >   > > generated with makeTxDbFromUCSC(). Returning here something more
     >   > > meaningful may be useful, such as the transcript type information
     >   > > available when a txdb is generated from a GFF.
     >   >
     >   > We've considered this and might do it at some point. The difficulty
     >   > though is that UCSC does not provide this information as part of
     >   > the track itself so we'll have to go grab it from some other table
     >   > in their huge db through many joins. In the meantime, I'll try to
     >   > clarify this in the documentation.
     >   >
     >
     > Or, you could pull it from BioMart

    That means you need to be able to map UCSC transcript IDs (uc001aaa.3,
    uc010nxr.1, etc...) to Ensembl transcript IDs (ENST00000456328,
    ENST00000515242, etc...) with the usual complication that only a
    fraction of UCSC IDs are unambiguously mappable. I don't know if that's
    easier/better than grabbing the tx_type directly from UCSC.

    OTOH the closest thing to Ensembl transcript_biotype I see in the UCSC
    db is the "category" field in the kgTxInfo table. For example, for hg19,
    the information is here:

    https://genome.ucsc.edu/cgi-bin/hgTables?hgta_doSchemaDb=hg19&hgta_doSchemaTable=kgTxInfo

    It's only one join away from the knownGene table (via the "name"
    column) so should not be too hard to get but it's not as rich as the
    Ensembl transcript_biotype (only 3 possible values at the moment:
    coding, nearCoding, noncoding).
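
To make the one-join idea concrete, here is a minimal sketch in base R. The two data frames are made-up stand-ins for the knownGene and kgTxInfo tables (the transcript names and category assignments are invented for illustration, not real UCSC data), and merge() stands in for the SQL join. This is not how makeTxDbFromUCSC() is implemented, only an illustration of what the join would yield.

```r
# Toy stand-ins for the UCSC tables; all rows are invented for illustration.
knownGene <- data.frame(
  name  = c("tx_a", "tx_b", "tx_c"),   # hypothetical transcript IDs
  chrom = c("chr1", "chr1", "chr2"),
  stringsAsFactors = FALSE
)
kgTxInfo <- data.frame(
  name     = c("tx_a", "tx_b"),
  category = c("coding", "noncoding"),
  stringsAsFactors = FALSE
)

# Left join on "name": transcripts without a kgTxInfo row get category NA.
tx <- merge(knownGene, kgTxInfo, by = "name", all.x = TRUE)
names(tx)[names(tx) == "category"] <- "tx_type"

# Count transcripts per category, keeping any NA visible.
counts <- table(tx$tx_type, useNA = "ifany")
print(counts)
```

A left join (all.x = TRUE) is used so that a transcript missing from the second table still appears, with an NA tx_type, rather than being silently dropped.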

    What do people think?

    H.

     >
     > ~Malcolm
     >
     >   > H.
     >   >
     >   > >
     >   > > Thanks,
     >   > >
     >   > > Thomas
     >   > >
     >   > > On Fri, Oct 23, 2015 at 12:49:09AM +0000, Thomas Girke wrote:
     >   > >> Thanks. Good to know. I have never tried this with a txdb instance
     >   > >> from makeTxDbFromUCSC(). Will fix this over the weekend.
     >   > >> Thomas
     >   > >>
     >   > >>
     >   > >>
     >   > >> On Thu, Oct 22, 2015 at 5:39 PM Arora, Sonali <sar...@fredhutch.org> wrote:
     >   > >>
     >   > >>
     >   > >> Hi Thomas,
     >   > >>
     >   > >> I get the following error when I try to obtain the feature types using
     >   > >> the function genFeatures()
     >   > >>
     >   > >>
     >   > >>> library(systemPipeR)
     >   > >>> library(GenomicFeatures)
     >   > >> Loading required package: AnnotationDbi
     >   > >>> txdb <- makeTxDbFromUCSC(genome = "hg19", tablename = "refGene")
     >   > >> Download the refGene table ... OK
     >   > >> Download the refLink table ... OK
     >   > >> Extract the 'transcripts' data frame ... OK
     >   > >> Extract the 'splicings' data frame ... OK
     >   > >> Download and preprocess the 'chrominfo' data frame ... OK
     >   > >> Prepare the 'metadata' data frame ... OK
     >   > >> Make the TxDb object ... OK
     >   > >> Warning message:
     >   > >> In .extractCdsLocsFromUCSCTxTable(ucsc_txtable, exon_locs) :
     >   > >> UCSC data anomaly in 359 transcript(s): the cds cumulative length is
     >   > >> not a multiple of 3 for transcripts 'NM_001037501' 'NM_001277444'
     >   > >> 'NM_001037675' 'NM_001271872' 'NM_001170637' 'NM_001300952'
     >   > >> 'NM_015326' 'NM_017940' 'NM_001271870' 'NM_001143962' 'NM_001305275'
     >   > >> 'NM_001146344' 'NM_001300891' 'NM_001010890' 'NM_001300891'
     >   > >> 'NM_001289974' 'NM_001291281' 'NM_001301371' 'NM_016178'
     >   > >> 'NM_001134939' 'NM_001080427' 'NM_001145710' 'NM_001291328'
     >   > >> 'NM_001271466' 'NM_001017915' 'NM_005541' 'NM_000348' 'NM_001145051'
     >   > >> 'NM_001135649' 'NM_001128929' 'NM_001080423' 'NM_001144382'
     >   > >> 'NM_001291661' 'NM_002958' 'NM_001005861' 'NM_004636' 'NM_001005914'
     >   > >> 'NM_001290060' 'NM_001290061' 'NM_001289930' 'NM_003715'
     >   > >> 'NM_001290049' 'NM_001286054' 'NM_001286053' 'NM_001286052'
     >   > >> 'NM_182524' 'NM_001075' 'NM_00 [... truncated]
     >   > >>> feat <- genFeatures(txdb, featuretype="all", reduce_ranges=TRUE, upstream=1000,
     >   > >> + downstream=0, verbose=TRUE)
     >   > >> Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
     >   > >> subscript contains NAs
     >   > >>
     >   > >>
     >   > >> probably because -
     >   > >>
     >   > >> Browse[2]> tx
     >   > >> GRanges object with 54439 ranges and 3 metadata columns:
     >   > >> seqnames ranges strand | tx_name
     >   > >> <Rle> <IRanges> <Rle> | <character>
     >   > >> [1] chr1 [11874, 14409] + | NR_046018
     >   > >> [2] chr1 [30366, 30503] + | NR_036051
     >   > >> [3] chr1 [30366, 30503] + | NR_036266
     >   > >> [4] chr1 [30366, 30503] + | NR_036267
     >   > >> [5] chr1 [30366, 30503] + | NR_036268
     >   > >> ... ... ... ... ... ...
     >   > >> [54435] chrUn_gl000228 [112605, 114676] + | NM_001306068
     >   > >> [54436] chrUn_gl000228 [ 29339, 32226] - | NM_001005217
     >   > >> [54437] chrUn_gl000228 [ 29339, 32226] - | NM_001286820
     >   > >> [54438] chrUn_gl000241 [ 14739, 36767] - | NR_132315
     >   > >> [54439] chrUn_gl000241 [ 16025, 36957] - | NR_132320
     >   > >> gene_id tx_type
     >   > >> <CharacterList> <character>
     >   > >> [1] 100287102 <NA>
     >   > >> [2] 100302278 <NA>
     >   > >> [3] 100422831 <NA>
     >   > >> [4] 100422834 <NA>
     >   > >> [5] 100422919 <NA>
     >   > >> ... ... ...
     >   > >> [54435] 100288687 <NA>
     >   > >> [54436] 448831 <NA>
     >   > >> [54437] 448831 <NA>
     >   > >> [54438] 100289097 <NA>
     >   > >> [54439] 102723780 <NA>
     >   > >> -------
     >   > >> seqinfo: 93 sequences (1 circular) from hg19 genome
     >   > >> Browse[2]> unique(mcols(tx)$tx_type)
     >   > >> [1] NA
     >   > >> debug: tmp <- tx[mcols(tx)$tx_type == tx_type[i]]
     >   > >> Browse[2]>
     >   > >> Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
     >   > >> subscript contains NAs
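
The failure can be reproduced in miniature with plain vectors. This is only a sketch in base R, not the genFeatures() code: comparing an all-NA column with == yields NA rather than FALSE, and the GRanges subsetting in the debug output above rejects such a subscript. Wrapping the comparison in which(), or using %in%, drops the NAs. The "mRNA" value is just a placeholder type chosen for illustration.

```r
# tx_type as reported for the UCSC-derived txdb above: every value is NA.
tx_type <- c(NA_character_, NA_character_, NA_character_)

# '==' propagates NA, so the logical subscript is all NA, not FALSE.
sel <- tx_type == "mRNA"       # "mRNA" is a placeholder type
print(sel)

# Two NA-safe alternatives: which() drops NAs, %in% returns FALSE for them.
idx  <- which(sel)
sel2 <- tx_type %in% "mRNA"
print(idx)
print(sel2)
```

Either alternative gives an empty selection instead of an NA subscript, which is the kind of guard a caller needs when tx_type may be missing.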
     >   > >>
     >   > >>
     >   > >> Here is my sessionInfo
     >   > >>
     >   > >>> sessionInfo()
     >   > >> R Under development (unstable) (2015-10-15 r69519)
     >   > >> Platform: x86_64-pc-linux-gnu (64-bit)
     >   > >> Running under: Ubuntu 14.04.2 LTS
     >   > >>
     >   > >> locale:
     >   > >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
     >   > >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
     >   > >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
     >   > >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
     >   > >> [9] LC_ADDRESS=C LC_TELEPHONE=C
     >   > >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
     >   > >>
     >   > >> attached base packages:
     >   > >> [1] parallel stats4 stats graphics grDevices utils datasets
     >   > >> [8] methods base
     >   > >>
     >   > >> other attached packages:
     >   > >> [1] GenomicFeatures_1.23.3 AnnotationDbi_1.33.0
     >   > >> [3] systemPipeR_1.5.1 RSQLite_1.0.0
     >   > >> [5] DBI_0.3.1 ShortRead_1.25.10
     >   > >> [7] GenomicAlignments_1.7.1 SummarizedExperiment_1.1.0
     >   > >> [9] Biobase_2.31.0 BiocParallel_1.5.0
     >   > >> [11] Rsamtools_1.23.0 Biostrings_2.39.0
     >   > >> [13] XVector_0.11.0 GenomicRanges_1.21.32
     >   > >> [15] GenomeInfoDb_1.7.1 IRanges_2.5.3
     >   > >> [17] S4Vectors_0.9.5 BiocGenerics_0.17.0
     >   > >>
     >   > >> loaded via a namespace (and not attached):
     >   > >> [1] Rcpp_0.12.1 lattice_0.20-33 GO.db_3.2.2
     >   > >> [4] digest_0.6.8 plyr_1.8.3 futile.options_1.0.0
     >   > >> [7] BatchJobs_1.6 ggplot2_1.0.1 zlibbioc_1.17.0
     >   > >> [10] annotate_1.49.0 Matrix_1.2-2 checkmate_1.6.2
     >   > >> [13] proto_0.3-10 GOstats_2.37.0 splines_3.3.0
     >   > >> [16] stringr_1.0.0 pheatmap_1.0.7 RCurl_1.95-4.7
     >   > >> [19] biomaRt_2.27.0 munsell_0.4.2 sendmailR_1.2-1
     >   > >> [22] rtracklayer_1.31.1 base64enc_0.1-3 BBmisc_1.9
     >   > >> [25] fail_1.3 edgeR_3.13.0 XML_3.98-1.3
     >   > >> [28] AnnotationForge_1.13.0 MASS_7.3-44 bitops_1.0-6
     >   > >> [31] grid_3.3.0 RBGL_1.47.0 xtable_1.7-4
     >   > >> [34] GSEABase_1.33.0 gtable_0.1.2 magrittr_1.5
     >   > >> [37] scales_0.3.0 graph_1.49.1 stringi_1.0-1
     >   > >> [40] hwriter_1.3.2 reshape2_1.4.1 genefilter_1.53.0
     >   > >> [43] limma_3.27.0 latticeExtra_0.6-26 futile.logger_1.4.1
     >   > >> [46] brew_1.0-6 rjson_0.2.15 lambda.r_1.1.7
     >   > >> [49] RColorBrewer_1.1-2 tools_3.3.0 Category_2.37.0
     >   > >> [52] survival_2.38-3 colorspace_1.2-6
     >   > >>
     >   > >>
     >   > >>
     >   > >>
     >   > >> --
     >   > >> Thanks and Regards,
     >   > >> Sonali
     >   > >>
     >   > >>
     >   > >>
     >   > >
     >   > > _______________________________________________
     >   > > Bioc-devel@r-project.org mailing list
     >   > > https://stat.ethz.ch/mailman/listinfo/bioc-devel
     >   > >
     >   >
     >   > --
     >   > Hervé Pagès
     >   >
     >   > Program in Computational Biology
     >   > Division of Public Health Sciences
     >   > Fred Hutchinson Cancer Research Center
     >   > 1100 Fairview Ave. N, M1-B514
     >   > P.O. Box 19024
     >   > Seattle, WA 98109-1024
     >   >
     >   > E-mail: hpa...@fredhutch.org
     >   > Phone:  (206) 667-5791
     >   > Fax:    (206) 667-1319
     >   >
     >


