Thanks Dario and Mike for looking into this.

In the mean time I added makeTxDbFromEnsembl() for creating a TxDb
object by querying directly an Ensembl MySQL server. Only lightly
tested but so far seems to faster and more reliable than
makeTxDbFromBiomart():

  > library(GenomicFeatures)

> system.time(txdb <- makeTxDbFromEnsembl("Homo sapiens", server="useastdb.ensembl.org"))
  Fetch transcripts and genes from Ensembl ... OK
  Fetch exons and CDS from Ensembl ... OK
  Fetch chromosome names and lengths from Ensembl ...OK
  Gather the metadata ... OK
  Make the TxDb object ... OK
     user  system elapsed
   46.420   1.271  58.270

  > txdb
  TxDb object:
  # Db type: TxDb
  # Supporting package: GenomicFeatures
  # Data source: Ensembl
  # Organism: Homo sapiens
  # Ensembl release: 90
  # Ensembl database: homo_sapiens_core_90_38
  # MySQL server: useastdb.ensembl.org
  # transcript_nrow: 220144
  # exon_nrow: 757782
  # cds_nrow: 789614
  # Db created by: GenomicFeatures package from Bioconductor
  # Creation time: 2017-10-29 22:13:21 -0700 (Sun, 29 Oct 2017)
  # GenomicFeatures version at creation time: 1.29.16
  # RSQLite version at creation time: 2.0
  # DBSCHEMAVERSION: 1.2

It's in GenomicFeatures 1.29.16.

Cheers,
H.


On 10/28/2017 03:32 PM, Mike Smith wrote:
My feeling is that this isn't related to RCurl, since the example
transcript is missing if you use httr to submit the query instead.  You can
check this with the code at
https://urldefense.proofpoint.com/v2/url?u=https-3A__gist.github.com_grimbough_7e7a47b7a4f64915220ce35cc1ce8f39&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=njpXf5TkV-BLhepKQRWVTdAzBmWWCsDj7RIPgs7FWBM&s=orkmK9y7kVCLbLB6tlMJfgT5V_GKzVysZ00DMevuAm4&e=


I wonder if this is related to BioMart's instability when you submit large
queries.  The web interface suggests no more than 500 entries when
filtering on something like a gene ID, but in these examples we're applying
no filter at all.  Problems related to this have been reported before (
https://urldefense.proofpoint.com/v2/url?u=https-3A__support.bioconductor.org_p_86358_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=njpXf5TkV-BLhepKQRWVTdAzBmWWCsDj7RIPgs7FWBM&s=CxVV4WArQc4xcf74Az8V8vfQMr80Q8K2um6sKTggHw0&e=)
 and I patched the devel version
of biomaRt to submit queries in batches of 500 - although this currently
has no effect if there isn't a set of 'values' to split into batches.

Interestingly, if I first obtain a list of all gene IDs in Ensembl, and
then submit those in a batched query I get more exons than your first
approach.

all_genes <- getBM(attributes = "ensembl_gene_id", mart=mart)
attributes1 <- c("ensembl_transcript_id", "ensembl_exon_id")
attributes2 <- c(attributes1, "rank", "genomic_coding_start", "cds_start",
"5_utr_start")
df3 <- getBM(attributes1, filters = "ensembl_gene_id", values =
all_genes[,1], mart=mart)
df4 <- getBM(attributes2, filters = "ensembl_gene_id", values =
all_genes[,1], mart=mart)

dim(df3)
[1] 1309356       6

The number of returned results is consistent for even with the larger
number of attributes

identical(df3[,1], df4[,1])
[1] TRUE

And some of these transcripts are clearly not present in the first query:

length(which( !unique(df3[,"ensembl_transcript_id"]) %in% df1[,
"ensembl_transcript_id"] ))
[1] 28128

It could be that my batched code is doing something wrong, but I'm starting
to suspect that even the first query is dropping data silently!

Mike


On 28 October 2017 at 13:00, Dario Strbenac <dstr7...@uni.sydney.edu.au>
wrote:

Good day,

I stepped through the code until execution reached the end of postForm in
RCurl which is called by getBM and obtains the textual result from the
server. If I check the contents of write$value(), the example missing
transcript is not there.

Browse[3]> grep("ENST00000485971", write$value())
integer(0)

write$value is a weird function. It's prototype is function (collapse =
"", ...) but its body contains code such as

     if (is.null(collapse))
         return(txt)

I wonder where txt is created. It's not passed as an extra variable.

Browse[7]> print(list(...))
list()

Searching the R code reveals that txt is created as a global variable in
another function named dynCurlReader by the code statement txt <<-
character().

RCurl also uses functions that don't begin with a dot but are undocumented.

ans = encode(ans)
Browse[7]> ?encode
No documentation for ‘encode’ in specified packages and libraries

Anyway, the transcript ID is also missing from txt.

Browse[7]> grep("ENST00000485971", txt)
integer(0)

It's hard to know what the obfuscated code of RCurl is doing.

--------------------------------------
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia
_______________________________________________
Bioc-devel@r-project.org mailing list
https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=njpXf5TkV-BLhepKQRWVTdAzBmWWCsDj7RIPgs7FWBM&s=1qsHGdbNcXIIZCJDMeyF5e0R7MZWxNlxOxbXiH1zLGA&e=


        [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=njpXf5TkV-BLhepKQRWVTdAzBmWWCsDj7RIPgs7FWBM&s=1qsHGdbNcXIIZCJDMeyF5e0R7MZWxNlxOxbXiH1zLGA&e=


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to