On 04/16/2016 01:09 PM, Marcin Kosiński wrote:
Hello,
I would like to ask you all for advice on the following issue.
Last year I started working with data from The Cancer Genome Atlas.
During that work our team (https://github.com/orgs/RTCGA/people)
prepared some tools for downloading and integrating datasets from the TCGA
study and provided them in the R package called RTCGA
<https://www.bioconductor.org/packages/3.3/bioc/html/RTCGA.html>, which is
available on Bioconductor.
Later on we worked on tools for visualizing and analyzing the most
popular datasets from TCGA, so we prepared data packages with those
datasets and submitted them to Bioconductor as 8 separate packages. You can
read more about them here: http://rtcga.github.io/RTCGA/
*I have a question about updating those data packages.* TCGA releases
dataset snapshots over time. The RTCGA family of R packages currently
provides datasets from the 2015-11-01 release, but one can check that there
is already a newer release, 2016-01-28:
tail(RTCGA::checkTCGA('Dates'))
[1] "2015-02-04" "2015-04-02" "2015-06-01" "2015-08-21" "2015-11-01"
"2016-01-28"
I am wondering whether we should upload the newer datasets to those data
packages. We have found that the results of data analysis differ greatly
depending on which release date the datasets were taken from. More
about this issue can be found here:
http://rtcga.github.io/RTCGA/Usecases.html#tcga-and-the-curse-of-bigdata
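The release check described above can be sketched in base R. This is only an illustration: the dates are copied from the checkTCGA('Dates') output shown earlier instead of being fetched, and the variable names are made up for the example.

```r
# Release dates copied from the output above (not fetched live).
release_dates <- as.Date(c("2015-02-04", "2015-04-02", "2015-06-01",
                           "2015-08-21", "2015-11-01", "2016-01-28"))

# Snapshot currently shipped in the BiocDevel data packages.
packaged_snapshot <- as.Date("2015-11-01")

# Releases published after the packaged snapshot.
newer <- release_dates[release_dates > packaged_snapshot]
print(newer)
```

A non-empty `newer` means the packaged data lag behind the latest TCGA release.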
The current state of the RTCGA family of R packages is listed below:
RTCGA.clinical
<http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.clinical.html>
- BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
- BiocDevel: snapshot from 2015-11-01 || package ver 20151101.1.0
RTCGA.rnaseq
<http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.rnaseq.html>
- BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
- BiocDevel: snapshot from 2015-11-01 || package ver 20151101.0.0
RTCGA.mutations
<http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.mutations.html>
- BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
- BiocDevel: snapshot from 2015-11-01 || package ver 20151101.0.0
---------------------------------------------------
RTCGA.methylation
<http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.methylation.html>
- BiocRelease: NOT YET AVAILABLE
- BiocDevel: snapshot from 2015-11-0 || package ver 0.99.1
RTCGA.CNV
<http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.CNV.html>
- BiocRelease: NOT YET AVAILABLE
- BiocDevel: snapshot from 2015-11-0 || package ver 0.99.5
RTCGA.RPPA
<http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.RPPA.html>
- BiocRelease: NOT YET AVAILABLE
- BiocDevel: snapshot from 2015-11-0 || package ver 0.99.6
RTCGA.mRNA
<http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.mRNA.html>
- BiocRelease: NOT YET AVAILABLE
- BiocDevel: snapshot from 2015-11-0 || package ver 0.99.3
RTCGA.miRNASeq
<http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.miRNASeq.html>
- BiocRelease: NOT YET AVAILABLE
- BiocDevel: snapshot from 2015-11-0 || package ver 0.99.4
I think that having datasets from the newest snapshot date is vital for
data analysis, but I wouldn't like to create situations in which 2 separate
analysts use RTCGA.clinical and get different results because they used
different data versions. That's why I have started versioning data packages
with a number that corresponds to the release date.
This isn't very helpful. There is only ever one version of
'RTCGA.clinical' available per Bioc version, so whether its version is
20151101.1.0 or 1.1.0 wouldn't make a difference to the end user.
Probably you want to include the TCGA release in the package _name_,
'RTCGA.clinical.20151101'. Probably you want to have multiple versions
available at any one time.
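The difference between the two schemes can be sketched as follows. Both helpers are hypothetical and not part of RTCGA or Bioconductor; they only show that a date-prefixed version still belongs to a single package name, whereas a date-suffixed name yields distinct packages that could be installed side by side.

```r
# Hypothetical helpers for illustration only.

# Recover the TCGA snapshot date from a date-prefixed version string.
snapshot_from_version <- function(version) {
  major <- strsplit(version, ".", fixed = TRUE)[[1]][1]
  as.Date(major, format = "%Y%m%d")
}

# Build a package name that carries the snapshot date.
snapshot_package_name <- function(base, snapshot) {
  paste(base, format(as.Date(snapshot), "%Y%m%d"), sep = ".")
}

# One package name, so only one snapshot per Bioc version:
snapshot_from_version("20151101.1.0")

# Two package names, so two snapshots can coexist:
snapshot_package_name("RTCGA.clinical", "2015-11-01")
snapshot_package_name("RTCGA.clinical", "2016-01-28")
```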
I don't think the experiment data archive is the best solution for
distributing large collections of curated data. It places a burden on
our mirrors to sync the repository and on the svn repository to store
it. The packages are built twice weekly even though the data is very
static and in your case based on unchanging base R data structures. The
data are not very 'granular', even though you've done a good job of
making the individual data sets accessible, so a user interested in
ovarian cancers, say, would need to download all data anyway.
Instead I think that these should be ExperimentHub resources. How to add
resources is described in the vignette to the companion package
ExperimentHubData
http://bioconductor.org/packages/devel/bioc/html/ExperimentHubData.html
The data would be stored in Amazon S3 so globally accessible; it would
not be under version control. The ExperimentHub / AnnotationHub cache
would manage local versions, rather than R's package system.
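From the user's side, retrieving a hub resource might look roughly like the sketch below. The wrapper function and the "TCGA" search term are assumptions made for illustration; whether any RTCGA data appear in the hub depends on them being added. Only ExperimentHub() and query() are existing calls, and both need the ExperimentHub and AnnotationHub packages plus network access.

```r
# Hypothetical wrapper around the existing ExperimentHub() and query() calls.
find_hub_resources <- function(pattern = "TCGA") {
  hub <- ExperimentHub::ExperimentHub()  # contacts the hub, sets up the local cache
  AnnotationHub::query(hub, pattern)     # records whose metadata match the pattern
}

# Typical interactive use (requires network access):
# hits <- find_hub_resources("TCGA")
# hits                           # inspect the matching records
# x <- hits[[names(hits)[1]]]    # download, or read from the cache, the first one
```

Because the cache stores what has already been downloaded, repeated retrievals of the same record do not fetch the data again.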
ExperimentHub will be back in active development, including addition of
new resources, immediately after our next release, May 4, so the timing
is fairly good.
I think it is also worthwhile to discuss how you have chosen to
represent each of the data types, for instance the RNAseq data as a
samples x genes data.frame, whereas the Bioconductor convention would
store it primarily as a genes x samples matrix embedded in a
SummarizedExperiment (or at least make it available to the user in that
form; there are definitely advantages to keeping the serialized instance
as simple as possible).
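A minimal sketch of that conversion, using an invented toy table: the column names and values below are placeholders and do not come from RTCGA.rnaseq. The SummarizedExperiment step runs only if that package is installed.

```r
# Toy samples x genes data.frame (invented values, for illustration only).
rnaseq_df <- data.frame(
  sample_id = c("S1", "S2", "S3"),
  GENE_A = c(10, 0, 5),
  GENE_B = c(3, 7, 1),
  stringsAsFactors = FALSE
)

# Drop the identifier column, then transpose to genes x samples.
counts <- t(as.matrix(rnaseq_df[, -1]))
colnames(counts) <- rnaseq_df$sample_id

# Wrap the matrix in a SummarizedExperiment when the package is available.
if (requireNamespace("SummarizedExperiment", quietly = TRUE)) {
  se <- SummarizedExperiment::SummarizedExperiment(
    assays = list(counts = counts)
  )
}
```

The serialized object can stay a plain data.frame, with a conversion like this applied when the data are handed to the user.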
Martin Morgan
Bioconductor
What do you think about this issue? You can post advice here or on our
issue list: https://github.com/RTCGA/RTCGA/issues
Thanks for your comments,
Marcin
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel