Hello,

I would like to ask you all for advice on the following issue.

Last year I started working with data from The Cancer Genome Atlas (TCGA). During that work our team (https://github.com/orgs/RTCGA/people) prepared some tools for downloading and integrating datasets from the TCGA study and provided them in the R package RTCGA <https://www.bioconductor.org/packages/3.3/bioc/html/RTCGA.html>, which is available on Bioconductor. Later on we worked on tools for visualizing and analyzing the most popular TCGA datasets, so we prepared data packages with those datasets and submitted them to Bioconductor as 8 separate packages. You can read more about them here: http://rtcga.github.io/RTCGA/

*I have a question about updating those data packages.*

TCGA releases dataset snapshots over time. The RTCGA family of R packages provides datasets from the 2015-11-01 release, but one can check that there is now a newer release, 2016-01-28:

> tail(RTCGA::checkTCGA('Dates'))
[1] "2015-02-04" "2015-04-02" "2015-06-01" "2015-08-21" "2015-11-01" "2016-01-28"

I am wondering whether we should upload newer datasets to those data packages. We have found that the results of a data analysis can differ greatly depending on which release date the datasets were taken from.

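To make the comparison concrete, here is a minimal sketch of how one might check whether a packaged snapshot is behind the newest TCGA release. The dates are copied from the checkTCGA('Dates') output above so that the snippet runs offline; the variable names (available, packaged, newer) are only illustrative and are not part of RTCGA.

```r
# Release dates as printed by tail(RTCGA::checkTCGA('Dates')) above.
# In practice this vector would come from that call, which needs network access.
available <- as.Date(c("2015-02-04", "2015-04-02", "2015-06-01",
                       "2015-08-21", "2015-11-01", "2016-01-28"))

# Snapshot date of the datasets currently shipped in the data packages.
packaged <- as.Date("2015-11-01")

# Releases that are newer than the packaged snapshot.
newer <- available[available > packaged]

if (length(newer) > 0) {
  message("Newer TCGA release(s) available: ",
          paste(format(newer), collapse = ", "))
} else {
  message("Packaged snapshot is the newest release.")
}
```

A check like this only tells you that a newer snapshot exists; it says nothing about how much the data changed between releases.
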
More about the differences between release dates can be found here: http://rtcga.github.io/RTCGA/Usecases.html#tcga-and-the-curse-of-bigdata

The current state of the RTCGA family of R packages is listed below.

RTCGA.clinical <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.clinical.html>
  - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
  - BiocDevel: snapshot from 2015-11-01 || package ver 20151101.1.0

RTCGA.rnaseq <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.rnaseq.html>
  - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
  - BiocDevel: snapshot from 2015-11-01 || package ver 20151101.0.0

RTCGA.mutations <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.mutations.html>
  - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
  - BiocDevel: snapshot from 2015-11-01 || package ver 20151101.0.0

---------------------------------------------------

RTCGA.methylation <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.methylation.html>
  - BiocRelease: NOT YET AVAILABLE
  - BiocDevel: snapshot from 2015-11-0 || package ver 0.99.1

RTCGA.CNV <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.CNV.html>
  - BiocRelease: NOT YET AVAILABLE
  - BiocDevel: snapshot from 2015-11-0 || package ver 0.99.5

RTCGA.RPPA <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.RPPA.html>
  - BiocRelease: NOT YET AVAILABLE
  - BiocDevel: snapshot from 2015-11-0 || package ver 0.99.6

RTCGA.mRNA <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.mRNA.html>
  - BiocRelease: NOT YET AVAILABLE
  - BiocDevel: snapshot from 2015-11-0 || package ver 0.99.3

RTCGA.miRNASeq <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.miRNASeq.html>
  - BiocRelease: NOT YET AVAILABLE
  - BiocDevel: snapshot from 2015-11-0 || package ver 0.99.4

I think that having datasets from the newest snapshot date is vital for data analysis, but I would not like to create a situation in which two analysts use RTCGA.clinical and get different results because they used different data versions. That is why I have started versioning the data packages with a number that corresponds to the release date.

What do you think about this issue? You can post advice here or on our issue list: https://github.com/RTCGA/RTCGA/issues

Thanks for your comments,
Marcin

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel