Hi Uwe,

Thanks for this information, and it makes sense to me. Is there a preferred way to cache the data locally?
None of the ways that I can think of to cache the data sound particularly good, and I wonder if I'm missing something. The ideas that occur to me are:

1. Download the data into the package directory `path.package("datapkg")`, but that would require an action to be performed on package installation, and I'm unaware of any way to trigger an action on installation.

2. Have a user-specified cache directory (e.g. `options("datapkg_cache"="/my/cache/location")`), but that would require interaction on every use. (Not horrible, but it would likely significantly increase the number of user issues with the package.)

3. Have a user-specified cache directory like #2, but have it default to somewhere in the user's home directory, e.g. `file.path(Sys.getenv("HOME"), "datapkg_cache")`, if the option has not been set.

To me, #3 sounds best, but I'd like to be sure that I'm not missing something.

Thanks,

Bill

-----Original Message-----
From: Uwe Ligges <lig...@statistik.tu-dortmund.de>
Sent: Sunday, December 15, 2019 11:54 AM
To: b...@denney.ws; r-package-devel@r-project.org
Subject: Re: [R-pkg-devel] Large Data Package CRAN Preferences

Ideally you would host the data elsewhere and submit a CRAN package that allows users to easily get/merge/aggregate the data.

Best,
Uwe Ligges

On 12.12.2019 20:55, b...@denney.ws wrote:
> Hello,
>
> I have two questions about creating data packages for data that will
> be updated and in total are >5 MB in size.
>
> The first question is:
>
> The CRAN policy indicates that packages should generally be ≤5 MB in
> size. Within a package that I'm working on, I need access to data that
> are updated approximately quarterly, including the historical datasets
> (specifically, these are the SDTM and CDASH terminologies in
> https://evs.nci.nih.gov/ftp1/CDISC/SDTM/Archive/).
>
> Current individual data updates are approximately 1 MB each when
> saved as .RDS, and the total current set is about 20 MB.
> I think that the preferred way to generate these packages, since there
> will be future updates, is to generate one data package for each update
> and then have an umbrella package that depends on each of the
> individual data update packages. That seems like it would minimize
> space requirements on CRAN, since old data will probably never need to
> be updated (though I will need to access it).
>
> Is that an accurate summary of the best practice for creating these as
> a data package?
>
> And a second question is:
>
> Assuming the best practice is the one I described above, the typical
> need will be to combine the individual historical datasets for local
> use. An initial test of the time to combine the data indicates that it
> would take about one minute, but after combination, the result could
> be loaded much faster. I'd like to store the combined dataset locally
> with the umbrella package. I believe that it is considered poor form
> to write within a package's library location except during
> installation.
>
> What is the best practice for caching the resulting large dataset
> which is locally generated?
>
> Thanks,
>
> Bill
>
> ______________________________________________
> R-package-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-package-devel
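P.S. The option #3 approach above, combined with caching the combined dataset as an .rds file, could be sketched roughly like this. This is only a minimal sketch: the function names (`datapkg_cache_dir`, `datapkg_combined`), the `datapkg_cache` option name, and the cache file name are illustrative placeholders, not an existing API.

```r
# Sketch of option #3: a cache directory that the user can override via
# options("datapkg_cache" = ...), defaulting to a fixed location under
# the user's home directory if the option is unset.
datapkg_cache_dir <- function() {
  dir <- getOption(
    "datapkg_cache",
    default = file.path(Sys.getenv("HOME"), "datapkg_cache")
  )
  if (!dir.exists(dir)) {
    dir.create(dir, recursive = TRUE)
  }
  dir
}

# Load the combined dataset from the cache if present; otherwise build
# it with the supplied (slow, ~1 minute) combining function and save it
# as .rds so that subsequent loads are fast.
datapkg_combined <- function(combine_fun) {
  cache_file <- file.path(datapkg_cache_dir(), "combined.rds")
  if (file.exists(cache_file)) {
    readRDS(cache_file)
  } else {
    combined <- combine_fun()
    saveRDS(combined, cache_file)
    combined
  }
}
```

The first call pays the combination cost and writes the cache; later calls read the .rds directly, and this never writes inside the installed library location.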