Hi Uwe,

Thanks for this information, and it makes sense to me. Is there a preferred way to cache the data locally?
None of the ways that I can think of to cache the data sound particularly good, and I wonder if I'm missing something. The ideas that occur to me are:

1. Download the data into the package directory `path.package("datapkg")`, but that would require an action to be performed on package installation, and I'm unaware of any way to trigger an action on installation.

2. Have a user-specified cache directory (e.g. `options("datapkg_cache"="/my/cache/location")`), but that would require interaction on every use. (Not horrible, but it would likely significantly increase the number of user issues with the package.)

3. Have a user-specified cache directory like #2, but have it default to somewhere in the user's home directory, e.g. `file.path(Sys.getenv("HOME"), "datapkg_cache")`, if the option has not been set.

To me, #3 sounds best, but I'd like to be sure that I'm not missing something.

Thanks,

Bill

-----Original Message-----
From: Uwe Ligges <lig...@statistik.tu-dortmund.de>
Sent: Sunday, December 15, 2019 11:54 AM
To: b...@denney.ws; r-package-devel@r-project.org
Subject: Re: [R-pkg-devel] Large Data Package CRAN Preferences

Ideally you would host the data elsewhere and submit a CRAN package that allows users to easily get/merge/aggregate the data.

Best,
Uwe Ligges

On 12.12.2019 20:55, b...@denney.ws wrote:
> Hello,
>
> I have two questions about creating data packages for data that will
> be updated and in total are >5 MB in size.
>
> The first question is:
>
> The CRAN policy indicates that packages should generally be ≤5 MB in
> size. Within a package that I'm working on, I need access to data that
> are updated approximately quarterly, including the historical datasets
> (specifically, these are the SDTM and CDASH terminologies in
> https://evs.nci.nih.gov/ftp1/CDISC/SDTM/Archive/).
>
> Current individual data updates are approximately 1 MB each when
> saved as .RDS, and the total current set is about 20 MB.
> I think that the preferred way to generate these packages, since there
> will be future updates, is to generate one data package for each update
> and then have an umbrella package that depends on each of the
> individual data update packages. That seems like it would minimize
> space requirements on CRAN, since old data will probably never need to
> be updated (though I will need to access it).
>
> Is that an accurate summary of the best practice for creating these as
> a data package?
>
> And a second question is:
>
> Assuming the best practice is the one I described above, the typical
> need will be to combine the individual historical datasets for local
> use. An initial test of the time to combine the data indicates that it
> would take about one minute, but after combination, the result could
> be loaded much faster. I'd like to store the combined dataset locally
> with the umbrella package. I believe that it is considered poor form
> to write within a package's library location except during
> installation.
>
> What is the best practice for caching the resulting large dataset
> which is locally generated?
>
> Thanks,
>
> Bill
>
> ______________________________________________
> R-package-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-package-devel
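P.S. The option #3 approach above, combined with caching the combined dataset as an .rds file, could be sketched roughly like this. This is only a minimal sketch: the function names (`datapkg_cache_dir`, `datapkg_combined`), the `datapkg_cache` option name, and the cache file name are illustrative placeholders, not an existing API.

```r
# Sketch of option #3: a cache directory that the user can override via
# options("datapkg_cache" = ...), defaulting to a fixed location under
# the user's home directory if the option is unset.
datapkg_cache_dir <- function() {
  dir <- getOption(
    "datapkg_cache",
    default = file.path(Sys.getenv("HOME"), "datapkg_cache")
  )
  if (!dir.exists(dir)) {
    dir.create(dir, recursive = TRUE)
  }
  dir
}

# Load the combined dataset from the cache if present; otherwise build
# it with the supplied (slow, ~1 minute) combining function and save it
# as .rds so that subsequent loads are fast.
datapkg_combined <- function(combine_fun) {
  cache_file <- file.path(datapkg_cache_dir(), "combined.rds")
  if (file.exists(cache_file)) {
    readRDS(cache_file)
  } else {
    combined <- combine_fun()
    saveRDS(combined, cache_file)
    combined
  }
}
```

The first call pays the combination cost and writes the cache; later calls read the .rds directly, and this never writes inside the installed library location.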