On Tue, 4 Oct 2022 16:46:03 +0200 Vincent van Hees <vincentvanh...@gmail.com> wrote:
> Dear all,
>
> I am looking for guidance (blog posts / books / people with
> expertise) on how to split up an R package that has grown a lot in
> complexity and size. To make it worthwhile, the split needs to ease
> the maintenance and ongoing development.

<SNIP>

Here is some advice based on our experience in splitting the 'spatstat' package (over 170,000 lines of code, now split into 10 sub-packages, which took us about one person-year of work). See https://github.com/spatstat/spatstat.

1. Don't split your package unless you must.

Splitting a package into sub-packages takes considerable effort. Maintaining a set of sub-packages requires much more effort than maintaining a single large package. We estimate, quasi-seriously, that the amount of effort required is O(n^2) where n is the number of sub-packages. :-)

If you split a package, the CRAN servers will have less work, but almost everyone else (developers, maintainers, CRAN team, users) will have more work. You won't even reduce the number of emails from CRAN: the R package checker complains when a package is large, but it also complains when the package Depends on many sub-packages.

2. Design the split.

Do not start tinkering until you have a plan. Print out a list of the functions (or the R files and help files) in your package, and think about a simple rule for splitting/grouping them.

The rule for splitting the package needs to be simple and easy to apply for developers and users. For example in spatstat we separated 'exploratory' statistical summaries from 'parametric' statistical models because we can all remember what that means. (Note that *users* have to 'apply' the splitting rule in order to know where to find/look for a particular function after the package has been split.)

A good splitting rule is something to do with the fundamental purpose of each function.
The amount of trouble you will have after the split is related to the number of dependencies (between functions) that cross these boundaries, and the easiest way to minimise this is to group the functions according to their fundamental purpose.

Give plenty of notice to the maintainers of packages that depend on your package.

3. Use 'make' and 'filepp' to implement the split.

Leave the original source files where they are. Maintain the original source files as the master copies (i.e. bugs are fixed in these original files). For each sub-package, set up a new folder/directory with a Makefile that copies selected source files from the original package into the new directory. The Makefile can include rules that invoke 'filepp' to filter the source files. Arguments to the 'filepp' call can specify the names of variables that will then be substituted into the source code, or used as variables in 'if/then' directives to switch on/off blocks of code.

This setup makes it much easier to keep track of the fate of each file, and to change your mind if needed.

The "make" tool is extremely powerful and useful, and is ubiquitously applied by software developers. However its syntax is not perspicuous, and can be daunting until you become experienced. If you are not completely comfortable with "make" you might find the tutorials at

   https://makefiletutorial.com
   https://cs.colby.edu/maxwell/courses/tutorials/maketutor

to be helpful. For information on filepp see

   https://www-users.york.ac.uk/~dm26/filepp

4. Do the split offline.

Develop the sub-packages on your own machine until they all pass the package checker.

5. Consider the sequence of steps to get the packages on CRAN.

CRAN has no mechanism for submitting a set of packages at the same time. Each submission is checked individually, on several different servers, using several versions of R, using the packages that are installed on that server.
Hence the submission of your new sub-packages must be carried out according to a carefully considered incremental process.

Problems that can occur include:

a. Incompatibility between your new submission and the packages currently on the particular server.

b. Cycles (loops) in the dependence graph. The dependence between functions in the packages may include loops where A depends on B which depends on C which depends on A, etc.

c. Hard crashes. Crashes can occur if you use compiled functions (e.g. C language) or if your package is byte-compiled. In either case, changes to the interface (argument sequence) of compiled or byte-compiled code in one sub-package can result in an error or hard crash when another sub-package tries to call a function using the wrong interface.

There is no sure way to prevent these happening. The best defence is to use the version number dependency rules in the DESCRIPTION file (to prevent the use of incompatible packages), and to allow about a week for each submitted package to propagate through the CRAN testing network (to ensure that the latest versions are used). Despite this, you can expect to have correspondence with CRAN about such problems.

Allow plenty of time between submitting successive sub-packages. Give plenty of notice to users and maintainers of dependent packages.

Hope this helps.

cheers,

Rolf Turner (on behalf of Adrian Baddeley and Ege Rubak)

P.S. I hope that this posting is not too late to be useful. The lateness is entirely the fault of Rolf Turner.

R.T.

-- 
Honorary Research Fellow
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276

______________________________________________
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel
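
Illustration for point 5: the "version number dependency rules in the DESCRIPTION file" could look roughly like the fragment below. This is only a sketch. The package names (mypkg.model, mypkg.core, mypkg.utils) and every version number are hypothetical, chosen for illustration, and are not taken from spatstat.

```
Package: mypkg.model
Title: Hypothetical Sub-Package (Illustration Only)
Version: 2.1-0
Depends:
    R (>= 3.5.0),
    mypkg.core (>= 2.1-0)
Imports:
    mypkg.utils (>= 2.0-3)
```

The intent is that whenever the interface of a function in mypkg.core changes, the minimum version required by each sub-package that calls it is raised in the same release. An out-of-date combination should then be refused when the package is installed or loaded, rather than failing later with an error or crash.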