Good stuff, Max!

Would also be nice to nail your 14 theses to a more permanent wall than
the r-help mailing list ... not sure where that would be, though ...
isn't someone supposed to be redesigning the r-project.org website?
[I jest, I jest]

More seriously, though, it might be worth linking to the list from the
developer.r-project.org site as well as from some blurb in the header
of the ML task view.

-steve

On Wed, Jan 4, 2012 at 9:19 AM, Max Kuhn <mxk...@gmail.com> wrote:
> Working on the caret package has exposed me to the wide variety of
> approaches that different authors have taken to creating predictive
> modeling functions (aka machine learning)(aka pattern recognition).
>
> I suspect that many package authors are neophyte R users and are
> stumbling through the process of writing their first R package (or R
> code). As such, they may not have been exposed to some of the informal
> conventions that have evolved over time. Also, their package may be
> intended to demonstrate their research and not for "production"
> modeling. In any case, it might be a good idea to print up a few
> points for consideration when creating a predictive modeling package.
> I don't propose changes to existing code.
>
> Some of this is obvious and not limited to this class of modeling
> packages. Many of these points are arguable, so please argue them.
>
> If this seems useful, perhaps we could repost the final list to R-Help
> to use as a checklist.
>
> Those of you who have used my code will probably realize that I am not
> a grand architect of R packages =] I'd love to get feedback from those
> of you with a broader perspective and better software engineering
> skills than I (a low bar to step over).
>
> I have marked a few of these items with an OCD tag since I might be
> taking it a bit too far.
>
> The list:
>
> (1) Extend the work of others. There is an amazing amount of unneeded
> redundancy. There are plenty of times that users implement their own
> version of a function because there is a missing feature, but a lot
> of time is spent re-creating duplicate functions. For example, kernlab
> has an excellent set of kernel functions that are really efficient and
> have useful ancillary functions. People may not be aware of these
> functions, but they are one RSiteSearch away. (Perhaps we could
> nominate a few packages like kernlab that implement a specific tool
> well)
>
> (2) When modeling a categorical outcome, use a factor as input (as
> opposed to 0/1 indicators or integers). Factors are exactly the kind
> of feature that separates R from other languages (I'm looking at you,
> SAS) and are a natural structure for this data type.
>
> corollary (2a): save the factor levels in the model object somewhere
>
> corollary (2b): return predicted classes as factors with the same
> levels (and ordering of levels).
>
> (3) Implement a separate prediction function. Some packages only make
> predictions when the model is built, so effectively the model cannot
> be used at any point in the future.
>
> corollary (3a): use object-orientation (eg. predict.{class}) and not
> some made-up function name "modelPredict()" for predicting new
> samples.
>
> (4) If the method only accepts a specific type of input (eg. matrix or
> data frame), please do the conversion whenever appropriate.
>
> (5) Provide a formula interface (eg. foo(y~x, data = dat)) and
> non-formula interface (foo(x, y)) to the function. Formula methods are
> really inefficient at this time for high-dimensional data but are
> fantastically convenient. There are some good reasons to not use
> formulas, such as functions that do not use a design matrix (eg.
> cforest()) or need factors to be handled in a non-standard way (eg.
> cubist()).
>
> (6) Don't require a test set when model building.
>
> (7) Control all written output during model-building time with a
> verbose option. Resampling can make a mess out of things if
> output/logging is always exposed.
>
> (8) Please use RSiteSearch to avoid name collisions between packages
> (eg. gam(), splsda(), roc(), LogitBoost()). Also search Bioconductor.
>
> (9) Allow the predict function to generate results from many different
> sub-models simultaneously. For example, pls() can return predictions
> across many values of ncomp. enet(), cubist(), blackboost() are other
> examples.
>
> corollary (9a): [OCD] ensure the same object type for predictions.
> There are occasions where predict() will return a vector or a matrix
> depending on the context. I would argue that this is not optimal.
>
> (10) Use a limited vocabulary for options. For example, some predict()
> functions have a "type" option to switch between predicted classes
> and class probabilities. Values of "type" pertaining to class
> probabilities include "prob", "probability", "posterior", "raw",
> "response", etc. I'll make a suggestion of "prob" as a possible
> standard for this situation.
>
> (11) Make sure that class probabilities sum to one. Seriously.
>
> (12) If the model implicitly conducts feature selection, do not
> require unused predictors to be present in future data sets for
> prediction. This may be a problem when the formula interface to models
> is used, but it looks like many functions reference columns by
> position and not name.
>
> (13) Packages that have their own cross-validation functions should
> allow the users to pass in the specific folds/resampling indicators to
> maintain consistency across similar functions in other packages.
>
> (14) [OCD] For binary classification models, model the probability of
> the first level of a factor as the event of interest (again, for
> consistency). Note that glm() does not do this but most others use the
> first level.
>
> Thanks,
>
> Max
>
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
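
For concreteness, here is a rough base-R sketch of how points (2), (2a),
(2b), (3a), (4), (5), (7), (10), (11) and (12) from the list might look
together. It is only an illustration, not code from caret or from any
package named above. Everything called toyclass is hypothetical: a
made-up nearest-centroid classifier. The "probabilities" are just a
softmax of negative squared distances, chosen so that there is something
to normalize, and the sketch assumes numeric predictors with column names.

```r
# Hypothetical toy classifier (nearest centroid); for illustration only.
toyclass <- function(x, ...) UseMethod("toyclass")

# Non-formula interface, point (5): toyclass(x, y)
toyclass.default <- function(x, y, verbose = FALSE, ...) {
  x <- as.matrix(x)                          # point (4): convert the input ourselves
  stopifnot(is.factor(y),                    # point (2): outcome must be a factor
            nrow(x) == length(y),
            !is.null(colnames(x)))           # assumption: predictors are named
  if (verbose) message("fitting ", nlevels(y), " centroids")   # point (7)
  centroids <- do.call(rbind, lapply(levels(y), function(lv)
    colMeans(x[y == lv, , drop = FALSE])))
  rownames(centroids) <- levels(y)
  # point (2a): keep the factor levels in the model object
  structure(list(centroids = centroids, levels = levels(y)),
            class = "toyclass")
}

# Formula interface, point (5): toyclass(y ~ ., data = dat)
toyclass.formula <- function(formula, data, ...) {
  mf <- model.frame(formula, data)
  x  <- model.matrix(attr(mf, "terms"), mf)[, -1, drop = FALSE]  # drop intercept
  toyclass.default(x, model.response(mf), ...)
}

# point (3a): an S3 predict method instead of a made-up function name
predict.toyclass <- function(object, newdata, type = c("class", "prob"), ...) {
  type <- match.arg(type)                    # point (10): small, fixed vocabulary
  # point (12): pick columns by name, so extra columns in newdata are ignored
  x <- as.matrix(newdata[, colnames(object$centroids), drop = FALSE])
  # squared Euclidean distance from each row to each class centroid
  d2 <- vapply(object$levels, function(lv)
    rowSums(sweep(x, 2, object$centroids[lv, ])^2), numeric(nrow(x)))
  d2 <- matrix(d2, nrow = nrow(x), dimnames = list(NULL, object$levels))
  w    <- exp(-(d2 - apply(d2, 1, min)))     # subtract row minimum for stability
  prob <- w / rowSums(w)                     # point (11): each row sums to one
  if (type == "prob") return(prob)
  # point (2b): a factor with the same levels, in the same order
  factor(object$levels[max.col(prob, ties.method = "first")],
         levels = object$levels)
}

fit  <- toyclass(iris[, 1:4], iris$Species)
pred <- predict(fit, iris)                   # the Species column is simply ignored
prob <- predict(fit, iris, type = "prob")
fit2 <- toyclass(Species ~ ., data = iris)   # same model via the formula interface
```

In this sketch type = "class" always returns a factor and type = "prob"
always returns a matrix with one column per level, even for a single new
sample, which is the kind of consistency corollary (9a) asks for.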