Mike,

On time classes specifically, the lubridate package with documentation

  Garrett Grolemund, Hadley Wickham (2011). Dates and Times Made Easy
  with lubridate. Journal of Statistical Software, 40(3), 1-25.
  http://www.jstatsoft.org/v40/i03/

solves many confusion problems. Does it handle the problems you are
reporting?

Rich

On Thu, Nov 3, 2011 at 7:49 PM, Mike Williamson <this.is....@gmail.com> wrote:

> Hi Joshua,
>
> Thank you for the input!
>
> I agree that it is non-trivial to solve the cases you & I have posed.
> However, I would wholeheartedly support having an error spit back for any
> function that does not explicitly support a class. In this case, if I
> attempt to do sapply(x, class), and 'x' is of class "difftime", then I
> should receive an error "sapply cannot function upon class 'difftime' ".
> Why do I take this stance? There are at least 2 strong reasons:
>
>    - Most importantly, an incorrect answer is far more dangerous than no
>    answer. E.g., if I ask "what is 3 + 3?", I would far prefer to receive
>    "I don't know" than "5". The former lets me know I need to choose
>    another path, the latter mistakenly makes me think I have an answer,
>    when I do not, and I continue with analyses on the assumption that
>    answer is correct. In the case of dates, this happens often. E.g., is
>    the numeric that is returned from sapply, for instance, the # of
>    seconds since 1970-01-01, or the number of days since 1970-01-01?
>    This depends upon how 'R' internally attempts to fix any
>    incongruities.
>    - But also very significantly, an error will get me in the habit of
>    avoiding any marginalized class types. I keep thinking, for instance,
>    that I can use the "Date" class, since 'R' says that it supports it.
>    But if I got into the habit of converting all dates into numerics
>    myself beforehand (maybe counting the number of seconds from
>    1970-01-01, since that seems a magic date), then I would be guaranteed
>    that a function will either (a) cause an error (e.g., if I try a
>    character function on it), or (b) function properly.
>    However, since I don't overtly receive errors when attempting to use
>    dates (or difftimes, or factors, or whatever), I keep using them,
>    instead of relying solely upon the true & trusted classes.
>       - The trickiest here is really factors. Factors are, by most
>       accounts, considered a core class. In some cases, you can only use
>       factors, e.g., when you want some sort of ordinal categorical
>       variable. Therefore, the fact that factors also barf similarly to
>       other classes like difftime (albeit much more rarely) is especially
>       dangerous.
>
> There are, of course, habits that we can create to make ourselves
> better programmers, and I will recognize that I can improve. However, this
> issue of functions generating "wrong" answers is such a *huge* problem with
> 'R', and other languages are catching up to 'R' so quickly, as far as their
> capability to handle higher level math, that this issue is making 'R' a
> less desirable language to use, as time progresses. I don't mean to claim
> that my opinion is the end-all-be-all, but I would like to hear others
> chime in, whether this is a large concern, or whether there is a very small
> minority of folks impacted by it.
>
> Regards,
> Mike
>
> ---
> XKCD <http://www.xkcd.com>
>
> On Thu, Nov 3, 2011 at 2:51 PM, Joshua Wiley <jwiley.ps...@gmail.com>
> wrote:
>
> > Hi Mike,
> >
> > This isn't really an answer to your question, but perhaps will serve
> > to continue discussion. I think that there are some fundamental
> > issues when working with special classes. As a thought example,
> > suppose I wrote a class, "posreal", which inherits from the numeric
> > class. It is only valid for positive, real numbers. I use it in a
> > package, but do not develop methods for it. A user comes along and
> > creates a vector, x, that is a posreal. Then tries: mean(x * -3).
> > Since I never bothered to write a special method for mean for my
> > class, R falls back to the inherited numeric method, but gives a value
> > that is clearly not valid for posreal. What should happen? S3 methods
> > do not really have validation, so in principle, one could write a
> > function like:
> >
> > f <- function(x) {
> >   vclass <- class(x)
> >   res <- mean(x)
> >   class(res) <- vclass
> >   return(res)
> > }
> >
> > which "retains" the appropriate class, but in name only. R core
> > cannot possibly know or imagine all classes that may be written that
> > inherit from more basic types but with possible special aspects and
> > requirements. I think the inherited class is considered to be more
> > generic, and that is what is returned. It is usually up to the user to
> > ensure that the function (whose methods were not specific to that
> > special class but to the inherited one) is valid for that class, and
> > the user can manually convert the result back:
> >
> > res <- as.posreal(res)
> >
> > What about lapply and sapply? Neither is generic or has methods for
> > difftime, and so they do some unexpected/undesirable things. Again,
> > without methods defined for a particular class, they cannot know what
> > is special about it or the appropriate way to handle it; they use
> > defaults, which sometimes work but may give unexpected or undesirable
> > results. But what else can be done? (Okay, they could just throw an
> > error.) If a function is naive about a class, it does not seem right
> > to operate on it using unknown methods and then pretend to be
> > returning the same type of data. As it stands, they convert to a data
> > type they know and return that.
> >
> > Now, you mention that for loops are slow in R, and this is true to a
> > degree. However, the *apply functions are basically just internal
> > loops, so they do not really save you (they are certainly not
> > vectorized!), though they are more elegant than explicit loops IMO.
> > One way to use them while retaining class would be like:
> >
> > sapply(seq_along(test), function(i) class(test[i]))
> >
> > This is less efficient than sapply(test, class), but the relative
> > overhead drops considerably as the function does nontrivial
> > calculations. Finally, I find the (relatively) new compiler package
> > really shines at making functions that are just wrappers for for loops
> > more efficient. Take a look at the examples from:
> >
> > require(compiler)
> > ?cmpfun
> >
> > I am not familiar with NumPy so I do not know how it handles new
> > classes, but with some tweaks to my workflow, I do not find myself
> > running into problems with how R handles them. I definitely
> > appreciate your position because I have been there... as I became more
> > familiar with R, classes, and methods, I found I work in a way that
> > avoids passing objects to functions that do not know how to handle
> > them properly.
> >
> > Cheers,
> >
> > Josh
> >
> > On Thu, Nov 3, 2011 at 11:08 AM, Mike Williamson <this.is....@gmail.com>
> > wrote:
> > > Hi All,
> > >
> > > I don't have an "I need help" question, so much as a query into any
> > > update whether 'R' has made any progress with some of the core
> > > functions retaining classes. As an example, because it's one of the
> > > cases that most egregiously impacts me & my work and keeps pushing me
> > > away from 'R' and into other numerical languages (such as NumPy in
> > > Python), I will use sapply / lapply to demonstrate, but this behavior
> > > is ubiquitous throughout 'R'.
> > >
> > > Let's say I have a class which is theoretically supported, but not
> > > one of the core "numeric" or "character" classes (and, to some degree,
> > > "factor" classes). Many of the basic functions will convert my desired
> > > class into either numeric or character, so that my returned answer is
> > > gibberish.
> > >
> > > E.g.:
> > >
> > > test <- as.difftime(c(1, 1, 8, 0.25, 8, 1.25), units = "days")  ## create a small array of time differences
> > > class(test)          ## this will return the proper class, "difftime"
> > > class(test[1])       ## this will also return the proper class, "difftime"
> > > sapply(test, class)  ## this will return *numerics* for all of the classes. Ack!!
> > >
> > > In the example I give above, the impact might seem small, but the
> > > implications are *huge*. This means that I am, in effect, not allowed
> > > to use *any* of the vectorizing functions in 'R', which avoid
> > > performing loops, thereby speeding up process time extraordinarily.
> > > Many can sympathize that 'R' is ridiculously slow with "for" loops,
> > > compared to other languages. But that's theoretically OK: a good
> > > statistician or data analyst should be able to work comfortably with
> > > matrices and vectors. However, *'R' cannot work comfortably* with
> > > matrices or vectors, *unless* they are using the numeric or character
> > > classes. Many of the classes suffer the problem I just described,
> > > although I only used "difftime" in the example. Factors seem a bit
> > > more "comfortable", and can be handled most of the time, but not as
> > > well as numerics, and at times functions working on factors can
> > > return the numerical representation of the factor instead of the
> > > original factor.
> > >
> > > Is there any progress in guaranteeing that all core functions either
> > > (a) ideally return exactly the classes, and hierarchy of classes, that
> > > they received (e.g., a list of data frames with difftimes & dates &
> > > characters would return a list of data frames with difftimes & dates
> > > & characters), or (b) barring that, the function should at least
> > > error out with a clear error explaining that sapply, for example,
> > > cannot vectorize on the class being used?
> > > Returning incorrect answers is far worse than returning an error,
> > > from a perspective of stability.
> > >
> > > This is, by far, the largest Achilles' heel to 'R'. Personally, as my
> > > career advances and I work on more technical things, I am finding that
> > > I have to leave 'R' by the wayside and use other languages for robust
> > > numerical calculations and programming. This saddens me, because
> > > there are so many wonderful packages developed by the community. The
> > > example above came up because I am using the "forecast" library to
> > > great effect in predicting how long our product cycle time will be.
> > > However, I spend much of my time fighting all these class & typing
> > > bugs in 'R' (and we have to start recognizing that they are bugs,
> > > otherwise they may never get resolved), such that many of the
> > > improvements in my productivity due to all the wonderful computational
> > > packages are entirely offset by the time I spend fighting this issue
> > > of poor classes.
> > >
> > > Thanks & Regards!
> > > Mike
> > >
> > > ---
> > > XKCD <http://www.xkcd.com>
> > >
> > > [[alternative HTML version deleted]]
> > >
> > > ______________________________________________
> > > r-h...@r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide
> > > http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> >
> > --
> > Joshua Wiley
> > Ph.D. Student, Health Psychology
> > Programmer Analyst II, ATS Statistical Consulting Group
> > University of California, Los Angeles
> > https://joshuawiley.com/

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
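
For anyone reading this thread later, here is a minimal sketch of the
points discussed above, using the same 'test' vector as the original post.
It is only an illustration under stated assumptions, not a recommended
fix: the variable names (elem_classes, doubled_raw, doubled,
days_since_epoch, secs_since_epoch) are made up for this example, and the
sketch deliberately does not rely on what sapply(test, class) itself
returns, since that may depend on the R version.

```r
# Time differences from the original post.
test <- as.difftime(c(1, 1, 8, 0.25, 8, 1.25), units = "days")

# Indexing with `[` keeps the class, so iterating over positions
# (as suggested in the thread) lets the function see real difftime objects.
elem_classes <- vapply(seq_along(test),
                       function(i) class(test[i]),
                       character(1))

# Even so, sapply() simplifies its result into a plain vector, which
# drops attributes: the doubled values come back as bare numbers.
doubled_raw <- sapply(seq_along(test), function(i) test[i] * 2)

# Restore the class explicitly, reusing the units of the input.
doubled <- as.difftime(doubled_raw, units = units(test))

# Why a silently dropped class matters: the meaning of the bare number
# depends on the class it came from.
days_since_epoch <- as.numeric(as.Date("1970-01-02"))                 # counted in days
secs_since_epoch <- as.numeric(as.POSIXct("1970-01-02", tz = "UTC"))  # counted in seconds

print(elem_classes)
print(class(doubled_raw))
print(doubled)
```

The explicit as.difftime() call is the "manually convert it back" step
mentioned in the thread; it is only valid here because doubling a time
difference still yields a time difference in the same units.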