Re: [R] Variable passed to function not used in function in select=... in subset

Duncan Murdoch Tue, 11 Nov 2008 06:38:50 -0800

On 11/11/2008 8:53 AM, hadley wickham wrote:

On Mon, Nov 10, 2008 at 1:04 PM, Wacek Kusnierczyk
<[EMAIL PROTECTED]> wrote:

pardon me, but does this address in any way the legitimate complaint of
the rightfully confused user?


consider the following:

d = data.frame(a=1, b=2)
a = c("a", "b")
z = a
# that is, both a and z are c("a", "b")

subset(d, select=z)
# gives two columns, since z is a two element vector whose elements are
# valid column names

subset(d, select=a)
# gives one column, since 'a' (but not a) is a valid column name

subset(d, select=c(a,b))
# gives two columns
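
For readers puzzling over the three results above, here is a minimal sketch of the lookup that select= appears to perform. It is modelled on how subset.data.frame is written in current R, not a copy of it, and select_cols is a hypothetical name used only for illustration: column names are mapped to their positions, and the unevaluated select expression is evaluated with that mapping placed in front of the caller's environment.

```r
d <- data.frame(a = 1, b = 2)
a <- c("a", "b")
z <- a

# Hypothetical helper (not part of R): mimics the select= lookup only.
select_cols <- function(x, select) {
  positions <- as.list(seq_along(x))   # list(1L, 2L)
  names(positions) <- names(x)         # list(a = 1L, b = 2L)
  # Names in the expression are looked up among the column names first,
  # then in the environment the function was called from.
  eval(substitute(select), positions, parent.frame())
}

by_z    <- select_cols(d, z)        # z is not a column: the caller's z is used
by_a    <- select_cols(d, a)        # a is a column: the caller's a is ignored
by_both <- select_cols(d, c(a, b))  # both names resolve to column positions
```

Under this reading, a and z differ only because one of them happens to share its name with a column.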


this is certainly what the authors intended, and they may have good
grounds for this smart design.  but this must break the expectation of a
naive (r-naive, for that matter) user, who may otherwise have excellent
experience in using a functional programming language, e.g., scheme.
(especially scheme, where symbols and expressions are first-class
objects, yet the distinction between a symbol or an expression and their
referent is made painfully clear, perhaps except for when one hacks with
macros.)

the examples above illustrate the notorious problem with r that one can
never tell whether 'a' means "the value referred to with the identifier
'a'" or "the symbol 'a'", unless one gets ugly surprises and is forced
to study the documentation.  and even then one may not get a clear answer.


I agree, with some caveats.  There are basically two uses of R: as an
interactive data analysis package and as a statistical programming
language.  These uses come into conflict: in the interactive
environment, you want to minimise typing so that you can be as speedy
as possible.  It doesn't matter if R occasionally makes a wrong guess
when you have specified something implicitly, because you can fix it
on the fly.  When you are programming, you care less about saving
typing and more about reproducibility.  You want to be explicit so
your function is robust to widely varying inputs, even if it means you
have to type a lot more.  You see this tension in quite a few places:

 * drop = T
 * functions that return different types of output (e.g. sapply)
depending on input parameters
 * partial matching of argument names
 * using unevaluated expressions instead of strings (e.g. library, subset, ...)

These are all things that are helpful for interactive use, but make
life as a programmer more difficult.  I find the last one particularly
frustrating because it means it is very difficult to program with some
functions (e.g. subset) without resorting to complex quoting,
substituting and evaluating tricks.  I have tried to steer away from
this technique in my packages, and where it's just too convenient for
interactive use, insulating the deparsing into special functions that
the data analyst must use (e.g. aes() in ggplot, and .() in plyr),
along with providing alternatives for the programmer.
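
The first three items in the list above can be reproduced in a few lines of base R. This is only an illustrative sketch: the object names and the function f are made up for the example, and vapply() is shown as one possible programmer-safe counterpart to sapply().

```r
# drop = TRUE is the default, so selecting one row silently loses the
# matrix structure.
m <- matrix(1:6, nrow = 2)
one_row  <- m[1, ]                  # plain vector
kept_row <- m[1, , drop = FALSE]    # still a matrix

# sapply() changes its return type with the shape of the results.
same_len <- sapply(1:3, function(i) i * 2)   # numeric vector
diff_len <- sapply(1:3, seq_len)             # list
empty    <- sapply(list(), length)           # empty list
# vapply() fixes the result type up front, even for empty input.
safe_empty <- vapply(list(), length, integer(1))

# Partial matching: 'val' is silently matched to the argument 'value'.
f <- function(value, verbose = FALSE) value
partial <- f(val = 10)
```

Each of these is handy at the prompt and a source of surprises inside a function that has to cope with arbitrary input.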

I don't understand why you're getting so much push-back on this issue.
 R is a fantastic language, but it has some genuinely nasty corners.
In my opinion, this is one of them.

I think your analysis is correct, that the goals of casual use and
programming are inconsistent. But in general I think there's always
going to be support for providing alternative ways that are
programmer-safe.

For instance, library(foo, character.only=TRUE) says that foo is a
character vector, not the name of a package. I don't know of anything
that subset() provides that is not available in other ways (I think of
it as purely a convenience function, and my first piece of advice to
Karl was not to use it). However, if there really is something there,
then it would be worthwhile pointing that out, and either modifying
subset() to make it safe, or providing an alternative function.
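
As a rough illustration of the programmer-safe routes mentioned here, the sketch below uses character.only=TRUE with library() and plain `[` indexing in place of subset(). The helper select_columns and the variable names are hypothetical, chosen only for the example.

```r
# The package name is held in a variable, so it must be flagged as a string.
pkg <- "stats"
library(pkg, character.only = TRUE)

# Hypothetical helper: cols is always treated as a value, never as a symbol.
select_columns <- function(data, cols) {
  data[, cols, drop = FALSE]
}

d <- data.frame(a = 1, b = 2)
a <- c("a", "b")

both <- select_columns(d, a)     # uses the value of a: columns a and b
one  <- select_columns(d, "a")   # the string "a": column a only
```

Because cols is an ordinary argument, the result does not depend on whether a variable name happens to coincide with a column name.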

I think this tension is a fundamental part of the character of S and R.
But it is also fundamental to R that there are QC tests that apply to
code in packages: so writing new tests that detect dangerous usage
(e.g. to disallow partial name matching) would be another way to improve
reliability. Writing a test for misuse of drop=TRUE seems quite hard,
but there are probably ways a debugger could do it: e.g. to tag the
invocation as to whether any indices were dropped on the first call, and
then warn if the result isn't the same on every subsequent call.
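
One existing hook in this direction is the warnPartialMatchArgs option, which makes R warn when an argument name is matched partially. The sketch below only shows that the warning can be caught at run time; it is not a package QC test, and f is a made-up function.

```r
f <- function(value) value

# Turn the warning on, remembering the previous setting.
old <- options(warnPartialMatchArgs = TRUE)

# 'val' partially matches 'value'; with the option set this raises a
# warning, which is captured here instead of being printed.
msg <- tryCatch(f(val = 1), warning = function(w) conditionMessage(w))

options(old)  # restore the previous setting
```

If no warning were raised, msg would simply hold the return value 1 rather than a message string.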

Conceivably Karl's problem could be detected in the same way: tag each
name in the expression as to whether it was found in the data frame or
some other environment, and then warn if that tag ever changes. Or
maybe the test should just warn that subset() is a convenience function,
not meant for programming.
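
A very rough sketch of the tagging idea, assuming it is enough to classify each variable name by where it would be found. The function name_origins is hypothetical and is not part of R or of any existing check.

```r
# Hypothetical helper: for each variable name in an unevaluated expression,
# report whether it would be found in the data frame or in the environment.
name_origins <- function(expr, data, env = parent.frame()) {
  vars <- all.vars(expr)   # variable names only; function names are skipped
  vapply(vars, function(v) {
    if (v %in% names(data)) {
      "data"
    } else if (exists(v, envir = env)) {
      "environment"
    } else {
      "missing"
    }
  }, character(1))
}

d <- data.frame(a = 1, b = 2)
z <- c("a", "b")

tag_z <- name_origins(quote(z), d)   # z comes from the environment
tag_a <- name_origins(quote(a), d)   # a comes from the data frame
```

A checker could record these tags on the first call and warn when a later call with different data produces a different tag for the same name.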


Duncan Murdoch

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
