Re: [R] two questions for R beginners

Duncan Murdoch Tue, 02 Mar 2010 04:30:14 -0800

John Sorkin wrote:

Please take what follows not as an ad hominem statement, but rather as an 
attempt to improve what is already an excellent program, that has been built as 
a result of many, many hours of dedicated work by many, many unpaid, unsung 
volunteers.
It troubles me a bit that when a confusing aspect of R is pointed out, the response is not to try to improve the language so as to avoid the confusion, but rather to state that the confusion is inherent in the language. I understand that to make changes that would avoid the confusing aspect of the language that has been discussed in this thread would take time and effort by an R wizard (which I am not), time and effort that would not be compensated in the traditional sense. This does not mean that we should not acknowledge the confusion. If we want R to be the de facto lingua franca of statistical analysis, doesn't it make sense to strive for syntax that is as straightforward and consistent as possible?

I think you've misunderstood the argument. It would not be hard to make the suggested change. I don't object to it because it would be too much work; I object to it because I think it is not an improvement. Dataframes and matrices are different, and there is no way to avoid that fact.

The arguments in favour of the change seem to be these:

- Dataframes and matrices are similar in some respects, so they should be similar in more.

In fact, I believe that the source of confusion is the fact that they are similar, so this would not improve things. People would still be confused by the differences, which are unavoidable.


- Using $ to extract a column of a matrix would be convenient.

I agree, it saves 4 keystrokes to type X$column instead of X[,"column"]. But I think it increases confusion, so the savings are not worthwhile. For example, the col2rgb function returns a matrix with rows named red, green and blue. But under your proposal, I'd still need to use X["red",] to extract the red component, because columns are components, but rows are not. You are complaining that the lack of $ for matrices is an unnecessary asymmetry, and unnecessary asymmetries are confusing. But your proposal introduces a new one!
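The col2rgb point can be tried directly. This is only a small sketch: the colour names and the column labels "first" and "second" are illustrative choices, not something from the thread. It shows that rows are reached by name with X["red", ], and that $ is not available on a matrix in R as it stands.

```r
# col2rgb() returns a matrix whose rows are named red, green and blue.
# The column labels "first" and "second" are made up for this example.
X <- col2rgb(c(first = "red", second = "blue"))

is_mat    <- is.matrix(X)
row_names <- rownames(X)

# Rows are extracted by name with matrix indexing.
red_by_row     <- X["red", ]
blue_of_second <- X["blue", "second"]

# `$` is not defined for matrices (atomic vectors), so this signals an error,
# which is caught here and replaced by a marker string.
dollar_result <- tryCatch(X$red, error = function(e) "error")
```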


 - Some functions return matrices when I expect a dataframe, or vice versa.

That will continue to be true regardless of whether the proposed change is made. You need to read the documentation. If it is unclear, it should be improved; the language shouldn't be changed so that sloppy documentation is accurate.


 - You suggested this so anyone who disagrees must be lazy.

Which really is an ad hominem argument, despite your disclaimer. I think you should respect the fact that there are people who disagree with the value of your suggestion. (Which is also an ad hominem attack, but isn't central to my argument.)


Duncan Murdoch

Again, please understand that my comment is made with deepest respect for the 
many people who have unselfishly contributed to the R project. Many thanks to 
each and every one of you.

John
Karl Ove Hufthammer <[email protected]> 3/2/2010 4:00 AM >>>
On Mon, 01 Mar 2010 10:00:07 -0500 Duncan Murdoch <[email protected]> wrote:
Suppose X is a dataframe or a matrix. What would you expect to get from X[1]? What about as.vector(X), or as.numeric(X)?
All this of course depends on the type of object one is speaking of. There are plenty of surprises available, and it's best to use the most logical way of extracting. E.g., to extract the top-left element of a 2D structure (data frame or matrix), use 'X[1,1]'.
Luckily, R provides some shortcuts. For example, you can write 'X[2,3]' on a data frame, just as if it was a matrix, even though the underlying structure is completely different. (This doesn't work on a normal list; there you have to type the whole 'X[[3]][2]', i.e. the third component first and then its second element.)
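A minimal sketch of these indexing differences, using made-up data (the names df, m and l and the column names x, y, z are assumptions for illustration). Single-bracket X[1] behaves differently on the two types, while two-index access to row 2, column 3 gives the same value on both; on a plain list the same cell is the second element of the third component.

```r
# Illustrative data: the same numbers as a data frame, a matrix and a list.
df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6), z = c(7, 8, 9))
m  <- as.matrix(df)
l  <- as.list(df)

df_first <- df[1]   # list-style indexing: a one-column data frame
m_first  <- m[1]    # vector-style indexing: the first number in the matrix

df_cell <- df[2, 3]     # row 2, column 3 of the data frame
m_cell  <- m[2, 3]      # row 2, column 3 of the matrix
l_cell  <- l[[3]][2]    # third component of the list, then its second element
```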
The behaviour of the 'as.' functions may sometimes be surprising, at least for me. For example, 'as.data.frame' on a named vector gives a single-column data frame, instead of a single-row data frame.
(I'm not sure what's the recommended way of converting a named vector to a row data frame, but 'as.data.frame(t(X))' works, even though both 'X' and 't(X)' look like a row of numbers.)
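The two conversions can be compared side by side. This is a small sketch with an invented named vector v; it is not claimed to be the recommended idiom, only an illustration of the shapes that result.

```r
# An illustrative named vector.
v <- c(a = 1, b = 2, c = 3)

as_col <- as.data.frame(v)      # one column, with the names as row names
as_row <- as.data.frame(t(v))   # t(v) is a 1 x 3 matrix, so one row results

col_dim <- dim(as_col)
row_dim <- dim(as_row)
row_names_of_col <- rownames(as_col)
col_names_of_row <- names(as_row)
```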
The point is that a dataframe is a list, and a matrix isn't. If users don't understand that, then they'll be confused somewhere. Making matrices more list-like in one respect will just move the confusion elsewhere. The solution is to understand the difference.
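The list-versus-matrix distinction is easy to check at the prompt. A brief sketch with illustrative objects m and df (assumed names, same contents in both):

```r
# The same 2 x 3 values as a matrix and as a data frame.
m  <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("a", "b", "c")))
df <- as.data.frame(m)

df_is_list <- is.list(df)   # a data frame is a list of columns
m_is_list  <- is.list(m)    # a matrix is an atomic vector with dimensions

df_length <- length(df)     # number of columns (list components)
m_length  <- length(m)      # number of cells

col_a <- df$a               # list-style extraction works on the data frame
```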
My main problem is not understanding the difference, which is easy, but knowing which type of object I have when I get the output of a function in a package. If I know the object is a named vector or a matrix with column names, it's easy enough to type 'X[,"colname"]', and if it's a data frame one may use the shortcut 'X$colname'.
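One way to find out which type a function has returned is to ask R directly. The sketch below uses a hypothetical helper, kind_of(), which is not part of R or of any package; it also shows that the two-index form X[, "colname"] gives the same column for a matrix and for a data frame, so it is a safe choice when the type is uncertain.

```r
# Illustrative objects with the same contents.
m  <- cbind(a = c(1, 2, 3), b = c(4, 5, 6))
df <- as.data.frame(m)

# Hypothetical helper (not part of R): report which 2D type we were handed.
kind_of <- function(X) {
  if (is.data.frame(X)) {
    "data frame"
  } else if (is.matrix(X)) {
    "matrix"
  } else {
    class(X)[1]
  }
}

kind_m  <- kind_of(m)
kind_df <- kind_of(df)

# The two-index form works on both types.
col_m  <- m[, "b"]
col_df <- df[, "b"]
```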
Usually, it *is* documented what the return value of a function is, but just looking at the output is much faster, and *usually* gives the correct answer.
For example, 'mean' applied to a data frame gives a named vector, not a data frame, which is somewhat surprising (given that the columns of a data frame may be of different types, while the elements of a vector may not). (And yes, I know that it's *documented* that it returns a named vector.) On the other hand, perhaps it is surprising that 'mean' works on data frames at all. :-)
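The named-vector result can be illustrated with column-wise means. Note the assumption here: as far as I know, later versions of R no longer provide a data frame method for 'mean', so this sketch uses colMeans() and sapply() instead, both of which return a named numeric vector rather than a data frame. The data frame df is made up for the example.

```r
# Illustrative data frame with two numeric columns.
df <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))

col_means  <- colMeans(df)       # named numeric vector, one entry per column
same_means <- sapply(df, mean)   # the same values, computed column by column

result_is_df <- is.data.frame(col_means)
result_names <- names(col_means)
```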


