Since R is object-oriented, the set operations on data frames should be the
ones natural to that class.  There are, I suppose, two natural
interpretations: column-wise (variable-wise) and row-wise (observation-wise).
The row-wise one seems more natural and more useful to me.

The current implementation is column-wise, though it is inconsistent in its
return class (the man page defines return modes, but is silent on return
classes):

> class(union(df1,df2))
[1] "list"
> class(intersect(df1,df2))
[1] "data.frame"
> class(setdiff(df1,df2))
[1] "data.frame"

Unlike other cases, I don't think this inconsistency brings any user
convenience (though it may reflect programmer convenience).

The column-wise interpretation makes sense in cases where variables with the
same vector value (ignoring the variable name) can be considered redundant.
I suppose there are cases where that could be useful, though it does seem
hazardous.
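
To make the hazard concrete, here is a minimal sketch of the column-wise
reading, using only base R.  The data frame and its column names are made up
for illustration; this is not what union() or intersect() do internally.

```r
# Toy data (hypothetical): 'a' and 'b' are different variables that
# happen to hold identical values; 'c' holds different values.
df <- data.frame(a = 1:3, b = 1:3, c = 3:1)

# Treat the data frame as a list of column vectors and flag any column
# whose values repeat an earlier column, ignoring the column names.
dup <- duplicated(as.list(df))

# Keep only the first column of each group of identical columns.
deduped <- df[!dup]
print(names(deduped))
```

Here 'b' is dropped purely because its values coincide with those of 'a',
even though the two might measure unrelated things.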

The row-wise interpretation makes sense in cases where observations with the
same values for all variables can be considered redundant.  That seems to me
a much more useful interpretation.  The union, intersection, and set
difference of two sets of observations would all seem to be highly useful.
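
As a rough sketch of what the row-wise operations could look like in base R:
the helper names row_key, row_union, row_intersect and row_setdiff are
hypothetical, not existing functions.  The sketch assumes both data frames
have the same columns in the same order, and it compares rows through a
pasted string key, so values that print identically (say 1 and "1") are
treated as equal.

```r
# Hypothetical helper: one string key per row, built by pasting the
# column values together with a separator unlikely to occur in the data.
row_key <- function(df) do.call(paste, c(unname(as.list(df)), sep = "\r"))

# Drop repeated rows and reset row names, so the result behaves like a set.
as_row_set <- function(df) {
  out <- unique(df)
  rownames(out) <- NULL
  out
}

# Row-wise union: every distinct observation found in either data frame.
row_union <- function(x, y) as_row_set(rbind(x, y))

# Row-wise intersection: distinct observations of x that also occur in y.
row_intersect <- function(x, y) {
  as_row_set(x[row_key(x) %in% row_key(y), , drop = FALSE])
}

# Row-wise set difference: distinct observations of x that do not occur in y.
row_setdiff <- function(x, y) {
  as_row_set(x[!(row_key(x) %in% row_key(y)), , drop = FALSE])
}

# Made-up example data.
df1 <- data.frame(id = c(1, 2, 2, 3), g = c("a", "b", "b", "c"))
df2 <- data.frame(id = c(2, 3, 4), g = c("b", "x", "d"))

print(row_union(df1, df2))
print(row_intersect(df1, df2))
print(row_setdiff(df1, df2))
```

All three return a data frame, which would also avoid the return-class
inconsistency noted above.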

              -s

On Sat, May 30, 2009 at 10:21 AM, G. Jay Kerns <gke...@ysu.edu> wrote:

> On Sat, May 30, 2009 at 8:50 AM, Stavros Macrakis <macra...@alum.mit.edu>
> wrote:
> > It seems to me that, abstractly, a dataframe is just as
> > straightforwardly a sequence of tuples/observations as a vector is a
> > sequence of scalars. R's convention is that a 1-vector represents a
> > scalar, and similarly, a 1-dataframe can represent a tuple (though it
> > can also be represented as a list). Of course, a dataframe can *also*
> > be interpreted as a list of vectors.
> >
> > Just as a sequence of scalars can be interpreted as a set of scalars
> > by the order- and repetition-ignoring homomorphism, so can a sequence
> > of tuples. It seems to me natural that set operations should follow
> > that interpretation.
> >
> >          -s
>
>
> After a good night's sleep, the documentation says clearly that
> setdiff() operates on two vectors (of the same mode), so my message
> would be an example of "garbage in, garbage out".
>
> It would be nice if there were an error thrown, but surely there are
> more mission critical problems than this one.
>
> Thanks anyway.
> Jay
>

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel