On 3/10/19 2:36 PM, David Goldsmith wrote:
Hi! Newbie (self-)learning R using P. Dalgaard's "Intro Stats w/ R"; not
new to statistics (have had grad-level courses and work experience in
statistics) or vectorized programming syntax (have extensive experience
with MatLab, Python/NumPy, and IDL, and even a smidgen--a long time ago--of
experience w/ S-plus).
In exploring the use of is.na in the context of logical indexing, I've come
across the following puzzling-to-me result:
y; !is.na(y[1:3]); y[!is.na(y[1:3])]
[1] 0.3534253 -1.6731597 NA -0.2079209
[1] TRUE TRUE FALSE
[1] 0.3534253 -1.6731597 -0.2079209
As you can see, y is a four element vector, the third element of which is
NA; the next line gives what I would expect--T T F--because the first two
elements are not NA but the third element is. The third line is what
confuses me: why is the result not the two element vector consisting of
simply the first two elements of the vector (or, if vectorized indexing in
R is implemented to return a vector the same length as the logical index
vector, which appears to be the case, at least the first two elements and
then either NA or NaN in the third slot, where the logical indexing vector
is FALSE): why does the implementation "go looking" for an element whose
index in the "original" vector, 4, is larger than BOTH the largest index
specified in the inner-most subsetting index AND the size of the resulting
indexing vector? (Note: at first I didn't even understand why the result
wasn't simply
0.3534253 -1.6731597 NA
but then I realized that the third logical index being FALSE, there was no
reason for *any* element to be there; but if there is, due to some
overriding rule regarding the length of the result relative to the length
of the indexer, shouldn't it revert back to *something* that indicates the
"FALSE"ness of that indexing element?)
Thanks!
It happens because R is eco-conscious and re-cycles. :-)
Try:
ok <- c(TRUE,TRUE,FALSE)
(1:4)[ok]
In general in R if there is an operation involving two vectors then
the shorter one gets recycled to provide sufficiently many entries to
match those of the longer vector.
Thus in the foregoing example the first entry of "ok" gets used again,
to make a length 4 vector to match up with 1:4. The result is the same
as (1:4)[c(TRUE,TRUE,FALSE,TRUE)].
If you did (1:7)[ok] you'd get the same result as that from
(1:7)[c(TRUE,TRUE,FALSE,TRUE,TRUE,FALSE,TRUE)] i.e. "ok" gets
recycled 2 and 1/3 times.
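
The recycling of a short logical index can be spelled out with rep_len(),
which repeats a vector out to a given length. A minimal sketch, using the
same "ok" as above (the names idx4 and idx7 are only for illustration):

```r
ok <- c(TRUE, TRUE, FALSE)

# Spell out what the recycled index looks like for vectors of length 4 and 7.
idx4 <- rep_len(ok, 4)   # TRUE TRUE FALSE TRUE
idx7 <- rep_len(ok, 7)   # TRUE TRUE FALSE TRUE TRUE FALSE TRUE

# Indexing with the short vector gives the same result as indexing
# with the explicitly recycled one.
identical((1:4)[ok], (1:4)[idx4])
identical((1:7)[ok], (1:7)[idx7])

(1:4)[ok]   # 1 2 4
(1:7)[ok]   # 1 2 4 5 7
```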
Try 10*(1:3) + 1:4, 10*(1:3) + 1:5, 10*(1:3) + 1:6 .
Note that in the first two instances you get warnings, but in the third
you don't, since 6 is an integer multiple of 3.
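
If you would rather check the warnings programmatically than read them off
the console, something like the following should do. It is only a sketch;
"warns" is a made-up helper name, not a base R function.

```r
x <- 10 * (1:3)   # 10 20 30

# Hypothetical helper: TRUE if evaluating 'expr' raises a warning.
warns <- function(expr) {
  tryCatch({ expr; FALSE }, warning = function(w) TRUE)
}

warns(x + 1:4)   # lengths 3 and 4: partial recycling
warns(x + 1:5)   # lengths 3 and 5: partial recycling
warns(x + 1:6)   # lengths 3 and 6: x is used exactly twice

# The clean case, where 6 is a multiple of 3:
x + 1:6          # 11 22 33 14 25 36
```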
Why aren't there warnings when logical indexing is used? I guess
because it would be annoying. Maybe.
Note that integer indices behave differently: they are not recycled at
all, since each one simply picks out the element at that position. So
(1:4)[1:3] just (sensibly) gives
[1] 1 2 3
and *not*
[1] 1 2 3 1
Only logical index vectors get stretched to the length of the vector
being indexed, which is also what you'd actually *want* here.
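
To see integer and logical indices side by side, here is a small sketch
(the vector name v is arbitrary). An integer index returns one element per
index, whereas a logical index of the same length is stretched to cover
the whole vector:

```r
v <- 1:4

v[1:3]                    # 1 2 3: one element per integer index
v[c(TRUE, TRUE, TRUE)]    # 1 2 3 4: logical index recycled to length 4

# An integer index past the end gives NA rather than wrapping around.
v[1:5]                    # 1 2 3 4 NA
```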
cheers,
Rolf Turner
P.S. If you do
y[1:3][!is.na(y[1:3])]
i.e. if you're careful to match the length of the vector and that of
the indices, you get what you initially expected.
R. T.
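
For concreteness, a sketch of the P.S. using the values of y shown in the
original question (the names keep and first3 are only for illustration):

```r
y <- c(0.3534253, -1.6731597, NA, -0.2079209)

keep <- !is.na(y[1:3])    # TRUE TRUE FALSE

y[keep]                   # length-3 index recycled against length-4 y
y[1:3][keep]              # index and vector both of length 3

# Much the same thing: take the subset first, then drop its NAs.
first3 <- y[1:3]
first3[!is.na(first3)]
```

The first subsetting returns three values (elements 1, 2 and 4), the
other two return just the first two elements.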
P^2.S. To the younger and wiser heads on this list: the help on "["
does not mention that the index vectors can be logical. I couldn't find
anything about logical indexing in the R help files. Is something
missing here, or am I just not looking in the right place?
R. T.
--
Honorary Research Fellow
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.