[Rd] Documentation bug?

2023-02-14 Thread GILLIBERT, Andre via R-devel
Dear R developers,

In R-devel  2023-02-11 and older R versions, there is a note in the "lm 
{stats}" help page specifying that:
> Offsets specified by offset will not be included in predictions by 
> predict.lm, whereas 
> those specified by an offset term in the formula will be.

However, the source code as well as basic tests seem to show that both types of 
offset terms are always used in predictions.
a <- data.frame(off=1:4, outcome=4:1)
mod <- lm(data=a, outcome~1, offset=off)
coef(mod) # intercept is zero
predict(mod) # returns 1:4, which uses the offset
predict(mod, newdata=data.frame(off=c(3,2,5))) # returns c(3,2,5), which uses the new offset
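
The same check can be run for both ways of specifying an offset. This is a minimal sketch reusing the toy data above; the names mod_arg, mod_term and nd are illustrative only, and the result reflects the behaviour described in this message rather than the documented one:

```r
# Toy data: outcome - off has mean zero, so the fitted intercept is zero.
a <- data.frame(off = 1:4, outcome = 4:1)

# Offset given through the 'offset' argument of lm()
mod_arg <- lm(outcome ~ 1, data = a, offset = off)

# Offset given as an offset() term in the formula
mod_term <- lm(outcome ~ 1 + offset(off), data = a)

# New data supplying new offset values
nd <- data.frame(off = c(3, 2, 5))

p_arg <- predict(mod_arg, newdata = nd)
p_term <- predict(mod_term, newdata = nd)

# If both kinds of offset are used in prediction, the two results agree.
print(all.equal(p_arg, p_term))
```

If the note in the help page were accurate, p_arg would ignore the offset and differ from p_term.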

When looking at the history of R source code, this note seems to exist from R 
1.0.0 while the source code of predict.lm already called 
eval(object$call$offset, newdata)
https://github.com/SurajGupta/r-source/blob/1.0.0/src/library/base/R/lm.R
https://github.com/SurajGupta/r-source/blob/1.0.0/src/library/base/man/lm.Rd

Version 0.99.0 did not contain the note, but already had the call to 
eval(object$call$offset, newdata)
https://github.com/SurajGupta/r-source/blob/0.99.0/src/library/base/man/lm.Rd
https://github.com/SurajGupta/r-source/blob/0.99.0/src/library/base/R/lm.R

The actual behavior of R seems to be sane to me, but unless I miss something, 
this looks like a documentation bug.
It seems to have bugged someone before:
https://stackoverflow.com/questions/71264495/why-is-predict-not-ignoring-my-offset-from-a-poisson-model-in-r-no-matter-how-i

Digging deeper in R history, it seems that this note was also found in "glm 
{stats}" in R 1.0.0 but was removed in R 1.4.1. Maybe somebody forgot to remove 
it in "lm {stats}" too.

--
Sincerely
André GILLIBERT


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Inquiry about the behaviour of subsetting and names in matrices

2023-05-03 Thread GILLIBERT, Andre via R-devel

Karolis wrote:
> Hello,

> I have stumbled upon a few cases where the behaviour of naming and subsetting 
> in matrices seems unintuitive.
> All those look related so wanted to put everything in one message.


> 1. Why row/col selection by names with NAs is not allowed?

>   x <- setNames(1:10, letters[1:10])
>   X <- matrix(x, nrow=2, dimnames = list(letters[1:2], LETTERS[1:5]))

>   x[c(1, NA, 3)]   # vector: works and adds "NA"
>   x[c("a", NA, "c")]   # vector: works and adds "NA"
>   X[,c(1, NA, 3)]  # works and selects "NA" column
>   X[,c("A", NA, "C")]  # 

I would state the question the other way: why are NA integer indices allowed?
In my experience, they are sometimes useful but they often delay the detection 
of bugs. However, due to backward compatibility, this feature cannot be 
removed. Adding this feature to character indices would worsen the problem.

I see another reason to keep the current behavior: character indices 
are most often used with column names in contexts where they are unlikely to be 
NA except as a consequence of a bug. In other words, I fear that the 
valid-use-case/bug ratio would be quite poor with this feature.
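
For reference, the four cases from the question can be checked as follows. This is a minimal sketch; the tryCatch() wrapper and the variable name res are additions for illustration, so that the failing case does not stop the script:

```r
x <- setNames(1:10, letters[1:10])
X <- matrix(x, nrow = 2, dimnames = list(letters[1:2], LETTERS[1:5]))

# Vectors: NA is accepted in both integer and character indices
print(x[c(1, NA, 3)])
print(x[c("a", NA, "c")])

# Matrices: NA is accepted in an integer column index ...
print(X[, c(1, NA, 3)])

# ... but an NA in a character column index is an error
res <- tryCatch(X[, c("A", NA, "C")], error = function(e) e)
print(inherits(res, "error"))
```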

> 2. Should setting names() for a matrix be allowed?
>
>   names(X) <- paste0("e", 1:length(X))
>   X["e4"]  # works
>
>   # but any operation on a matrix drops the names
>   X <- X[,-1]  # all names are gone
>   X["e4"]  # 
>
>   Maybe names() should not be allowed on a matrix?

Setting names() on a matrix is a rarely used feature that has practically no 
positive and no negative consequences. I see no incentive to change the 
behavior and break existing code.
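
The behaviour under discussion can be reproduced with a short sketch (X2 is an illustrative name, used so that the original matrix is not overwritten):

```r
x <- setNames(1:10, letters[1:10])
X <- matrix(x, nrow = 2, dimnames = list(letters[1:2], LETTERS[1:5]))

# names() on a matrix is stored next to dim and dimnames
names(X) <- paste0("e", seq_along(X))
print(X["e4"])        # vector-style lookup by name

# Matrix subsetting keeps dim and dimnames but not names
X2 <- X[, -1]
print(names(X2))      # NULL
print(X2["e4"])       # no names left to match, so the result is NA
```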

> 3. Should selection of non-existent dimension names really be an error?
>
>   x[22]   # works on a vector - gives "NA"
>   X[,22]  # 

This is very often a bug on vectors and should not have been allowed on vectors 
in the first place... But for backwards compatibility, it is hard to remove. 
Adding this unsafe feature to matrices is a poor idea in my opinion.

>   A potential useful use-case is matching a smaller matrix to a larger one:

This is a valid use-case, but in my opinion, it adds more problems than it 
solves.
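
When the matching use-case is really wanted, it can be written explicitly with match(), which makes the intent visible. This is only a sketch: the column names in wanted and the variable names idx and aligned are hypothetical.

```r
x <- setNames(1:10, letters[1:10])
X <- matrix(x, nrow = 2, dimnames = list(letters[1:2], LETTERS[1:5]))

print(x[22])  # out of range on a vector: NA

# Out of range on a matrix: an error, captured here as a message
res <- tryCatch(X[, 22], error = function(e) conditionMessage(e))
print(res)

# Explicit alignment: match() returns NA for unknown names, and NA integer
# indices are accepted by matrix subsetting, giving an all-NA column
wanted <- c("A", "C", "Z")            # "Z" does not exist in X
idx <- match(wanted, colnames(X))
aligned <- X[, idx, drop = FALSE]
colnames(aligned) <- wanted
print(aligned)
```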

> These also don't seem to be documented in '[', 'names', 'rownames'.

Indeed, the documentation of '[' seems to be unclear on indices out of range. 
It can be improved.

> Interested if there are specific reasons for this behaviour, or could these 
> potentially be adjusted?

In my opinion adding these features would improve the consistency of R but 
would add more sources of bugs in an already unsafe language.

Sincerely
André GILLIBERT


Re: [Rd] Inquiry about the behaviour of subsetting and names in matrices

2023-05-03 Thread GILLIBERT, Andre via R-devel

Karolis K wrote:

> This is more an inconsistency between vectors and matrices.

> In vectors both numeric and character sub-setting works with NAs.

> In matrices only numeric and not character sub-setting works with NAs.

> Potentially this in itself can also be a source of bugs, or, at least 
> surprises.


Indeed.


Karolis K wrote:

> My original impression was that R was “clever” about the usage of NAs by 
> design, i.e. when you choose an unknown object from a set of objects the 
> result is an object, but nobody knows which - hence NA. Is it really accepted 
> now that such a decision was a mistake and led to bugs in user code?


This makes sense but my personal opinion (I do not speak for R developers, as I 
am not an R developer at all) is that the R language is so "clever" that it 
often becomes unsafe.

Sometimes, this cleverness is handy for fast programming, such as NA 
propagation at many places. Other times, it causes more bugs than it helps, 
such as partial matching for the '$' operator. Indexation of column names in a 
matrix is probably not the place where NA propagation is the most useful, 
although it has its use cases. Consistency may be the main reason to add that 
feature, but I am not sure that this is a major incentive.
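
As an illustration of the partial matching mentioned above, here is a minimal sketch; the list cfg and its element name are made up for the example:

```r
cfg <- list(value = 1)

# '$' on a list matches partially: "val" silently resolves to "value"
print(cfg$val)

# '[[' matches exactly by default, so the same lookup returns NULL
print(cfg[["val"]])

# Partial matching with '[[' has to be requested explicitly
print(cfg[["val", exact = FALSE]])
```

A typo in an element name can therefore go unnoticed with '$' as long as it is a unique prefix of an existing name.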


Of course, the opinion of R developers would be more useful than my own 
personal views.


--

Sincerely

André GILLIBERT


From: Karolis Koncevičius
Sent: Wednesday, 3 May 2023 11:08:28
To: GILLIBERT, Andre
Cc: r-devel@r-project.org
Subject: Re: [Rd] Inquiry about the behaviour of subsetting and names in matrices




Thank you for such a quick reply, here are some points that I think might have 
been missed:

> I would state the question the other way: why are NA integer indices allowed?
> In my experience, they are sometimes useful but they often delay the detection 
> of bugs. However, due to backward compatibility, this feature cannot be 
> removed. Adding this feature to character indices would worsen the problem.

But please also note that character indices with NA are allowed for vectors. 
This is more an inconsistency between vectors and matrices. In vectors both 
numeric and character sub-setting works with NAs. In matrices only numeric and 
not character sub-setting works with NAs. Potentially this in itself can also 
be a source of bugs, or, at least surprises.

> Setting names() on a matrix is a rarely used feature that has practically no 
> positive and no negative consequences. I see no incentive to change the 
> behavior and break existing code.

When writing this message I had the opposite opinion: that this 2nd point is 
one of the most bug-prone points of all 3, as I would assume most users setting 
names() on a matrix would only do it by accident.

> In my opinion adding these features would improve the consistency of R but 
> would add more sources of bugs in an already unsafe language.

I think this maybe is a crux of the thing.

My original impression was that R was “clever” about the usage of NAs by 
design, i.e. when you choose an unknown object from a set of objects the result 
is an object, but nobody knows which - hence NA. Is it really accepted now that 
such a decision was a mistake and led to bugs in user code?

Kind regards,
Karolis K.


Re: [Rd] Time to revisit ifelse ?

2025-08-01 Thread GILLIBERT, Andre via R-devel
Martin Maechler  wrote:
> I don't mind putting together a minimal package with some prototypes, tests,
> comparisons, etc.  But perhaps we should aim for consensus on a few issues
> beforehand.  (Sorry if these have been discussed to death already elsewhere.
> In that case, links to relevant threads would be helpful ...)
>
>  1. Should the type and class attribute of the return value be exactly the
> type and class attribute of c(yes[0L], no[0L]), independent of 'test'?
> Or something else?
>
>  2. What should be the attributes of the return value (other than 
> 'class')?
>
> base::ifelse keeps attributes(test) if 'test' is atomic, which seems
> like desirable behaviour, though dplyr and data.table seem to think
> otherwise:

In my experience, base::ifelse keeping attributes of 'test' is useful for names.
It may also be useful for dimensions, but for other attributes, it may be a 
dangerous feature.
Otherwise, attributes of c(yes, no) should be mostly preserved in my opinion.
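
A small sketch of the current base::ifelse behaviour referred to here (the variable names test and res are illustrative): attributes of 'test' such as names are kept, while the class of 'yes' and 'no' is not.

```r
test <- c(a = TRUE, b = FALSE)

res <- ifelse(test, as.Date("2025-04-01"), as.Date("2020-07-05"))

print(names(res))           # names come from 'test'
print(inherits(res, "Date")) # the Date class of 'yes'/'no' is lost
print(res)                  # plain numbers (days since 1970-01-01)
```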

> 3. Should the new function be stricter and/or more verbose?  E.g., should
> it signal a condition if length(yes) or length(no) is not equal to 1
> nor length(test)?

To be consistent with base R, it should warn if length(yes), length(no) and 
length(test) are not divisors of the longest, otherwise silently repeat the 
three vectors to get the same sizes.
This would work consistently with mathematical operators such as test+yes+no.

In my personal experience, the truncation of 'yes' and 'no' to length(test) is 
the most dangerous feature of ifelse().
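
The contrast between the two recycling behaviours can be seen in a short sketch (short and msg are illustrative names):

```r
# base::ifelse silently truncates 'yes' to the length of 'test'
short <- ifelse(TRUE, c(10, 20, 30), 0)
print(short)   # a single value; 20 and 30 are dropped without any warning

# Arithmetic operators recycle to the longest operand and warn when the
# lengths are not multiples of each other
msg <- tryCatch(1:3 + 1:2, warning = function(w) conditionMessage(w))
print(msg)
```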

>  4. Should the most common case, in which neither 'yes' nor 'no' has a
> 'class' attribute, be handled in C?  The remaining cases might rely on
> method dispatch and thus require a separate "generic" implementation in
> R.  How much faster/more efficient would the C implementation have to
> be to justify the cost (more maintenance for R-core, more obfuscation
> for the average user)?

If the function is not much slower than today's ifelse(), it is not worth 
rewriting in C in my opinion.

Thank you for an implementation!
A few examples of misbehaviors (in my opinion):

> ifelse2(c(a=TRUE), factor("a"), factor("b")) 
Error in as.character.factor(x) : malformed factor

> ifelse2(TRUE, factor(c("a","b")), factor(c("b","a")))
[1] a
Levels: a b

I would expect this one to output
[1] a b
Levels: a b

I tried to develop a function that behaves like mathematical operators (e.g. 
test+yes+no) for length & dimensions coercion rules.
Please, find the function and a few tests below:

ifelse2 <- function (test, yes, no) {
    # force evaluation of the arguments in order
    test
    yes
    no

    # coerce 'test' to logical; storage.mode<- keeps its attributes
    if (is.atomic(test)) {
        if (!is.logical(test))
            storage.mode(test) <- "logical"
    } else {
        test <- if (isS4(test)) methods::as(test, "logical") else as.logical(test)
    }

    ntest <- length(test)
    nyes <- length(yes)
    nno <- length(no)

    nn <- c(ntest, nyes, nno)
    nans <- max(nn)

    # result has the type and class of c(yes, no), initially filled with NA
    ans <- rep(c(yes[0L], no[0L]), length.out = nans)

    # check dimension consistency for arrays
    has.dim <- FALSE
    if (length(dim(test)) || length(dim(yes)) || length(dim(no))) {
        lparams <- list(test, yes, no)
        ldims <- lapply(lparams, dim)
        ldims <- ldims[!sapply(ldims, is.null)]
        ldimnames <- lapply(lparams, dimnames)
        ldimnames <- ldimnames[!sapply(ldimnames, is.null)]

        rdim <- ldims[[1L]]
        # dimnames are optional: avoid an out-of-bounds error when none are set
        rdimnames <- if (length(ldimnames)) ldimnames[[1L]] else NULL
        for (d in ldims) {
            if (!identical(d, rdim)) {
                stop(gettext("non-conformable arrays"))
            }
        }
        has.dim <- TRUE
    }

    if (any(nans %% nn)) {
        warning(gettext("longer object length is not a multiple of shorter object length"))
    }

    # recycle all three arguments to the common length
    if (ntest != nans) {test <- rep(test, length.out = nans)}
    if (nyes != nans) {yes <- rep(yes, length.out = nans)}
    if (nno != nans) {no <- rep(no, length.out = nans)}

    idx <- which(test)
    ans[idx] <- yes[idx]

    idx <- which(!test)
    ans[idx] <- no[idx]

    if (has.dim) {
        dim(ans) <- rdim
        dimnames(ans) <- rdimnames
    }

    if (!is.null(names(test))) {
        names(ans) <- names(test)
    }

    ans
}
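
The key step in the function above is building the result from the zero-length prototype c(yes[0L], no[0L]). A minimal sketch of that idea in isolation, with made-up inputs (yes, no, proto and ans are illustrative names):

```r
yes <- as.Date("2025-04-01")
no <- as.Date(c("2020-07-05", "2022-07-05"))

# Combining zero-length pieces gives an empty vector that already has the
# type and class that c(yes, no) would have
proto <- c(yes[0L], no[0L])
print(class(proto))

# Extending it to the target length fills it with NA of that class
ans <- rep(proto, length.out = 3L)
print(ans)

# With mixed types, the usual coercion rules of c() decide the result type
print(typeof(c(1[0L], "a"[0L])))
```

Because the class is settled before any value is assigned, the later ans[idx] <- yes[idx] assignments go through the class's own replacement method where one exists.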


ifelse2(c(alpha=TRUE,beta=TRUE,gamma=FALSE),factor(c("A","B","C","X")),factor(c("A","B","C","D")))
ifelse2(c(TRUE,FALSE), as.Date("2025-04-01"), c("2020-07-05", "2022-07-05"))
ifelse2(c(a=TRUE, b=FALSE,c=TRUE,d=TRUE), list(42), list(40,45))
ifelse2(rbind(alpha=c(a=TRUE, b=FALSE),beta=c(c=TRUE,d=FALSE)), list(1:10), 
list(2:20,3:30))
a=rbind(alpha=c(a=TRUE, b=FALSE),beta=c(TRUE,TRUE))
b=rbind(ALPHA=c(A=TRUE, B