On 4/28/21 5:22 PM, Martin Maechler wrote:
Toby Hocking on Wed, 28 Apr 2021 07:21:05 -0700 writes:
> Hi Tomas, thanks for the thoughtful reply. That makes sense about the
> problems with C locale on windows. Actually I did not choose to use C
> locale, but instead it was invoked automatically during a package check.
> To be clear, I do NOT have a file with that name, but I do want file.exists
> to return a reasonable value, FALSE (with no error). If that behavior is
> unspecified, then should I use something like tryCatch(file.exists(x),
> error=function(e)FALSE) instead of assuming that file.exists will always
> return a logical vector without error? For my particular application that
> work-around should probably be sufficient, but one may imagine a situation
> where you want to do
> x <- "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n| \360\237\247\222\360\237\217\274\n| \360\237\247\222\360\237\217\275\n| \360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n"
> Encoding(x) <- "unknown"
> Sys.setlocale(locale="C")
> f <- tempfile()
> cat("", file = f)
> two <- c(x, f)
> file.exists(two)
> and in that case the correct response from R, in my opinion, would be
> c(FALSE, TRUE) -- not an error.
> Toby
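
The work-around Toby describes can also be applied per element, so that one untranslatable name does not discard the answers for the other names. A minimal sketch in R, assuming errors should map to FALSE as in the quoted tryCatch call; `safe_file_exists` is a hypothetical helper name, not part of base R:

```r
# Hypothetical helper (not in base R): check each path on its own so that
# an error raised for one string cannot abort the whole vectorized call.
safe_file_exists <- function(paths) {
  vapply(
    paths,
    function(p) tryCatch(file.exists(p), error = function(e) FALSE),
    logical(1),
    USE.NAMES = FALSE
  )
}

# Usage: one name that does not exist and one freshly created temp file.
f <- tempfile()
cat("", file = f)
missing_name <- file.path(tempdir(), "no-such-file-for-this-example")
res <- safe_file_exists(c(missing_name, f))
print(res)
```

Mapping every error to FALSE hides the difference between "does not exist" and "could not be checked", which is the point the rest of the thread discusses.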
Indeed, thanks a lot to Tomas!
# A remark
We *could* -- and according to my taste should -- try to have file.exists()
return a logical vector in almost all cases, namely, e.g., still give an
error for file.exists(pi) :
Notably if `c(...)` {for the `...` arguments of file.exists() }
is a character vector, always return a logical vector of the same
length, *and* we could notably make use of the fact that R's
logical type is not binary but ternary, and hence that return
value could contain values from {TRUE, NA, FALSE} and interpret NA
as "don't know" in all cases where the corresponding string in
the input had an Encoding(.) that was "fishy" in some sense
given the "context" (OS, locale, OS_version, ICU-presence, ...).
In particular, when the underlying code sees encoding-translation issues
for a string, NA would be returned instead of an error.
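
The ternary idea above can be sketched in user code. This is only an illustration under stated assumptions: `file_exists_na` is a hypothetical name, and the notion of a "fishy" string used here (invalid in its declared encoding according to validEnc(), or unmarked non-ASCII bytes while LC_CTYPE is "C" or "POSIX") is one possible reading of the proposal, not something R implements.

```r
# Hypothetical helper (not in base R): return NA ("don't know") for strings
# whose encoding looks untrustworthy, and ask file.exists() only for the rest.
file_exists_na <- function(paths) {
  stopifnot(is.character(paths))
  # TRUE where the string contains any byte outside the ASCII range.
  has_non_ascii <- vapply(
    paths,
    function(p) any(as.integer(charToRaw(p)) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
  in_c_locale <- Sys.getlocale("LC_CTYPE") %in% c("C", "POSIX")
  unmarked <- Encoding(paths) == "unknown"
  # Assumption: these two conditions stand in for "fishy in the given context".
  fishy <- !validEnc(paths) | (in_c_locale & unmarked & has_non_ascii)
  out <- rep(NA, length(paths))
  ok <- !fishy
  out[ok] <- file.exists(paths[ok])
  out
}

# Usage, following the example from the thread with a shortened string.
x <- "\360\237\247\222"
Encoding(x) <- "unknown"
invisible(Sys.setlocale(locale = "C"))
f <- tempfile()
cat("", file = f)
res <- file_exists_na(c(x, f))
print(res)
```

The check is deliberately conservative: in a UTF-8 or Latin-1 locale, unmarked non-ASCII strings are left to file.exists() as usual.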
Yes, I agree with Toby and you that there is benefit in allowing
per-element, vectorized use of file.exists(), and, well, that is the case
now: we just fall back to FALSE. NA might be better in case of an error
that prevents the function from deciding whether the file exists or not
(e.g. an invalid name in a form that makes it clear such a file cannot exist
might be a different case...).
But, the only way to get a translation error is by passing a string to
file.exists() which is invalid in its declared encoding (or which is in
"C" encoding). I would hope that we could get to the point where such a
situation is prevented (we only allow creation of strings that can be
translated to Unicode). If we get there, the example would fail with an
error (yet, right, before getting to file.exists()).
My point that I would not write tests of this behavior stands. One
should not use such file names, and after the change Toby reported from
ERROR to FALSE, Martin's proposal would change to NA, mine eventually to
ERROR, etc. So it is best for now to leave it unspecified and not
trigger it, I think.
Tomas
Martin
> On Wed, Apr 28, 2021 at 3:10 AM Tomas Kalibera <tomas.kalib...@gmail.com>
> wrote:
>> Hi Toby,
>>
>> a defensive, portable approach would be to use only file names regarded
>> portable by POSIX, so characters including ASCII letters, digits,
>> underscore, dot, hyphen (but hyphen should not be the first character).
>> That would always work on all systems and this is what I would use.
>>
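
The portable character set described above can be checked with a regular expression. A minimal sketch, assuming the check applies to a single file name component rather than a full path (so path separators are rejected); `is_portable_name` is a hypothetical helper name:

```r
# Hypothetical helper (not in base R): TRUE for names made only of ASCII
# letters, digits, underscore, dot and hyphen, with no leading hyphen.
is_portable_name <- function(names) {
  # perl = TRUE so that the A-Z and a-z ranges are not locale-dependent.
  grepl("^[A-Za-z0-9._][A-Za-z0-9._-]*$", names, perl = TRUE)
}

# Usage: a portable name, a leading hyphen, a space, and the empty string.
res <- is_portable_name(c("report_2021.txt", "-draft.txt", "my file.txt", ""))
print(res)
```
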
>> Individual operating systems and file systems and their configurations
>> differ in which additional characters they support and how. On some,
>> file names are just sequences of bytes, on some, they have to be valid
>> strings in certain encoding (and then with certain exceptions).
>>
>> On Windows, file names are at the lowest level in UTF-16LE encoding (and
>> admitting unpaired surrogates for historical reasons). R stores strings
>> in other encodings (UTF-8, native, Latin-1), so file names have to be
>> translated to/from UTF-16LE, either directly by R or by Windows.
>>
>> But, there is no way to convert (non-ASCII) strings in "C" encoding to
>> UTF-16LE, so the examples cannot be made to work on Windows.
>>
>> When the translation is left to Windows, it assumes the non-UTF-16LE
>> strings are in the Active Code Page encoding (shown as "system encoding"
>> in sessionInfo() in R, Latin-1 in your example) instead of the current C
>> library encoding ("C" in your example). So, file names coming from
>> Windows will be either the bytes of their UTF-16LE representation or the
>> bytes of their Latin-1 representation, but which one is subject to the
>> implementation details, so the result is really unusable.
>>
>> I would say using "C" as encoding in R is not a good idea, and
>> particularly not on Windows.
>>
>> I would say that what happens with such file names in "C" encoding is
>> unspecified behavior, which is subject to change at any time without
>> notice, and that both the R 4.0.5 and R-devel behavior you are observing
>> are acceptable. I don't think it should be mentioned in the NEWS.
>> Personally, I would prefer some stricter checks of strings validity and
>> perhaps disallowing the "C" encoding in R, so yet another behavior where
>> it would be clearer that this cannot really work, but that would require
>> more thought and effort.
>>
>> Best
>> Tomas
>>
>>
>> On 4/27/21 9:53 PM, Toby Hocking wrote:
>>
>> > Hi all, Today I noticed bug(s?) in R-4.0.5, which seem to be fixed in
>> > R-devel already. I checked on
>> > https://developer.r-project.org/blosxom.cgi/R-devel/NEWS and there is no
>> > mention of these changes, so I'm wondering if they are intentional? If so,
>> > could someone please add a mention of the bugfix in the NEWS?
>> >
>> > The problem involves file.exists, on windows, when a long/strange input
>> > file name Encoding is unknown, in C locale. I expected that FALSE should be
>> > returned (and it is on R-devel), but I got an error in R-4.0.5. Code to
>> > reproduce is:
>> >
>> > x <- "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n| \360\237\247\222\360\237\217\274\n| \360\237\247\222\360\237\217\275\n| \360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n"
>> > Encoding(x) <- "unknown"
>> > Sys.setlocale(locale="C")
>> > sessionInfo()
>> > file.exists(x)
>> >
>> > Output I got from R-4.0.5 was
>> >
>> >> sessionInfo()
>> > R version 4.0.5 (2021-03-31)
>> > Platform: x86_64-w64-mingw32/x64 (64-bit)
>> > Running under: Windows 10 x64 (build 19042)
>> >
>> > Matrix products: default
>> >
>> > locale:
>> > [1] C
>> > system code page: 1252
>> >
>> > attached base packages:
>> > [1] stats graphics grDevices utils datasets methods base
>> >
>> > loaded via a namespace (and not attached):
>> > [1] compiler_4.0.5
>> >> file.exists(x)
>> > Error in file.exists(x) : file name conversion problem -- name too long?
>> > Execution halted
>> >
>> > Output I got from R-devel was
>> >
>> >> sessionInfo()
>> > R Under development (unstable) (2021-04-26 r80229)
>> > Platform: x86_64-w64-mingw32/x64 (64-bit)
>> > Running under: Windows 10 x64 (build 19042)
>> >
>> > Matrix products: default
>> >
>> > locale:
>> > [1] C
>> >
>> > attached base packages:
>> > [1] stats graphics grDevices utils datasets methods base
>> >
>> > loaded via a namespace (and not attached):
>> > [1] compiler_4.2.0
>> >> file.exists(x)
>> > [1] FALSE
>> >
>> > I also observed similar results when using normalizePath instead of
>> > file.exists (error in R-4.0.5, no error in R-devel).
>> >
>> >> normalizePath(x) #R-4.0.5
>> > Error in path.expand(path) : unable to translate 'p'
>> > | p'p;
>> > | p'p<
>> > | p'p=
>> > | p'p>
>> > | p'p<bf>
>> > ' to UTF-8
>> > Calls: normalizePath -> path.expand
>> > Execution halted
>> >
>> >> normalizePath(x) #R-devel
>> > [1] "C:\\Users\\th798\\R\\\360\237\247\222\n| \360\237\247\222\360\237\217\273\n| \360\237\247\222\360\237\217\274\n| \360\237\247\222\360\237\217\275\n| \360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n"
>> > Warning message:
>> > In normalizePath(path.expand(path), winslash, mustWork) : path[1]="🧒
>> > | 🧒🏻
>> > | 🧒🏼
>> > | 🧒🏽
>> > | 🧒🏾
>> > | 🧒🏿
>> > ": The filename, directory name, or volume label syntax is incorrect