+1 for Martin's proposal; that makes sense to me too. Tomas' idea to stop immediately with an error when the user tries to create a string that is invalid in its declared encoding also sounds great. I'm just wondering whether it would break my application. My package runs an example during check in which the Unicode/emoji text is read into R using readLines from a file under inst/extdata. Presumably that should keep working as long as readLines handles the encoding correctly, and/or the locale used during package check on Windows is changed to something more reasonable?
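A minimal sketch of that setup, assuming the file under inst/extdata is UTF-8 encoded. The temporary file below only stands in for the packaged file, and safe_file_exists is a hypothetical helper name (not a base R function) for the tryCatch work-around discussed in the quoted thread:

```r
# Stand-in for the file shipped under inst/extdata (assumption: UTF-8 bytes).
# The raw bytes are the UTF-8 encoding of one emoji followed by a newline.
emoji.file <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0xF0, 0x9F, 0xA7, 0x92, 0x0A)), emoji.file)

# Declaring the encoding marks the strings as UTF-8 rather than leaving
# them "unknown"; readLines does not re-encode the input.
emoji.lines <- readLines(emoji.file, encoding = "UTF-8")
Encoding(emoji.lines)

# Hypothetical helper: always return a logical vector, mapping any error
# raised by file.exists() to FALSE for every element.
safe_file_exists <- function(paths) {
  tryCatch(
    file.exists(paths),
    error = function(e) rep(FALSE, length(paths))
  )
}
safe_file_exists(c(emoji.file, file.path(tempdir(), "no-such-file")))
```

Whether strings marked this way survive a check run in the C locale on Windows is exactly the open question above; the sketch only shows where the encoding would be declared.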

On Wed, Apr 28, 2021 at 9:04 AM Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
>
> On 4/28/21 5:22 PM, Martin Maechler wrote:
> >>>>>> Toby Hocking
> >>>>>> on Wed, 28 Apr 2021 07:21:05 -0700 writes:
> >
> > > Hi Tomas, thanks for the thoughtful reply. That makes sense about the
> > > problems with C locale on windows. Actually I did not choose to use C
> > > locale, but instead it was invoked automatically during a package check.
> > > To be clear, I do NOT have a file with that name, but I do want file.exists
> > > to return a reasonable value, FALSE (with no error). If that behavior is
> > > unspecified, then should I use something like tryCatch(file.exists(x),
> > > error=function(e)FALSE) instead of assuming that file.exists will always
> > > return a logical vector without error? For my particular application that
> > > work-around should probably be sufficient, but one may imagine a situation
> > > where you want to do
> >
> > > x <- "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n| \360\237\247\222\360\237\217\274\n| \360\237\247\222\360\237\217\275\n| \360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n"
> > > Encoding(x) <- "unknown"
> > > Sys.setlocale(locale="C")
> > > f <- tempfile()
> > > cat("", file = f)
> > > two <- c(x, f)
> > > file.exists(two)
> >
> > > and in that case the correct response from R, in my opinion, would be
> > > c(FALSE, TRUE) -- not an error.
> > > Toby
> >
> > Indeed, thanks a lot to Tomas!
> >
> > # A remark
> > We *could* -- and according to my taste should -- try to have file.exists()
> > return a logical vector in almost all cases, namely, e.g., still give an
> > error for file.exists(pi) :
> > Notably if `c(...)` {for the `...` arguments of file.exists() }
> > is a character vector, always return a logical vector of the same
> > length, *and* we could notably make use of the fact that R's
> > logical type is not binary but ternary, and hence that return
> > value could contain values from {TRUE, NA, FALSE} and interpret NA
> > as "don't know" in all cases where the corresponding string in
> > the input had an Encoding(.) that was "fishy" in some sense
> > given the "context" (OS, locale, OS_version, ICU-presence, ...).
> >
> > In particular, when the underlying code sees encoding-translation issues
> > for a string, NA would be returned instead of an error.
>
> Yes, I agree with Toby and you that there is benefit in allowing
> per-element, vectorized use of file.exists(), and well it is the case
> now, we just fall back to FALSE. NA might be better in case of an error
> that prevents the function from deciding whether the file exists or not
> (e.g. an invalid name in a form that makes it clear such a file cannot exist
> might be a different case...).
>
> But, the only way to get a translation error is by passing a string to
> file.exists() which is invalid in its declared encoding (or which is in
> "C" encoding). I would hope that we could get to the point where such
> a situation is prevented (we only allow creation of strings that can be
> translated to Unicode). If we get there, the example would fail with
> an error (yet, right, before getting to file.exists()).
>
> My point that I would not write tests of this behavior stands. One
> should not use such file names, and after the change Toby reported from
> ERROR to FALSE, Martin's proposal would change to NA, mine eventually to
> ERROR, etc.
> So it is best for now to leave it unspecified and not
> trigger it, I think.
>
> Tomas
> >
> > Martin
> >
> > > On Wed, Apr 28, 2021 at 3:10 AM Tomas Kalibera <tomas.kalib...@gmail.com>
> > > wrote:
> >
> > >> Hi Toby,
> > >>
> > >> a defensive, portable approach would be to use only file names regarded
> > >> portable by POSIX, so characters including ASCII letters, digits,
> > >> underscore, dot, hyphen (but hyphen should not be the first character).
> > >> That would always work on all systems and this is what I would use.
> > >>
> > >> Individual operating systems and file systems and their configurations
> > >> differ in which additional characters they support and how. On some,
> > >> file names are just sequences of bytes, on some, they have to be valid
> > >> strings in a certain encoding (and then with certain exceptions).
> > >>
> > >> On Windows, file names are at the lowest level in UTF-16LE encoding (and
> > >> admitting unpaired surrogates for historical reasons). R stores strings
> > >> in other encodings (UTF-8, native, Latin-1), so file names have to be
> > >> translated to/from UTF-16LE, either directly by R or by Windows.
> > >>
> > >> But, there is no way to convert (non-ASCII) strings in "C" encoding to
> > >> UTF-16LE, so the examples cannot be made to work on Windows.
> > >>
> > >> When the translation is left to Windows, it assumes the non-UTF-16LE
> > >> strings are in the Active Code Page encoding (shown as "system encoding"
> > >> in sessionInfo() in R, Latin-1 in your example) instead of the current C
> > >> library encoding ("C" in your example). So, file names coming from
> > >> Windows will be either the bytes of their UTF-16LE representation or the
> > >> bytes of their Latin-1 representation, but which one is subject to the
> > >> implementation details, so the result is really unusable.
> > >>
> > >> I would say using "C" as encoding in R is not a good idea, and
> > >> particularly not on Windows.
> > >>
> > >> I would say that what happens with such file names in "C" encoding is
> > >> unspecified behavior, which is subject to change at any time without
> > >> notice, and that both the R 4.0.5 and R-devel behavior you are observing
> > >> are acceptable. I don't think it should be mentioned in the NEWS.
> > >> Personally, I would prefer some stricter checks of strings validity and
> > >> perhaps disallowing the "C" encoding in R, so yet another behavior where
> > >> it would be clearer that this cannot really work, but that would require
> > >> more thought and effort.
> > >>
> > >> Best
> > >> Tomas
> > >>
> > >>
> > >> On 4/27/21 9:53 PM, Toby Hocking wrote:
> > >>
> > >> > Hi all, Today I noticed bug(s?) in R-4.0.5, which seem to be fixed in
> > >> > R-devel already. I checked on
> > >> > https://developer.r-project.org/blosxom.cgi/R-devel/NEWS and there is no
> > >> > mention of these changes, so I'm wondering if they are intentional? If so,
> > >> > could someone please add a mention of the bugfix in the NEWS?
> > >> >
> > >> > The problem involves file.exists, on windows, when a long/strange input
> > >> > file name Encoding is unknown, in C locale. I expected that FALSE should be
> > >> > returned (and it is on R-devel), but I got an error in R-4.0.5.
> > >> > Code to reproduce is:
> > >> >
> > >> > x <- "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n| \360\237\247\222\360\237\217\274\n| \360\237\247\222\360\237\217\275\n| \360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n"
> > >> > Encoding(x) <- "unknown"
> > >> > Sys.setlocale(locale="C")
> > >> > sessionInfo()
> > >> > file.exists(x)
> > >> >
> > >> > Output I got from R-4.0.5 was
> > >> >
> > >> >> sessionInfo()
> > >> > R version 4.0.5 (2021-03-31)
> > >> > Platform: x86_64-w64-mingw32/x64 (64-bit)
> > >> > Running under: Windows 10 x64 (build 19042)
> > >> >
> > >> > Matrix products: default
> > >> >
> > >> > locale:
> > >> > [1] C
> > >> > system code page: 1252
> > >> >
> > >> > attached base packages:
> > >> > [1] stats graphics grDevices utils datasets methods base
> > >> >
> > >> > loaded via a namespace (and not attached):
> > >> > [1] compiler_4.0.5
> > >> >> file.exists(x)
> > >> > Error in file.exists(x) : file name conversion problem -- name too long?
> > >> > Execution halted
> > >> >
> > >> > Output I got from R-devel was
> > >> >
> > >> >> sessionInfo()
> > >> > R Under development (unstable) (2021-04-26 r80229)
> > >> > Platform: x86_64-w64-mingw32/x64 (64-bit)
> > >> > Running under: Windows 10 x64 (build 19042)
> > >> >
> > >> > Matrix products: default
> > >> >
> > >> > locale:
> > >> > [1] C
> > >> >
> > >> > attached base packages:
> > >> > [1] stats graphics grDevices utils datasets methods base
> > >> >
> > >> > loaded via a namespace (and not attached):
> > >> > [1] compiler_4.2.0
> > >> >> file.exists(x)
> > >> > [1] FALSE
> > >> >
> > >> > I also observed similar results when using normalizePath instead of
> > >> > file.exists (error in R-4.0.5, no error in R-devel).
> > >> >
> > >> >> normalizePath(x) #R-4.0.5
> > >> > Error in path.expand(path) : unable to translate 'p'
> > >> > | p'p;
> > >> > | p'p<
> > >> > | p'p=
> > >> > | p'p>
> > >> > | p'p<bf>
> > >> > ' to UTF-8
> > >> > Calls: normalizePath -> path.expand
> > >> > Execution halted
> > >> >
> > >> >> normalizePath(x) #R-devel
> > >> > [1] "C:\\Users\\th798\\R\\\360\237\247\222\n| \360\237\247\222\360\237\217\273\n| \360\237\247\222\360\237\217\274\n| \360\237\247\222\360\237\217\275\n| \360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n"
> > >> > Warning message:
> > >> > In normalizePath(path.expand(path), winslash, mustWork) : path[1]="🧒
> > >> > | 🧒🏻
> > >> > | 🧒🏼
> > >> > | 🧒🏽
> > >> > | 🧒🏾
> > >> > | 🧒🏿
> > >> > ": The filename, directory name, or volume label syntax is incorrect
> > > [[alternative HTML version deleted]]

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel