base::icuSetCollate might be what you need. Its help page (?icuSetCollate) has some decent examples.
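
For what it's worth, here is a rough sketch of the idea, not something I have checked against your test data. The helper name sort_fixed_collation and the choice of "en_US" are just assumptions for illustration; any fixed locale would do. As far as I understand the help page, ICU collation is only available when capabilities("ICU") is TRUE and is not used while the session collates in the C/POSIX locale, and results may still depend on the ICU version R was built against.

```r
# Sketch: sort under a fixed ICU collation locale, then restore the default.
# `sort_fixed_collation` and locale = "en_US" are illustrative assumptions.
sort_fixed_collation <- function(x, locale = "en_US") {
  if (!capabilities("ICU")) {
    # No ICU in this build: fall back to whatever the session collation gives.
    message("this R build lacks ICU; falling back to the session collation")
    return(sort(x))
  }
  icuSetCollate(locale = locale)              # pin the collation rules
  on.exit(icuSetCollate(locale = "default"))  # reset when the function exits
  sort(x)
}

# A few of the non-ASCII characters from the thread, written as escapes so the
# script itself stays ASCII.
x <- enc2utf8(c("\u00e9", "a", "\u00b5", "\u00e0", "\u00e7", "b"))
print(sort_fixed_collation(x))
```

Wrapping the call in a function with on.exit() keeps the change local, so the rest of the session is not left with a modified collation setting.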

On Tue, Jan 19, 2021 at 7:30 AM Thierry Onkelinx via R-devel <r-devel@r-project.org> wrote:
>
> Dear Peter,
>
> Thanks for the feedback on the locale. Is there a better alternative to
> the C locale? One that yields a consistent and stable sorting
> independent of the R version and OS.
>
> Best regards,
>
> Thierry
>
> ir. Thierry Onkelinx
> Statisticus / Statistician
>
> Vlaamse Overheid / Government of Flanders
> INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE AND
> FOREST
> Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
> thierry.onkel...@inbo.be
> Havenlaan 88 bus 73, 1000 Brussel
> www.inbo.be
>
> ///////////////////////////////////////////////////////////////////////////////////////////
> To call in the statistician after the experiment is done may be no more
> than asking him to perform a post-mortem examination: he may be able to say
> what the experiment died of. ~ Sir Ronald Aylmer Fisher
> The plural of anecdote is not data. ~ Roger Brinner
> The combination of some data and an aching desire for an answer does not
> ensure that a reasonable answer can be extracted from a given body of data.
> ~ John Tukey
> ///////////////////////////////////////////////////////////////////////////////////////////
>
> <https://www.inbo.be>
>
>
> On Tue, 19 Jan 2021 at 13:20, Peter Dalgaard <pda...@gmail.com> wrote:
>
> > Not sure what happened between 4.0.2 and -devel, but you are using C
> > collation, which assumes 7-bit single-byte characters, to sort multi-byte
> > 8-bit encoded characters, which looks a bit risky.
> >
> > -pd
> >
> > > On 19 Jan 2021, at 10:10 , Thierry Onkelinx via R-devel <r-devel@r-project.org> wrote:
> > >
> > > Dear all,
> > >
> > > My git2rdata package relies on stable sorting. I've noticed that
> > > some characters get a different position under R-devel on Windows 10.
> > > This is why the unit tests of my package fail only in this combination
> > > (https://cran.r-project.org/web/checks/check_results_git2rdata.html).
> > >
> > > Below is a minimal example to illustrate the problem.
> > >
> > > Best regards,
> > >
> > > Thierry
> > >
> > > data <- readLines(
> > >   "https://raw.githubusercontent.com/ropensci/git2rdata/master/tests/testthat/test_b_special.R",
> > >   encoding = "UTF-8", n = 15)
> > > eval(parse(text = paste(tail(data, -3), collapse = "")))
> > > ds$a <- enc2utf8(ds$a)
> > > print(ds$a) # input
> > > Sys.setlocale(locale = "C")
> > > print(sort(ds$a)) # sorted
> > > print(order(ds$a)) # order
> > > print(sessionInfo())
> > >
> > > # input
> > > ## Win 10 R 4.0.2
> > > [1] "a" "a b" "a\tb" "a\tb\tc" "\ta" "a\t" "a\nb"
> > > [8] "a\nb\nc" "\na" "a\n" "a\"b" "a\"b\"c" "\"b" "a\""
> > > [15] "\"b\"" "a'b" "a'b'c" "'b" "a'" "'b'" "a b c"
> > > [22] "\"NA\"" "'NA'" NA "é" "&" "à" "µ"
> > > [29] "ç" "\200" "|" "#" "@" "$"
> > > ## Win 10 R devel
> > > [1] "a" "a b" "a\tb" "a\tb\tc" "\ta" "a\t" "a\nb"
> > > [8] "a\nb\nc" "\na" "a\n" "a\"b" "a\"b\"c" "\"b" "a\""
> > > [15] "\"b\"" "a'b" "a'b'c" "'b" "a'" "'b'" "a b c"
> > > [22] "\"NA\"" "'NA'" NA "é" "&" "à" "µ"
> > > [29] "ç" "\200" "|" "#" "@" "$"
> > > ## Ubuntu 18.04 R 4.0.3
> > > [1] "a" "a b" "a\tb" "a\tb\tc" "\ta" "a\t" "a\nb"
> > > [8] "a\nb\nc" "\na" "a\n" "a\"b" "a\"b\"c" "\"b" "a\""
> > > [15] "\"b\"" "a'b" "a'b'c" "'b" "a'" "'b'" "a b c"
> > > [22] "\"NA\"" "'NA'" NA "é" "&" "à" "µ"
> > > [29] "ç" "€" "|" "#" "@" "$"
> > >
> > > # sorted
> > > ## Win 10 R 4.0.2
> > > [1] "\ta" "\na" "\"NA\"" "\"b" "\"b\"" "#" "$"
> > > [8] "&" "'NA'" "'b" "'b'" "<U+00B5>" "<U+00E0>" "<U+00E7>"
> > > [15] "<U+00E9>" "<U+20AC>" "@" "a" "a\t" "a\tb" "a\tb\tc"
> > > [22] "a\n" "a\nb" "a\nb\nc" "a b" "a b c" "a\"" "a\"b"
> > > [29] "a\"b\"c" "a'" "a'b" "a'b'c" "|"
> > > ## Win 10 R devel
> > > [1] "\ta" "\na" "\"NA\"" "\"b" "\"b\"" "#" "$"
> > > [8] "&" "'NA'" "'b" "'b'" "@" "a" "a\t"
> > > [15] "a\tb" "a\tb\tc" "a\n" "a\nb" "a\nb\nc" "a b" "a b c"
> > > [22] "a\"" "a\"b" "a\"b\"c" "a'" "a'b" "a'b'c" "|"
> > > [29] "\200" "\265" "\340" "\347" "\351"
> > > ## Ubuntu 18.04 R 4.0.3
> > > [1] "\ta" "\na" "\"NA\"" "\"b" "\"b\"" "#" "$"
> > > [8] "&" "'NA'" "'b" "'b'" "<U+00B5>" "<U+00E0>" "<U+00E7>"
> > > [15] "<U+00E9>" "<U+20AC>" "@" "a" "a\t" "a\tb" "a\tb\tc"
> > > [22] "a\n" "a\nb" "a\nb\nc" "a b" "a b c" "a\"" "a\"b"
> > > [29] "a\"b\"c" "a'" "a'b" "a'b'c" "|"
> > >
> > > # order
> > > ## Win 10 R 4.0.2
> > > [1] 5 9 22 13 15 32 34 26 23 18 20 28 27 29 25 30 33 1 6 3 4 10 7 8 2
> > > [26] 21 14 11 12 19 16 17 31 24
> > > ## Win 10 R devel
> > > [1] 5 9 22 13 15 32 34 26 23 18 20 33 1 6 3 4 10 7 8 2 21 14 11 12 19
> > > [26] 16 17 31 30 28 27 29 25 24
> > > ## Ubuntu 18.04 R 4.0.3
> > > [1] 5 9 22 13 15 32 34 26 23 18 20 28 27 29 25 30 33 1 6 3 4 10 7 8 2
> > > [26] 21 14 11 12 19 16 17 31 24
> > >
> > > R version 4.0.2 (2020-06-22)
> > > Platform: x86_64-w64-mingw32/x64 (64-bit)
> > > Running under: Windows 10 x64 (build 18363)
> > >
> > > Matrix products: default
> > >
> > > locale:
> > > [1] C
> > > system code page: 1252
> > >
> > > attached base packages:
> > > [1] stats graphics grDevices utils datasets methods base
> > >
> > > loaded via a namespace (and not attached):
> > > [1] compiler_4.0.2 fortunes_1.5-4
> > >
> > > R Under development (unstable) (2021-01-13 r79826)
> > > Platform: x86_64-w64-mingw32/x64 (64-bit)
> > > Running under: Windows 10 x64 (build 18363)
> > >
> > > Matrix products: default
> > >
> > > locale:
> > > [1] C
> > >
> > > attached base packages:
> > > [1] stats graphics grDevices utils datasets methods base
> > >
> > > loaded via a namespace (and not attached):
> > > [1] compiler_4.1.0
> > >
> > > R version 4.0.3 (2020-10-10)
> > > Platform: x86_64-pc-linux-gnu (64-bit)
> > > Running under: Ubuntu 18.04.5 LTS
> > >
> > > Matrix products: default
> > > BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
> > > LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
> > >
> > > locale:
> > > [1] LC_CTYPE=C LC_NUMERIC=C
> > > [3] LC_TIME=C LC_COLLATE=C
> > > [5] LC_MONETARY=C LC_MESSAGES=nl_BE.UTF-8
> > > [7] LC_PAPER=nl_BE.UTF-8 LC_NAME=C
> > > [9] LC_ADDRESS=C LC_TELEPHONE=C
> > > [11] LC_MEASUREMENT=nl_BE.UTF-8 LC_IDENTIFICATION=C
> > >
> > > attached base packages:
> > > [1] stats graphics grDevices utils datasets methods base
> > >
> > > loaded via a namespace (and not attached):
> > > [1] compiler_4.0.3 fortunes_1.5-4
> > >
> > > ______________________________________________
> > > R-devel@r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> > --
> > Peter Dalgaard, Professor,
> > Center for Statistics, Copenhagen Business School
> > Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> > Phone: (+45)38153501
> > Office: A 4.23
> > Email: pd....@cbs.dk Priv: pda...@gmail.com