Hi, I am having problems getting the same output when processing the same markdown files on two different Linux systems (one is a laptop running Linux Mint 18.3, the other is a production server running CentOS 7). I think this boils down to an encoding issue, but I am not sure whether it is a system-wide issue or an R issue. This is what I have so far.
I have a very small dummy html file (with the same md5sum on both systems) which contains only 3 characters. An "od -cx" call gives the same output on both systems:

0000000   r 342 200 231   s  \n
           e272    9980    0a73

The middle character is some form of single quote produced by the conversion of a ' character from markdown to html. Reading the same file on both systems and applying a gsub replacement gives very different results.

####On my laptop

# environment variable: echo $LANG: en_US.UTF-8
> x <- scan('test.html', what='character', sep='\n')
Read 1 item
> x
[1] "r’s"
> gsub('\\s{2,}', ' ', x)
[1] "r’s"
> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.3

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.4

####On the server

# environment variable: echo $LANG: en_US.UTF-8
> x <- scan('test.html', what='character', sep='\n')
Read 1 item
> x
[1] "râs"
> gsub('\\s{2,}', ' ', x)
[1] " "
> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.3

(The overarching issue is that I have to use the production server for SOP reasons, so I cannot simply ignore the problem and use my laptop.)

I would appreciate any suggestions on how to approach this issue.

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
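
P.S. For anyone who wants to reproduce this: the following is a minimal sketch, assuming a POSIX shell with printf, od and md5sum available. It should recreate test.html from the bytes shown in the od output above and print the values worth comparing between the two machines. The octal escapes are copied from that od output; the file name is the one used in the R calls above.

```shell
# Recreate the test file byte for byte: 'r', the three bytes
# 342 200 231 (octal, as printed by od -c), 's', and a newline.
printf 'r\342\200\231s\n' > test.html

# Same inspection as in the post; the 16-bit words printed by -x
# depend on the machine's byte order, the -c row does not.
od -cx test.html

# Values to compare between the laptop and the server.
md5sum test.html
echo "LANG=${LANG:-unset}"
```

If the checksum and the od output agree on both machines, the file itself is presumably not the source of the difference, which would point back at how each R session reads or displays it.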