[R] Encoding issue

Sebastien Bihorel Mon, 05 Nov 2018 05:37:44 -0800

Hi,

I am having problems getting similar output when processing the same markdown 
files on 2 different Linux systems (one is a laptop with Linux Mint 18.3, the 
other is a production server running on CentOS 7). I think this boils down to 
an encoding issue but I am not sure if this is a system-wide issue or an R 
issue. So, this is what I have so far.


I have this very small dummy html file (with the same md5sum on both systems) 
which only contains 3 characters. A "od -cx" call provides the same output in 
both systems:
0000000   r 342 200 231   s  \n
           e272    9980    0a73

The middle character is some form of single quote produced by the conversion of 
a ' character from markdown to html. Reading the same file in both systems and 
applying a gsub replace provide widely different results.

####On my laptop
# environment variable: echo $LANG: en_US.UTF-8
> x <- scan('test.html', what='character', sep='\n')
Read 1 item
> x
[1] "r’s"
> gsub('\\s{2,}', ' ', x)
[1] "r’s"
> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.3

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.4.4

####On the server
# environment variable: echo $LANG: en_US.UTF-8
> x <- scan('test.html', what='character', sep='\n')
Read 1 item
> x
[1] "râs"
> gsub('\\s{2,}', ' ', x)
[1] " "
> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.3

(The overarching issue is that I have to use the production server for SOP 
reasons, so I cannot simply ignore the problem and use my laptop).

I would appreciate any suggestions on how to approach this issue.

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] Encoding issue

Reply via email to