Hi Ivan,

0xe2 0x80 0x99 seems to be the UTF-8 hex code for Unicode Character 'RIGHT 
SINGLE QUOTATION MARK', which would make sense in the context.

Using the encoding argument for the scan call does not change the outcome.

Looking at the server side a bit more, some colleagues pointed out that the 
"râs" display could be a side-effect of encoding issue with Putty (which I used 
to connect to the remote server). Changing the setting of Putty display, I get 
the correct display "r’s"... However, that does not change anything to the gsub 
issue...

Sebastien

----- Original Message -----
From: "Ivan Krylov" <krylov.r...@gmail.com>
To: "Sebastien Bihorel" <sebastien.biho...@cognigencorp.com>
Cc: r-help@r-project.org
Sent: Monday, November 5, 2018 2:34:02 PM
Subject: Re: [R] Encoding issue

On Mon, 5 Nov 2018 08:36:13 -0500 (EST)
Sebastien Bihorel <sebastien.biho...@cognigencorp.com> wrote:

> [1] "râs"

Interesting. This is what I get if I decode the bytes 72 e2 80 99 73 0a
as latin-1 instead of UTF-8. They look like there is only three
characters, but, actually, there is more:

$ perl -CSD -Mcharnames=:full -MEncode=decode \
 -E'for (split //, decode latin1 => pack "H*", "72e28099730a")
 { say ord, " ", $_, " ", charnames::viacode(ord) }'
114 r LATIN SMALL LETTER R
226 â LATIN SMALL LETTER A WITH CIRCUMFLEX
128  PADDING CHARACTER
153  SINGLE GRAPHIC CHARACTER INTRODUCER
115 s LATIN SMALL LETTER S
10 
 LINE FEED

Does it help if you explicitly specify the file encoding by passing
fileEncoding="UTF-8" argument to scan()?

-- 
Best regards,
Ivan

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to