Re: [R] Problem comparing two strings

Ivan Krylov Mon, 18 Nov 2019 07:36:00 -0800

On Mon, 18 Nov 2019 16:11:44 +0100
"Björn Fisseler" <bjoern.fisse...@googlemail.com> wrote:


> It's obviously the umlaut "ä" in this example which is encoded with
> two respectively three bytes. The question is how to change this?

Welcome to the wonderful world of Unicode-related problems! It is,
indeed, possible to represent the same glyph using either one
code point (LATIN SMALL LETTER A WITH DIAERESIS) or two code points
(LATIN SMALL LETTER A followed by COMBINING DIAERESIS). In UTF-8 the
first form takes two bytes and the second takes three, which matches
what you are seeing. (Other combinations of code points resulting in
the same glyph are probably also possible.)
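
To make the two representations concrete, here is a minimal base-R
sketch. The variable names are illustrative and not taken from the
original post; enc2utf8() is applied so that the byte counts do not
depend on the session's native encoding.

```r
# Two ways to write "ä" (variable names are illustrative only)
composed   <- enc2utf8("\u00e4")   # LATIN SMALL LETTER A WITH DIAERESIS
decomposed <- enc2utf8("a\u0308")  # LATIN SMALL LETTER A + COMBINING DIAERESIS

utf8ToInt(composed)     # one code point:  228 (U+00E4)
utf8ToInt(decomposed)   # two code points: 97 (U+0061), 776 (U+0308)

charToRaw(composed)     # two bytes in UTF-8:   c3 a4
charToRaw(decomposed)   # three bytes in UTF-8: 61 cc 88

composed == decomposed  # FALSE: the code point sequences differ
```

Both strings usually print identically, so inspecting code points or
raw bytes like this is often the quickest way to confirm the mismatch.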

What you are looking for is called "Unicode normalization", and it is
implemented in the stringi package. stri_trans_nfc performs the
normalization (there are multiple normal forms to choose from, but W3C
guidelines recommend NFC), and stri_compare / stri_cmp test for
canonical equivalence.
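
A minimal sketch of both approaches, assuming the stringi package is
installed. The variable names are illustrative. stri_cmp_equiv is not
named above; as far as I can tell it is documented on the same help
page as stri_cmp and is the variant that returns a logical value.

```r
library(stringi)  # assumes stringi is installed

composed   <- "\u00e4"   # precomposed ä (one code point)
decomposed <- "a\u0308"  # a + combining diaeresis (two code points)

# A plain comparison sees two different code point sequences
composed == decomposed                                   # FALSE

# Option 1: bring both sides to NFC, then compare as usual
stri_trans_nfc(composed) == stri_trans_nfc(decomposed)   # TRUE

# Option 2: ask stringi directly whether the strings are
# canonically equivalent, without changing them
stri_cmp_equiv(composed, decomposed)                     # TRUE
```

Normalizing once, right after reading the data, is probably the more
convenient choice when the strings are later used as keys in merge(),
match() or %in%, since those compare the stored code points.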

See also: ?stringi::stri_cmp and https://stackoverflow.com/a/20684794

-- 
Best regards,
Ivan

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.