On Mon, 18 Nov 2019 16:11:44 +0100 "Björn Fisseler" <bjoern.fisse...@googlemail.com> wrote:
> It's obviously the umlaut "ä" in this example which is encoded with
> two respectively three bytes. The question is how to change this?

Welcome to the wonderful world of Unicode-related problems!

It is, indeed, possible to represent the same glyph using either one code point (LATIN SMALL LETTER A WITH DIAERESIS, U+00E4) or two code points (LATIN SMALL LETTER A followed by COMBINING DIAERESIS, U+0061 U+0308). In UTF-8 the first form takes two bytes and the second takes three, which matches what you are seeing. (Other combinations of code points resulting in the same glyph are probably also possible.)

What you are looking for is called "Unicode normalization". It is implemented in the stringi package, in the functions stri_trans_nfc (normalization: there are multiple normal forms to choose from, but W3C guidelines recommend NFC) and stri_compare / stri_cmp (test for canonical equivalence).

See also: ?stringi::stri_cmp and https://stackoverflow.com/a/20684794

-- 
Best regards,
Ivan
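
P.S. A minimal sketch of the above, assuming the stringi package is installed. The variable names (composed, decomposed, normalized) are made up for illustration, and the two strings are written with \u escapes so the example does not depend on how your editor or file happens to encode "ä":

```r
library(stringi)

# Hypothetical example strings: the same glyph in two encodings.
composed   <- "\u00e4"    # LATIN SMALL LETTER A WITH DIAERESIS
decomposed <- "a\u0308"   # LATIN SMALL LETTER A + COMBINING DIAERESIS

# Byte counts differ: 2 bytes versus 3 bytes in UTF-8.
nchar(composed, type = "bytes")
nchar(decomposed, type = "bytes")

# Code point counts differ as well: 1 versus 2.
stri_length(composed)
stri_length(decomposed)

# A plain comparison sees two different strings.
composed == decomposed

# After normalizing to NFC, both use the single code point.
normalized <- stri_trans_nfc(decomposed)
normalized == composed

# Alternatively, test canonical equivalence without normalizing first.
stri_cmp_equiv(composed, decomposed)
```

If the strings come from different sources (file names, user input, scraped text), it is probably simplest to run everything through stri_trans_nfc once on input, so that ordinary ==, match() and merge() behave as expected afterwards.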