Recent functionality in jsonlite allows for streaming JSON to a user-supplied connection object, such as a file, pipe or socket. RFC 7159 prescribes that JSON must be encoded as Unicode; ISO-8859 (including latin1) is invalid. Hence I would like R to write strings as utf8, irrespective of the type of connection, platform or locale. Implementing this turns out to be unsurprisingly difficult on Windows.
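
To make the difference concrete, here is a minimal sketch of the byte sequences involved. The variable names are only illustrative, and the `\u00fc` escape is used so that the snippet does not depend on the encoding of the source file.

```r
# The test word, written with a Unicode escape for "ü"
x <- enc2utf8("Z\u00fcrich")

# UTF-8 stores "ü" as two bytes (c3 bc)
utf8_bytes <- charToRaw(x)

# latin1 stores "ü" as a single byte (fc)
latin1_bytes <- charToRaw(iconv(x, from = "UTF-8", to = "latin1"))

print(utf8_bytes)    # 5a c3 bc 72 69 63 68
print(latin1_bytes)  # 5a fc 72 69 63 68
```

A file containing the first sequence is valid for JSON; a file containing the second is what I am trying to avoid.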

> string <- enc2utf8("Zürich")
> Encoding(string)
[1] "UTF-8"

For example, when writing the utf8 string to a binary utf8 connection, the output seems to be latin1:

> con <- file("test1.txt", open="wb", encoding = "UTF-8")
> writeLines(string, con)
> close(con)
> system("file test1.txt")
test1.txt: ISO-8859 text
> readLines("test1.txt", encoding="UTF-8")
[1] "Z\xfcrich"

I am not quite sure if this is a bug or expected. To avoid this and other problems, jsonlite uses the 'useBytes' argument, which is supposed to suppress re-encoding when writing to the connection. This is exactly what we need: use enc2utf8 to convert our string to utf8 and then pass it byte-by-byte to the connection:

> con <- file("test2.txt", open="wb", encoding = "UTF-8")
> writeLines(string, con, useBytes = TRUE)
> close(con)
> system("file test2.txt")
test2.txt: UTF-8 Unicode text
> readLines("test2.txt", encoding="UTF-8")
[1] "Zürich"

However, useBytes results in incorrect output for non-binary connections. I am not sure what the intention is here, but it looks as if the string gets re-encoded one time too often:

> con <- file("test3.txt", open="w", encoding = "UTF-8")
> writeLines(string, con, useBytes = TRUE)
> close(con)
> system("file test3.txt")
test3.txt: UTF-8 Unicode text, with CRLF line terminators
> readLines("test3.txt", encoding="UTF-8")
[1] "ZÃ¼rich"

Strangely, we do get utf8 output if we set the encoding of the connection to latin1. This suggests that there *is* some re-encoding going on, in contrast to what the useBytes manual states.

> con <- file("test4.txt", open="w", encoding = "latin1")
> writeLines(string, con, useBytes = TRUE)
> close(con)
> system("file test4.txt")
test4.txt: UTF-8 Unicode text, with CRLF line terminators
> readLines("test4.txt", encoding="UTF-8")
[1] "Zürich"

However useBytes is definitely not ignored either, because disabling it will (now correctly) write latin1 again:

> con <- file("test5.txt", open="w", encoding = "latin1")
> writeLines(string, con, useBytes = FALSE)
> close(con)
> system("file test5.txt")
test5.txt: ISO-8859 text, with CRLF line terminators
> readLines("test5.txt", encoding="UTF-8")
[1] "Z\xfcrich"

I am going to stop here. My primary question is: what is the best method to write a utf8 string as utf8 to an arbitrary connection object, without any re-encoding, that works on any platform and locale.

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel