I confirm that the original problem doesn't happen in R 3.1.1 on Windows (XP, this time). That is,
source("http://psych.ut.ee/~R/test-utf8.txt")

... no longer crashes R but gives a sensible (i.e., understandable, after this discussion) error.

... and adding encoding="UTF-8-BOM" reads in the file correctly.

On Thu, Jul 10, 2014 at 5:50 PM, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:
> On 10/07/2014 9:53 AM, Kenn Konstabel wrote:
>>
>> Wow. Thanks a lot!
>>
>> source("http://psych.ut.ee/~nek/R/test-utf8.txt", encoding="UTF-8-BOM")
>> # works correctly on my Windows 7 machine
>> # (and without encoding argument it still crashes R)
>>
>> Kenn
>>
>> On Thu, Jul 10, 2014 at 4:33 PM, John McKown
>> <john.archie.mck...@gmail.com> wrote:
>> > On Thu, Jul 10, 2014 at 7:18 AM, Kenn Konstabel <lebats...@gmail.com>
>> > wrote:
>> >> Dear all,
>> >>
>> >> I found an unexpected behaviour when trying to `source` a utf-8 file
>> >> on windows 7:
>> >>
>> >> source("http://psych.ut.ee/~nek/R/test-utf8.txt")
>> >>
>> >> # Rgui.exe reacts:
>> >> # R for windows GUI has stopped working. A problem caused the program to stop working correctly.
>> >> # Windows will close the program and notify you if a solution is available.
>> >>
>> >> The same will happen with R.exe ("terminal") and R running within
>> >> RStudio. (Session and locale info below).
>> >>
>> >> However, a non-utf version of this little script can be `source`d
>> >> without problems.
>> >>
>> >> source("http://psych.ut.ee/~nek/R/test.txt")
>> >>
>> >> Adding the `encoding` argument to `source` helps a little:
>> >>
>> >> source("http://psych.ut.ee/~nek/R/test-utf8.txt", encoding="utf-8")
>> >> # unsure about the spelling of utf-8 so I also tried UTF8, utf8, and UTF-8
>> >> # ... with the same result in all cases
>> >>
>> >> R doesn't crash any more but gives the following error:
>> >>
>> >> # Error in source("http://psych.ut.ee/~nek/R/test-utf8.txt", encoding = "utf-8") :
>> >> # http://psych.ut.ee/~nek/R/test-utf8.txt:2:0: unexpected end of input
>> >> # 1: ?
>> >> # ^
>> >> # In addition: Warning message:
>> >> # In readLines(file, warn = FALSE) :
>> >> # invalid input found on input connection 'http://psych.ut.ee/~nek/R/test-utf8.txt'
>> >
>> > I just tried that. On Windows XP/Pro, R 3.1.0 didn't fail, but did
>> > get the error you mention later. I used "wget" to actually download
>> > the file mentioned (on Linux). I think that the problem _may_ be that
>> > the file starts with a BOM (Byte Order Mark), which is 0xef, 0xbb,
>> > 0xbf. This is supposed to tell us that this is UTF-8.
>> >
>> > BOM: http://en.wikipedia.org/wiki/Byte_order_mark
>> >
>> > I get an identical error with R 3.1.0 on both Windows XP/Pro and Linux
>> > Fedora 20. The problem is that the R readLines() apparently does not
>> > like the leading BOM. It reads it as data. Most other Linux and
>> > Windows applications _do_ understand the BOM and so, when you use
>> > them, they work properly. And, normally, when you then save the file,
>> > the software does not write the BOM at the start. So it works on the
>> > saved version of the file.
>> >
>> > Being the curious sort, I decided to look at the source to R. In
>> > particular in ~/R/src/main/connections.c I saw where it did support
>> > the reading of BOMs. But there is a special way to do it! Which I
>> > cannot find in the documentation.
>> >
>> > source("http://psych.ut.ee/~nek/R/test-utf8.txt",encoding="UTF-8-BOM");
>> >
>> > I tried the above AND IT WORKED properly!
>> >
>> > I simply adore having source code.
>
>
> Searching the source for the string "UTF-8-BOM" finds it mentioned in the
> docs in 3 places: in the NEWS file,
> in the R Data Import/Export manual, and in the ?connections help page.
>
> Duncan Murdoch
>
>> >
>> >
>> >>
>> >> I thought maybe that's because what notepad told me is UTF-8 is
>> >> actually something else ... so I did two more experiments.
>> >>
>> >> source("http://psych.ut.ee/~nek/R/test2.R")
>> >> # this was created on a linux machine with leafpad, and saved as utf-8 text
>> >> # it can be source'd on windows
>> >>
>> >> source("http://psych.ut.ee/~nek/R/test3.R")
>> >> # the same as previous but o's in file were replaced by ö's
>> >> # can be source'd on windows but the "ö" character is shown as ƶ
>> >> # except if you add encoding="utf-8" - then it works as expected
>> >>
>> >> So in sum, I can create "plain text" (saved with utf-8 encoding) files
>> >> on windows that cannot be sourced to R on windows, or will crash R
>> >> (depending on how you source them). The same files can be sourced on
>> >> linux without problems. Part of the problem is obviously in windows
>> >> but at least R shouldn't crash.
>> >>
>> >> Session info:
>> >>
>> >> R version 3.0.2 (2013-09-25)
>> >> Platform: i386-w64-mingw32/i386 (32-bit)
>> >>
>> >> locale:
>> >> [1] LC_COLLATE=Estonian_Estonia.1257 LC_CTYPE=Estonian_Estonia.1257
>> >> [3] LC_MONETARY=Estonian_Estonia.1257 LC_NUMERIC=C
>> >> [5] LC_TIME=Estonian_Estonia.1257
>> >>
>> >> attached base packages:
>> >> [1] stats graphics grDevices utils datasets methods base
>> >>
>> >> loaded via a namespace (and not attached):
>> >> [1] tools_3.0.2
>> >>
>> >>
>> >> OS: Windows 7
>> >>
>> >> Linux Mint Debian Edition and R 3.0.2 on the other machine (where
>> >> everything worked).
>> >>
>> >> Context:
>> >>
>> >> I was trying to find out how to make files that could be source'd on
>> >> both windows and linux. This is partly solved so I have no specific
>> >> question other than "is this a bug in windows version?" but any
>> >> comments on the general topic would be appreciated too.
>> >>
>> >> Best regards,
>> >>
>> >> Kenn
>> >>
>> >>
>> >> Kenn Konstabel
>> >> Research fellow
>> >> Department of chronic diseases
>> >> National Institute of Health Development
>> >> Hiiu 42
>> >> Tallinn
>> >> Estonia
>> >
>> > --
>> > There is nothing more pleasant than traveling and meeting new people!
>> > Genghis Khan
>> >
>> > Maranatha! <><
>> > John McKown
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
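
P.S. For anyone finding this thread later, here is a minimal sketch of the workaround discussed in this thread. It is not taken from any of the messages: the temporary file, the one-line script, and the names `script`, `bom`, `has_bom` and `env` are assumptions made up for illustration. It writes a file that starts with the UTF-8 BOM bytes 0xEF 0xBB 0xBF, checks the first three bytes, and passes encoding="UTF-8-BOM" to `source` only when the BOM is present.

```r
# Hypothetical example file: a one-line R script with a UTF-8 BOM in front,
# imitating what a Windows editor may write when saving as "UTF-8".
script <- tempfile(fileext = ".R")
bom <- as.raw(c(0xef, 0xbb, 0xbf))
writeBin(c(bom, charToRaw("x <- 1 + 1\n")), script)

# Compare the first three bytes of the file with the BOM.
has_bom <- identical(readBin(script, what = "raw", n = 3L), bom)

# Choose the encoding name so that the connection skips the BOM if present.
enc <- if (has_bom) "UTF-8-BOM" else "UTF-8"

# Evaluate the script in its own environment to keep the workspace clean.
env <- new.env()
source(script, local = env, encoding = enc)
print(env$x)

unlink(script)
```

The same byte check should also work on a file downloaded to disk first (for example with download.file), which avoids guessing whether a given editor wrote a BOM.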