WinXP-x32, R-2.13.0

Dear list,

I have a problem that (I think) relates to the interaction between Windows and R. I am trying to scrape a table with data on the Hawai'ian Islands. This is my code:

    library(XML)
    u <- "http://en.wikipedia.org/wiki/Hawaii"
    tables <- readHTMLTable(u)
    Islands <- tables[[5]]

The output is (first set of columns):

    > Islands
             Island            Nickname Location
    1    Hawaiûi[7]      The Big Island 19ð34′N 155ð30′W / 19.567ðN 155.5ðW / 19.567; -155.5
    2       Maui[8]     The Valley Isle 20ð48′N 156ð20′W / 20.8ðN 156.333ðW / 20.8; -156.333
    3 Kahoûolawe[9]     The Target Isle 20ð33′N 156ð36′W / 20.55ðN 156.6ðW / 20.55; -156.6
    4    LÃnaûi[10]  The Pineapple Isle 20ð50′N 156ð56′W / 20.833ðN 156.933ðW / 20.833; -156.933
    5  Molokaûi[11]   The Friendly Isle 21ð08′N 157ð02′W / 21.133ðN 157.033ðW / 21.133; -157.033
    6     Oûahu[12] The Gathering Place 21ð28′N 157ð59′W / 21.467ðN 157.983ðW / 21.467; -157.983
    7    Kauaûi[13]     The Garden Isle 22ð05′N 159ð30′W / 22.083ðN 159.5ðW / 22.083; -159.5
    8   Niûihau[14]  The Forbidden Isle 21ð54′N 160ð10′W / 21.9ðN 160.167ðW / 21.9; -160.167
As you can see, there are "weird" characters in there. I have also tried readHTMLTable(u, encoding = "UTF-16") and readHTMLTable(u, encoding = "UTF-8"), but that didn't help. It seems to me that the problem may lie in how R interacts with the Windows character-set settings. sessionInfo() gives:

    > sessionInfo()
    R version 2.13.0 (2011-04-13)
    Platform: i386-pc-mingw32/i386 (32-bit)

    locale:
    [1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252
    [3] LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C
    [5] LC_TIME=Dutch_Netherlands.1252

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] XML_3.2-0.2

I have also attempted to make R use another locale with Sys.setlocale("LC_ALL", "en_US.UTF-8"), but this yields:

    > Sys.setlocale("LC_ALL", "en_US.UTF-8")
    [1] ""
    Warning message:
    In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
      OS reports request to set locale to "en_US.UTF-8" cannot be honored

In addition, I have tried to make the change directly from the Windows command prompt, using "chcp 65001" and variations of that, but that didn't change anything.

I have searched the list and the web and have found others reporting similar issues, but I have not been able to find a solution. It looks like this is an issue of how Windows and R interact. Unfortunately, all three computers at my disposal have this problem. It occurs both under WinXP-x32 and under Win7-x86.

Is there a way to make R override the Windows settings, or can the issue be solved otherwise? I have also tried other websites, and the issue occurs every time there is an é, Ì, À, î, et cetera in the text to be scraped.

Thank you,
Roger

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
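
For readers who want to reproduce the symptom without the network or the XML package, here is a minimal base-R sketch. It rests on an assumption that the post does not confirm: that the garbling arises when UTF-8 bytes are interpreted as Windows-1252 (CP1252), the code page shown in the locale above. Whether that is what actually happens inside readHTMLTable() on this system is not established here. The helper fix_misread() is a hypothetical name introduced only for illustration; it is not part of R or of the XML package.

```r
# Sketch only: assumes the garbling comes from UTF-8 bytes being
# interpreted as Windows-1252 (CP1252). That cause is a guess.

# "é" written as a Unicode escape so this script itself is plain ASCII.
x <- "\u00e9"

# The UTF-8 representation of "é" is the two bytes c3 a9.
utf8_bytes <- charToRaw(enc2utf8(x))
print(utf8_bytes)

# Treating those two bytes as CP1252 text yields two characters
# instead of one, which is the kind of garbling seen when scraping.
misread <- iconv(rawToChar(utf8_bytes), from = "CP1252", to = "UTF-8")
print(nchar(misread, type = "chars"))

# Hypothetical helper: undo the misreading by converting back to the
# CP1252 bytes and declaring them to be UTF-8.
fix_misread <- function(s) {
  out <- iconv(s, from = "UTF-8", to = "CP1252")
  Encoding(out) <- "UTF-8"
  out
}
print(identical(fix_misread(misread), enc2utf8(x)))
```

If the assumption holds, something like fix_misread() could in principle be applied to the character columns of the scraped table; if the output is unchanged or becomes NA, the garbling has a different cause.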