On Thursday 21 February 2013 at 18:53 +0400, Lawr Eskin wrote:
> I tried iconv before in various ways: same issue and same result, with
> encoding = unknown.
> Now I tried sub: same issue.

This procedure works on Linux, but not on Windows:
library(RCurl)
library(XML)
u <- "http://www.cian.ru/cat.php?deal_type=2&obl_id=1&room1=1"
a <- getURL(u, .encoding="UTF-8")
a <- iconv(a, "windows-1251", "UTF-8")
a2 <- htmlParse(sub("windows-1251", "UTF-8", a))
a2

But maybe the problem is more general, and related to conversion between
encodings on Windows. What looks weird to me is that on Windows, I'm not
able to save a character string to a file in UTF-8, despite what ?file
says:

x <- "Все права защищены"
Encoding(x) # UTF-8
con <- file("foo", "w", encoding="UTF-8"); cat(x, file=con); close(con)
x2 <- readLines(con <- file("foo", "r", encoding="UTF-8")); close(con)
Encoding(x2) # unknown
x2 # [1] "<U+041A><U+0443>..."

I know the problem happens on write because the file cannot be read
correctly on Linux either. This Windows machine uses Windows Server 2008
with French_France.1252 locale.

> 2013/2/21 Milan Bouchet-Valat <nalimi...@club.fr>
> > On Thursday 21 February 2013 at 18:31 +0400, Lawr Eskin wrote:
> > > Hi Milan,
> > >
> > > > a <- getURL(con, .encoding = "UTF-8")
> > > > Encoding(a)
> > > [1] "UTF-8"
> > > > a   # Here the UTF-8 codes look fine.
> > > > htmlParse(a, encoding = "UTF-8")   ### again the same encoding issue
> >
> > And what if you try this:
> > a2 <- htmlParse(sub("windows-1251", "UTF-8", a))
> >
> > or this:
> > a2 <- htmlParse(iconv(a, "windows-1251", "UTF-8"))
> >
> > Cheers
> >
> > > >>why didn't getURL() detect and set a's encoding correctly?
> > > I think it is an issue with this page, because other sites work fine.
> > >
> > > 2013/2/21 Milan Bouchet-Valat <nalimi...@club.fr>
> > > > On Thursday 21 February 2013 at 16:04 +0400, Lawr Eskin wrote:
> > > > > Hi Milan!
> > > > >
> > > > > > Encoding(a)
> > > > > [1] "unknown"
> > > >
> > > > Hm, here I get "UTF-8", which is my locale encoding.
> > > >
> > > > I've tried a little more, and I discovered that using
> > > > a <- getURL(u, .encoding="UTF-8")
> > > > ensures that a is in the correct encoding here.
> > > > I know this is not your problem, but it might help: check whether
> > > > Encoding(a) is set to "UTF-8" or not in that case, and whether
> > > > this fixes things.
> > > >
> > > > I'm not sure how htmlParse() detects the encoding when you pass it
> > > > a character vector, but it probably uses Encoding(a), since that's
> > > > the only reliable information; if it is missing, maybe it falls
> > > > back to what the contents of the file say (maybe even before what
> > > > the "encoding" argument says), which is windows-1251, and may not
> > > > be the encoding in which getURL() saved the character vector. The
> > > > question would then be: why didn't getURL() detect and set a's
> > > > encoding correctly?
> > > >
> > > > My two cents
> > > >
> > > > > 2013/2/21 Milan Bouchet-Valat <nalimi...@club.fr>
> > > > > > On Thursday 21 February 2013 at 13:16 +0400, Lawr Eskin wrote:
> > > > > > > Hello dear R-help mailing list.
> > > > > > >
> > > > > > > Looks like the same issue in Russian:
> > > > > > >
> > > > > > > library(RCurl)
> > > > > > > library(XML)
> > > > > > > u = "http://www.cian.ru/cat.php?deal_type=2&obl_id=1&room1=1"
> > > > > > > a = getURL(u)
> > > > > > > a # Here - the Russian is fine.
> > > > > > > a2 <- htmlParse(a)
> > > > > > > a2 # Here it is a mess...
> > > > > > >
> > > > > > > None of these seem to fix it:
> > > > > > >
> > > > > > > htmlParse(a, encoding = "windows-1251")
> > > > > > > htmlParse(a, encoding = "CP1251")
> > > > > > > htmlParse(a, encoding = "cp1251")
> > > > > > > htmlParse(a, encoding = "iso8859-5")
> > > > > > >
> > > > > > > This is my locale:
> > > > > > >
> > > > > > > Sys.getlocale()
> > > > > > > "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"
> > > > > > >
> > > > > > > Any suggestions?
> > > > > >
> > > > > > What does Encoding(a) say?
> > > > > >
> > > > > > (FWIW, here on Linux even a is not in the correct encoding:
> > > > > > <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
> > > > > > "http://www.w3.org/TR/REC-html40/loose.dtd">
> > > > > > <html><head>
> > > > > > <title>ГЉГіГЇГЁГІГј îäГîêîìГГ ГІГ ГіГѕ ГЄГўГ ðòèð Гі Гў Ìîà ±ГЄГўГҐ В— 11430 îáúÿâëåГГЁГ© Г® ïðîäà æå îäà îêîìà Г ГІГûõ êâà ðòèð</title>
> > > > > > [...])
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > > Thank you very much in advance,
> > > > > > >
> > > > > > > Lavrentiy Eskin
> > > > > > > <http://www.eng.nvg.ru>
> > > > > > >
> > > > > > > [[alternative HTML version deleted]]
> > > > > > >
> > > > > > > ______________________________________________
> > > > > > > R-help@r-project.org mailing list
> > > > > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > > > > PLEASE do read the posting guide
> > > > > > > http://www.R-project.org/posting-guide.html
> > > > > > > and provide commented, minimal, self-contained, reproducible
> > > > > > > code.
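
For the file-writing problem described at the top of this message, one commonly used approach is to stop R from re-encoding through the native locale: open the connection without an encoding argument and write the string's bytes as they are with useBytes = TRUE. The following is only a sketch under assumptions that this thread does not confirm: the path is a hypothetical temporary file, the string is first converted with enc2utf8(), and it has not been checked on the Windows Server 2008 machine mentioned above.

```r
# Hypothetical output path; any writable location would do.
path <- tempfile(fileext = ".txt")

# "Все права" written with \u escapes so the script itself is pure ASCII.
x <- enc2utf8("\u0412\u0441\u0435 \u043f\u0440\u0430\u0432\u0430")

# No 'encoding' argument on the connection: R is not asked to convert.
# useBytes = TRUE writes the UTF-8 bytes of x unchanged.
con <- file(path, open = "wb")
writeLines(x, con, useBytes = TRUE)
close(con)

# Read the bytes back and declare them to be UTF-8.
x2 <- readLines(path, encoding = "UTF-8")

Encoding(x2)                               # encoding mark of the result
identical(charToRaw(x2), charToRaw(x))     # byte-for-byte comparison
```

If the byte comparison holds, the file on disk contains the UTF-8 bytes of x, which is the property the foo example above was checking. Whether this avoids the htmlParse() issue discussed in the thread is a separate question.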