Hi Milan,

a <- getURL(con, .encoding = "UTF-8")
Encoding(a)
> [1] "UTF-8"
a # Here the UTF-8 text looks fine.
htmlParse(a, encoding = "UTF-8") # again the same encoding issue

>> why didn't getURL() detect and set a's encoding correctly?

I think the problem is with the page itself, because other sites work fine.

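One thing that might be worth trying is to skip encoding detection altogether: take the raw bytes of the page and convert them explicitly from windows-1251 to UTF-8 before parsing. The sketch below uses base R only and simulates the page bytes, so it runs without network access; the names original, page_bytes and decoded are made up for illustration, and with RCurl the bytes could presumably come from getBinaryURL(u) instead.

```r
# "Привет" written with Unicode escapes so the script itself stays ASCII.
original <- "\u041f\u0440\u0438\u0432\u0435\u0442"

# Stand-in for the raw bytes a windows-1251 page would deliver.
# Assumption: in real use these would be downloaded, e.g. with RCurl.
page_bytes <- iconv(original, from = "UTF-8", to = "CP1251", toRaw = TRUE)[[1]]

# Convert explicitly from the known source encoding
# instead of relying on automatic detection.
decoded <- iconv(rawToChar(page_bytes), from = "CP1251", to = "UTF-8")

Encoding(decoded)             # encoding mark of the converted string
identical(decoded, original)  # does the round trip recover the text?
```

The converted string could then be handed to htmlParse(decoded, asText = TRUE, encoding = "UTF-8"). Whether that fixes this particular page has not been tested here.
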
2013/2/21 Milan Bouchet-Valat <nalimi...@club.fr>

> On Thursday 21 February 2013 at 16:04 +0400, Lawr Eskin wrote:
> > Hi Milan!
> >
> > > Encoding(a)
> > [1] "unknown"
> Hm, here I get "UTF-8", which is my locale encoding.
>
> I've tried a little more, and I discovered that using
> a <- getURL(u, .encoding="UTF-8")
> ensures that a is in the correct encoding here. I know this is not your
> problem, but it might help: check whether Encoding(a) is set to "UTF-8"
> or not in that case, and whether this fixes things.
>
> I'm not sure how htmlParse() detects the encoding when you pass it a
> character vector, but it probably uses Encoding(a), since that's the
> only reliable information; if it is missing, maybe it falls back to what
> the contents of the file say (maybe even before what the "encoding"
> argument says), which is windows-1251, and may not be the encoding in
> which getURL() saved the character vector. The question would then be:
> why didn't getURL() detect and set a's encoding correctly?
>
>
> My two cents
>
> > 2013/2/21 Milan Bouchet-Valat <nalimi...@club.fr>
> > On Thursday 21 February 2013 at 13:16 +0400, Lawr Eskin wrote:
> > > Hello dear R-help mailing list.
> > >
> > > Looks like the same issue in Russian:
> > >
> > > library(RCurl)
> > > library(XML)
> > > u = "http://www.cian.ru/cat.php?deal_type=2&obl_id=1&room1=1"
> > > a = getURL(u)
> > > a # Here - the Russian is fine.
> > > a2 <- htmlParse(a)
> > > a2 # Here it is a mess...
> > >
> > > None of these seem to fix it:
> > >
> > > htmlParse(a, encoding = "windows-1251")
> > > htmlParse(a, encoding = "CP1251")
> > > htmlParse(a, encoding = "cp1251")
> > > htmlParse(a, encoding = "iso8859-5")
> > >
> > > This is my locale:
> > >
> > > Sys.getlocale()
> > > "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"
> > >
> > > Any suggestions?
> >
> > What does Encoding(a) say?
> >
> > (FWIW, here on Linux even a is not in the correct encoding:
> > <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
> > "http://www.w3.org/TR/REC-html40/loose.dtd">
> > <html><head>
> > <title>ÐÐÐÑÐÐÐÐÐÐÐÑ Ð®Ð¤ÐÂЮÐÐЮЬÐÂÐ
> > ÐÐÐÂÐÑÐÑ ÐÐÐÑРаÐÐÐÐа
> > ÐÑ ÐÑ ÐÐЮÐ
> > ±ÐÐÐÑÐÒ Ðâ 11430 ЮÐÐÐÑÐÑÐÑЫÐÒÐÂÐÐЩ
> > Ю ÐÐаЮФРЦÐÒ Ð®Ð¤Ð
> > ЮÐÐЮЬÐ
> > Ð ÐÐÐÂле ÐÐÐÑРаÐÐÐÐа</title>
> > [...])
> >
> > Regards
> >
> > > Thank you very much in advance,
> > >
> > > Lavrentiy Eskin
> > > <http://www.eng.nvg.ru>
> > >
> > > [[alternative HTML version deleted]]
> > >
> > > ______________________________________________
> > > R-help@r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
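
Milan's suggestion to check Encoding(a) can be illustrated with a small base R sketch. It shows two things: how to read the charset a page declares in its markup, and that setting an encoding mark only relabels a string without converting its bytes. The <meta> line below is hypothetical (the thread only says the page declares windows-1251), and the three bytes stand in for windows-1251 page content.

```r
# Hypothetical <meta> line, not copied from the real page.
head_html <- "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=windows-1251\">"

# Pull out the charset that the markup declares.
hit <- regmatches(
  head_html,
  regexpr("charset=[\"']?[A-Za-z0-9_-]+", head_html, ignore.case = TRUE)
)
declared <- sub("^charset=[\"']?", "", hit, ignore.case = TRUE)

# Three bytes of CP1251-encoded Cyrillic, standing in for the page body.
body <- rawToChar(as.raw(c(0xcf, 0xf0, 0xe8)))
mark_before <- Encoding(body)  # rawToChar() leaves the string unmarked

# Setting the mark relabels the string; the bytes stay the same.
relabelled <- body
Encoding(relabelled) <- "UTF-8"
bytes_unchanged <- identical(charToRaw(relabelled), charToRaw(body))

# The relabelled bytes are not a valid UTF-8 sequence.
is_valid_utf8 <- validUTF8(relabelled)
```

So a string can report Encoding(a) as "UTF-8" while still holding bytes in another encoding, which might be one reason the parse still fails after the mark is set.
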