Hi Ryusuke,

I would use the encoding parameter of htmlParse() to download and parse the content in one operation:
htmlParse("http://home.sina.com", encoding = "UTF-8")
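If you want to use getURL() in RCurl, use the .encoding parameter (note the leading dot) rather than encoding, e.g. .encoding = "UTF-8", which is the charset the page declares in its meta tag.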
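A minimal sketch of the getURL() route, assuming the RCurl and XML packages are installed and that the page really is UTF-8 as its meta tag says. The helper name fetch_paragraphs and its enc argument are placeholders for illustration only, and nothing is downloaded until the function is called:

```r
# fetch_paragraphs() is an illustrative helper, not part of RCurl or XML.
fetch_paragraphs <- function(url, enc = "UTF-8") {
  # .encoding tells RCurl how to interpret the bytes the server returns.
  txt <- RCurl::getURL(url, .encoding = enc)
  # asText = TRUE: txt holds the HTML itself rather than a file name or URL.
  doc <- XML::htmlParse(txt, asText = TRUE, encoding = enc)
  # Text content of every <p> node, as in the original post.
  XML::xpathSApply(doc, "//p", XML::xmlValue)
}

# Example call (needs network access):
# g <- fetch_paragraphs("http://home.sina.com")
# head(g)
```

Passing the same encoding to both calls keeps the download step and the parse step consistent; whether the characters then display correctly still depends on your locale and console.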
You didn't tell us the output of Sys.getlocale()
or how your terminal/console is configured, so the result
may vary under your configuration, but the above works
for me on various machines with different settings.
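For reference, a rough sketch of how one might inspect the session's locale and a string's declared encoding using base R only. The sample string is a made-up example (the UTF-8 bytes of a two-character Chinese word), not content taken from the site:

```r
# Report the current locale and whether it is a UTF-8 locale.
print(Sys.getlocale())
print(l10n_info()[["UTF-8"]])

# Hypothetical sample: six bytes that encode two Chinese characters in UTF-8.
x <- rawToChar(as.raw(c(0xe4, 0xb8, 0xad, 0xe6, 0x96, 0x87)))

# Declare the encoding; this marks the string and leaves the bytes unchanged.
Encoding(x) <- "UTF-8"

n_chars <- nchar(x, type = "chars")  # counted as characters
n_bytes <- nchar(x, type = "bytes")  # counted as raw bytes
print(c(chars = n_chars, bytes = n_bytes))
```

If the character and byte counts differ as expected but the console still shows question marks, the problem is more likely the locale or terminal than the data itself.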
D.
Ryusuke Kenji wrote:
>
> Hi All,
>
> First method:-
> >library(XML)
>
> >theurl <- "http://home.sina.com"
> >download.file(theurl, "tmp.html")
>
> >txt <- readLines("tmp.html")
>
> >txt <- htmlTreeParse(txt, error=function(...){}, useInternalNodes = TRUE)
>
> >g <- xpathSApply(txt, "//p", function(x) xmlValue(x))
>
> >head(grep(" ", g, value=T))
>
>
> [1] "?????? | ?????? | ENGLISH"
> "??????????????? ???????????????"
> [3] "??????? ?????????? ??????????????????(???)"
> "?????????????????????????????? ????????????????????????"
> [5] " ???????????????????????????????????????" "? ??????????!
> ????? ??????! ????????????????????????!"
>
>
>
> Second method:-
> >library(RCurl)
>
> >theurl <- getURL("http://home.sina.com",encoding='GB2312')
>
> >Encoding(theurl)
>
> [1]"unknown"
>
> >txt <- readLines(con=textConnection(theurl),encoding='GB2312')
> >txt[5:10] #show the lines which occurred encoding problem.
> [1] "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\"
> />"
> [2] "<title>SINA.com US ????????? -??????</title>"
> [3] "<meta name=\"Keywords\" content=\"????????????, ???????????????,
> ???????????????, ??????????????????,????????????, SINA, US, News, Chinese,
> Asia\" />"
> [4] "<meta name=\"Description\"
> content=\"???????????????????????????????????????,
> ???????????????24????????????????????????????????, ????????????????????????,
> ????????????, ??????????????????, ????????????????????????, ?????????BBS,
> ???????????????????????????????????.\" />"
> [5]""
>
>
>
> [6] "<link rel=\"stylesheet\" type=\"text/css\"
> href=\"http://ui.sina.com/assets/css/style_home.css\" />"
>
> I am trying to read data from a Chinese-language website, but the Chinese
> characters are always unreadable. Is there a good way to handle this
> encoding problem in RCurl and XML?
>
>
> Regards,
> Ryusuke
>
>
> ______________________________________________
> [email protected] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
"There are men who can think no deeper than a fact" - Voltaire
Duncan Temple Lang [email protected]
Department of Statistics work: (530) 752-4782
4210 Mathematical Sciences Bldg. fax: (530) 752-7099
One Shields Ave.
University of California at Davis
Davis, CA 95616, USA

