Hi Prof,

Thank you for your reply. Sorry that I left out the information below.

> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

I have just noticed that Traditional Chinese characters cause the encoding problem, while Simplified Chinese works fine.

> library(RCurl)
> theurl <- getURL("http://home.sina.com",encoding='utf8')
> #Encoding(theurl)
> #[1]"latin1"
> txt <- readLines(con=textConnection(theurl),encoding='utf8')
> write.table(file='D:/fileas.txt',txt)

When I open fileas.txt, the Traditional Chinese characters are readable in Notepad, but when I try to read the file into Rgui:

> smple <- scan('D:/fileas.txt',what='')

the characters become unrecognisable again. I was wondering whether Rgui supports Traditional Chinese characters now... I think I need to look for a way to translate between the Chinese character sets.

Thank you.

Best,
Ryusuke

===============================================

Hi Ryusuke

I would use the encoding parameter of htmlParse() and download and parse the content in one operation:

  htmlParse("http://home.sina.com", encoding = "UTF-8")

If you want to use getURL() in RCurl, use the .encoding parameter.

You didn't tell us the output of Sys.getlocale() or how your terminal/console is configured, so the above may vary under your configuration, but it works on various machines for me with different settings.

 D.

Ryusuke Kenji wrote:
>
> Hi All,
>
> First method:-
> >library(XML)
> >theurl <- "http://home.sina.com"
> >download.file(theurl, "tmp.html")
> >txt <- readLines("tmp.html")
> >txt <- htmlTreeParse(txt, error=function(...){}, useInternalNodes = TRUE)
> >g <- xpathSApply(txt, "//p", function(x) xmlValue(x))
> >head(grep(" ", g, value=T))
> [1] "?????? | ?????? | ENGLISH"
> "??????????????? ???????????????"
> [3] "??????? ?????????? ??????????????????(???)"
> "?????????????????????????????? ????????????????????????"
> [5] " ???????????????????????????????????????" "? ??????????!
> ????? ??????! ????????????????????????!"
>
> Second method:-
> >library(RCurl)
> >theurl <- getURL("http://home.sina.com",encoding='GB2312')
> >Encoding(theurl)
> [1]"unknown"
> >txt <- readLines(con=textConnection(theurl),encoding='GB2312')
> >txt[5:10] #show the lines where the encoding problem occurred.
> [1] "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\"
> />"
> [2] "<title>SINA.com US ????????? -??????</title>"
> [3] "<meta name=\"Keywords\" content=\"????????????, ???????????????,
> ???????????????, ??????????????????,????????????, SINA, US, News, Chinese,
> Asia\" />"
> [4] "<meta name=\"Description\"
> content=\"???????????????????????????????????????,
> ???????????????24????????????????????????????????, ????????????????????????,
> ????????????, ??????????????????, ????????????????????????, ?????????BBS,
> ???????????????????????????????????.\" />"
> [5]""
> [6] "<link rel=\"stylesheet\" type=\"text/css\"
> href=\"http://ui.sina.com/assets/css/style_home.css\" />"
>
> I am trying to read data from a Chinese-language website, but the Chinese
> characters are always unreadable. Is there a good way to cope with such an
> encoding problem in RCurl and XML?
>
> Regards,
> Ryusuke
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
"There are men who can think no deeper than a fact" - Voltaire

Duncan Temple Lang                dun...@wald.ucdavis.edu
Department of Statistics          work:  (530) 752-4782
4210 Mathematical Sciences Bldg.  fax:   (530) 752-7099
One Shields Ave.
University of California at Davis
Davis, CA 95616, USA
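
A note on the read-back step from the reply at the top of this thread (text written to a file, then read with scan() under an English_United States.1252 locale): the sketch below is only an illustration, not something posted by either author. It assumes the file on disk holds UTF-8 bytes, and it uses a made-up sample string and a temporary file in place of the sina.com page and D:/fileas.txt. The point it tries to show is that readLines() and scan() treat input as being in the native encoding unless told otherwise, so the encoding has to be declared on the way in.

```r
# Hypothetical sample: four Traditional Chinese characters, written with
# \u escapes so the script itself stays plain ASCII.
txt <- "\u7e41\u9ad4\u4e2d\u6587"

# Hypothetical path; stands in for the file used in the thread.
path <- tempfile(fileext = ".txt")

# Write the string as raw UTF-8 bytes, independent of the session locale.
con <- file(path, open = "wb")
writeLines(enc2utf8(txt), con, useBytes = TRUE)
close(con)

# Read back, declaring the input as UTF-8. The 'encoding' argument only
# marks the strings; it does not convert them to the native encoding.
back   <- readLines(path, encoding = "UTF-8")
tokens <- scan(path, what = "", encoding = "UTF-8", quiet = TRUE)

# Compare code points rather than printed output, since what the console
# can display depends on the locale and fonts.
identical(utf8ToInt(back), utf8ToInt(txt))
identical(utf8ToInt(tokens), utf8ToInt(txt))
```

Even when the strings are marked correctly, a console running in a CP1252 locale may still be unable to draw the characters, so checking code points (or writing the result back out as UTF-8 and opening it in an editor) is a more reliable test than looking at the printed value.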