Hi All,

First method:

> library(XML)
> theurl <- "http://home.sina.com"
> download.file(theurl, "tmp.html")
> txt <- readLines("tmp.html")
> txt <- htmlTreeParse(txt, error=function(...){}, useInternalNodes = TRUE)
> g <- xpathSApply(txt, "//p", function(x) xmlValue(x))
> head(grep(" ", g, value=T))
[1] "ç¹é« | ç°¡é« | ENGLISH" "女æ²å µç«å´ é»é¢¨æ伺å"
[3] "鬼åé ç¾å°å¥³ é¸ç¾çå·ç¨±å¾(å)" "ï¼åéï¼æ§æèªç©ºå»£å ä¿å裸空å§æ´é£æ©"
[5] " åæµ·åå¿æ ¶éåä¹å°ç£ç°å³¶æ¸¸" "é è³¼æ©ç¥¨! éæ 游å¡! æ½äºæé åºåæ©ç¥¨!"

Second method:

> library(RCurl)
> theurl <- getURL("http://home.sina.com",encoding='GB2312')
> Encoding(theurl)
[1] "unknown"
> txt <- readLines(con=textConnection(theurl),encoding='GB2312')
> txt[5:10]  # show the lines where the encoding problem occurs
[1] "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />"
[2] "<title>SINA.com US æ°æµªç¶² -åç¾</title>"
[3] "<meta name=\"Keywords\" content=\"åç¾æ°æµª, æ°æµªåç¾ç«, ç¾åä¸æ網, åç¾ä¸æ網ç«,è¯äººç¶²ç«, SINA, US, News, Chinese, Asia\" />"
[4] "<meta name=\"Description\" content=\"åç¾å°åæ大çä¸æ網絡åªé«, çºæµ·å¤è¯äºº24å°æä¸éæ·æä¾æµ·éè³è¨, å §å®¹å æ¬ææ°æ°è, å¨æ¨è¨æ¯, 實ç¨ç§»æ°è³è¨, è¡å¸å¯å¸è²¡ç¶ä¿¡æ¯, é«äººæ°£BBS, é¢ååç¾è¯äººç交åå¹³å°ç.\" />"
[5] ""
[6] "<link rel=\"stylesheet\" type=\"text/css\" href=\"http://ui.sina.com/assets/css/style_home.css\" />"

I am trying to read data from a Chinese-language website, but with both methods the Chinese characters come out unreadable. Does anyone have a good idea for handling this encoding problem with RCurl and XML?

Regards,
Ryusuke
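One possible cause, offered as a guess and not a confirmed diagnosis: the <meta> tag in the output above declares charset=utf-8, while the second method asks for GB2312. The sketch below needs no network access and uses an assumed six-byte sample (the UTF-8 bytes for U+65B0 U+6D6A) to show how the same bytes look when read under the wrong and the declared encoding. The variable names are made up for illustration.

```r
# Assumed sample: UTF-8 bytes for the two characters U+65B0 U+6D6A.
utf8_bytes <- as.raw(c(0xe6, 0x96, 0xb0, 0xe6, 0xb5, 0xaa))
s <- rawToChar(utf8_bytes)          # encoding is not marked at this point

# Misreading: treat every byte as one Latin-1 character. This is the kind
# of mismatch that yields strings like the ones in the transcript.
wrong <- iconv(s, from = "latin1", to = "UTF-8")

# Matching the declared charset: mark the very same bytes as UTF-8.
right <- s
Encoding(right) <- "UTF-8"

n_wrong <- nchar(wrong, type = "chars")  # counts one character per byte
n_right <- nchar(right, type = "chars")  # counts one character per 3-byte sequence
```

If this guess applies to the session above, the analogous change would be to pass .encoding = "UTF-8" to getURL() and encoding = "UTF-8" to htmlTreeParse(). That has not been checked against the site.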
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.