Tal,

OK, let me clarify my understanding. The original and decoded files are both text, encoded in UTF-8. In the original file, there are HTML `entities' that represent Hebrew characters. In the decoded file, the entities have been converted to UTF-8 characters. The question is how to convert these entities within R. This is not the same as converting between character encodings; otherwise iconv() might offer a solution.
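To make the distinction concrete, here is a minimal base-R sketch of decoding decimal numeric entities (the &#1513; form). The function name html_decode_decimal is hypothetical, not taken from any package, and the sketch assumes only decimal entities matter; named entities such as &amp; and hexadecimal ones such as &#x5E9; are left untouched.

```r
# Hypothetical helper (not from any package): replace decimal numeric
# HTML entities such as "&#1513;" with the characters they stand for.
html_decode_decimal <- function(x) {
  # Locate every "&#<digits>;" in each element of x.
  m <- gregexpr("&#[0-9]+;", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(ent) {
    # Drop everything except the digits, leaving the Unicode code point.
    code <- as.integer(gsub("[^0-9]", "", ent))
    # multiple = TRUE returns one UTF-8 string per code point.
    intToUtf8(code, multiple = TRUE)
  })
  x
}
```

Applied to the string 'data="&#1513;&#1500;&#1493;&#1501;"', this should return the same string with the four entities replaced by the corresponding Hebrew letters. Whether they display correctly still depends on the terminal and locale.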
I'll have a look around to find a solution, and I hope others will too.
My first idea is to check RCurl, XML, and the related utils::URLdecode.
If there really is no existing solution, I think it might be worthwhile
to look at how PHP and Python do it (and maybe borrow some code :) ).

-Matt

On Thu, 2010-12-09 at 14:27 -0500, Tal Galili wrote:
> Hi Matt,
> Thanks for having a look at this.
> I just spent some time looking around and couldn't find any R function
> to decode decimal HTML code.
>
> Do you (or someone else on the list) know how to program this sort of
> thing? (Is there a formula for the translation?)
>
> p.s.:
> For it to work on my end I added the encoding parameter:
> readLines("http://biostatmatt.com/temp/Hebrew-decoded", warn=FALSE,
>           encoding="UTF-8")
>
> p.p.s.: The Hebrew word I used means "peace"
>
> Cheers,
> Tal
>
> ----------------Contact Details:----------------
> Contact me: tal.gal...@gmail.com | 972-52-7275845
> Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew)
> | www.r-statistics.com (English)
> ------------------------------------------------
>
> On Thu, Dec 9, 2010 at 8:38 PM, Matt Shotwell <shotw...@musc.edu> wrote:
> Tal,
>
> It looks like the data you received has HTML special hex characters.
> That is, '&#1513;' is just an ASCII HTML representation of a hex
> character. It's not encoded in a special manner.
>
> The trick is to replace the HTML-encoded hex character with its binary
> representation, or "decode" the character. I don't know of any R
> function that does this, but there are web services, for example:
> http://www.hashemian.com/tools/html-url-encode-decode.php
>
> I decoded your file using this service and posted it on my website.
> You can see the difference by running:
>
> readLines("http://biostatmatt.com/temp/Hebrew-original", warn=FALSE)
>
> readLines("http://biostatmatt.com/temp/Hebrew-decoded", warn=FALSE)
>
> The second should display the Hebrew characters correctly (it does in
> my terminal). The next thing to think about is how to automate this in
> R without using the web service... We may need to write an HTMLDecode
> function if there isn't one already.
>
> By the way, what's the Hebrew text in English?
>
> Best,
> Matt
>
> On Thu, 2010-12-09 at 12:21 -0500, Tal Galili wrote:
> > I am bumping this question in the hopes that someone might be able to
> > advise.
> > This Hebrew and R business is not as smooth as I had hoped...
> >
> > Thanks,
> > Tal
> >
> > Older message:
> >
> > On Tue, Dec 7, 2010 at 2:30 PM, Tal Galili <tal.gal...@gmail.com> wrote:
> >
> > > Hello all,
> > >
> > > # I am trying to read the text in this URL:
> > > u <- "http://google.com/complete/search?output=toolbar&q=%d7%a9%d7%9c%d7%95%d7%9d"
> > > # By using this command:
> > > readLines(u)
> > >
> > > And no matter what variation I tried, I keep getting this output:
> > > [1] "<?xml version=\"1.0\"?><toplevel><CompleteSuggestion><suggestion
> > > data=\"&#1513;&#1500;&#1493;&#1501;\"/>< (etc...)
> > >
> > > Instead of this output:
> > > <?xml version="1.0"?><toplevel><CompleteSuggestion><suggestion data="שלום"/>
> > > <num_queries int="16800000"/></CompleteSuggestion><CompleteSuggestion><suggestion
> > > data="שלום חנוך"/><num_queries int="232000"/></CompleteSuggestion>
> > > <CompleteSuggestion><suggestion data="שלום עליכם"/
> > > (etc....)
> > >
> > > I tried:
> > > readLines(u, encoding= "latin1")
> > > readLines(u, encoding= "UTF-8")
> > > And also changing Sys.setlocale:
> > > Sys.setlocale("LC_ALL", "Hebrew") # must be done for Hebrew to work.
> > > Sys.setlocale("LC_ALL", "English") # must be done for Hebrew to work.
> > >
> > > Are there any more options I could try to get this text properly
> > > encoded?
> > >
> > > Thanks!
> > > Tal
> >
> --
> Matthew S. Shotwell
> Graduate Student
> Division of Biostatistics and Epidemiology
> Medical University of South Carolina

--
Matthew S. Shotwell
Graduate Student
Division of Biostatistics and Epidemiology
Medical University of South Carolina

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.