On Sun, 12 Feb 2012 16:24:58 -0200, Nilza BARROS wrote:
> I really appreciate your help. I definitely need a reusable program since
> I have been asking someone to extract these data from the Internet
> every day. That's the reason why I am trying to write a program to do that.
> Regarding the URL I sent, I have just realized that although I had written
> the one for only one worksheet (PLANILHA2), when I copy it to my browser
> it shows the link with both worksheets.
>
> I am going to read about the RCurl and XML libraries but I hope you can help me
> too.

Hi again, Nilza.

I looked over this to see if there was some simpler way of doing it; I couldn't find one. The main issue I see is that this is "HTML" generated from Excel. That means it has a lot of "features" for navigation, formatting, and such built into the page that make it a pain to parse. I tried parsing it with both the XML R package (look at the htmlTreeParse() function) and with non-R tools like Scrapy and lxml in Python.

If you read this post after you've had a look at the XML package, my next explanation will make more sense. The underlying problem is that the DOM you're analyzing has child nodes that are either generated or repopulated from the Excel data dumped onto the file system. Walking this DOM requires not only an XML parser, but perhaps even a JavaScript engine to resolve some of the nodes before you get to the information you want. That's why doing it from the main page is such a nasty problem.

My suggestion, at this time, would be to focus on seeing if you can parse the individual sub-sheets. Although you may have to load each one manually, their DOM appears simpler, with less cruft specific to JavaScript/CSS/Excel/Internet Explorer. Being cleaner, they should be easier to parse with XML or any other tool.

I'll take another look at the individual sheets to check whether they do in fact have a simpler document model.

Cheers!
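P.S. To make the sub-sheet suggestion a bit more concrete, here is a minimal sketch of pulling table rows out of Excel-style HTML. It is in Python and uses only the standard library's html.parser; in R, htmlTreeParse() from the XML package would be the starting point instead. Everything named here is an assumption for illustration: SAMPLE_SHEET is made-up markup, not copied from the real sub-sheets, and TableRows and parse_sheet are hypothetical helper names. It also assumes the data sits in plain <tr>/<td> cells, which still needs to be checked against the actual files.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        # Excel-specific attributes (class=xl24, x:num, ...) are ignored.
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_sheet(html_text):
    """Return the table rows found in html_text as lists of strings."""
    parser = TableRows()
    parser.feed(html_text)
    parser.close()
    return parser.rows


# Made-up stand-in for one sub-sheet; the real files will differ.
SAMPLE_SHEET = """
<html><body>
<table x:str border=0 style='border-collapse:collapse'>
 <tr height=17><td class=xl24>Station</td><td class=xl24>Temp</td></tr>
 <tr height=17><td class=xl25>A001</td><td class=xl25 x:num>23.5</td></tr>
</table>
</body></html>
"""

rows = parse_sheet(SAMPLE_SHEET)
for row in rows:
    print(row)
```

If the sub-sheets really are this simple, each row comes back as a list of cell strings that can be written out as CSV and read into R. If the cells turn out to be filled in by script, this approach will return empty rows and something heavier is needed.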
pr3d4t0r

-- 
pr3d4t0r at #R, ##java, #awk, #python
irc.freenode.net

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.