On Sun, 12 Feb 2012 16:24:58 -0200, Nilza BARROS wrote: 

> I really appreciate your help. I definitely need a reusable program,
> since I have been asking someone to extract these data from the
> Internet every day. That's the reason why I am trying to write a
> program to do that.
> Regarding the URL I sent, I have just realized that although I had
> written the one for only one worksheet (PLANILHA2), when I copy it
> into my browser the link is shown with both worksheets.
>
> I am going to read about the RCurl and XML libraries, but I hope you
> can help me too.

Hi again, Nilza. 

I looked this over to see if there was some simpler way of doing it; I
couldn't find one.


The main issue I see is that this is "HTML" generated from Excel. That
means it's got a lot of "features" for navigation, formatting, and such
built into the markup that make it a pain to parse. I tried parsing it
both with the XML R package (look at the htmlTreeParse() function) and
with non-R tools like scrapy and lxml in Python.

If you read this post after you've had a peek at the XML package, my
next explanation will make more sense.

The main issue is that the DOM you're analyzing has child nodes that are
either generated or repopulated from the Excel data dumped onto the file
system. Walking this DOM requires not only an XML parser, but perhaps
even a JavaScript engine to resolve some of the nodes before you get the
information you want. That's why doing it from the main page is a nasty
problem.

My suggestion, at this time, would be to focus on seeing if you can
parse the individual sub-sheets. Although you may have to load each one
manually, their DOM appears simpler, with less crap specific to
JavaScript/CSS/Excel/Internet Explorer. Being cleaner, they should be
easier to parse with XML or any other tool. I'll take another look at
the individual sheets to check whether they do in fact have a simpler
document model.
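
For one sub-sheet, the parsing could look roughly like the R sketch
below. It is only a sketch: the inline HTML is a made-up stand-in for a
saved sheet (the real markup will differ), the column names are
invented, and I'm assuming the data sits in plain table/tr/td elements.
htmlParse(), getNodeSet(), xpathSApply() and xmlValue() all come from
the XML package.

```r
library(XML)  # assumes the XML package is installed

# Hypothetical stand-in for one exported sub-sheet; not the real page.
sheet_html <- '
<html><body>
<table>
  <tr><td>Station</td><td>Temp</td></tr>
  <tr><td>A</td><td>21.5</td></tr>
  <tr><td>B</td><td>19.0</td></tr>
</table>
</body></html>'

# Parse the HTML text into an internal document tree.
doc <- htmlParse(sheet_html, asText = TRUE)

# Collect every table row, then the text of each cell in that row.
rows  <- getNodeSet(doc, "//table//tr")
cells <- lapply(rows, function(tr) xpathSApply(tr, "./td", xmlValue))

# Stack the rows into a character matrix; the first row holds the names.
m <- do.call(rbind, cells)
sheet <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(sheet) <- m[1, ]

# Cell values arrive as text, so convert numeric columns explicitly.
sheet$Temp <- as.numeric(sheet$Temp)
print(sheet)
```

For a real sheet you would hand htmlParse() the saved file instead of
the inline string, and adjust the XPath to whatever the sheet's markup
turns out to be.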

Cheers! 

pr3d4t0r 

-- 
pr3d4t0r at #R, ##java, #awk, #python
irc.freenode.net
  

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
