On Wed, 4 Aug 2004, Perl wrote:
> I wrote some code to identify and print HTML tables below
Don't do that.
HTML is tremendously difficult to analyze properly with tools like regular expressions.
You're much, much better off using a proper parser library that can build up a tree model of the html that you can analyze as you like.
The standard libraries for this are probably HTML::Parser and HTML::TreeBuilder. You may also like HTML::TableContentParser.

  <http://search.cpan.org/~gaas/HTML-Parser-3.36/Parser.pm>
  <http://search.cpan.org/~sburke/HTML-Tree-3.18/lib/HTML/TreeBuilder.pm>
  <http://search.cpan.org/~sdrabble/HTML-TableContentParser-0.13/TableContentParser.pm>
This may point you in a useful direction:
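
    use strict;
    use warnings;
    use HTML::TableContentParser;

    my $p      = HTML::TableContentParser->new();
    my $html   = read_html_from_somewhere();   # placeholder: supply your own
    my $tables = $p->parse($html);

    # Walk every table, row, and cell, printing each cell's contents
    for my $t (@$tables) {
        for my $r (@{ $t->{rows} }) {
            print "Row: ";
            for my $c (@{ $r->{cells} }) {
                print "[$c->{data}] ";
            }
            print "\n";
        }
    }
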
Something like this should work even for godawful ms-html :-)
> The problem I am stuck with is that now I want to sort the tables based on a Priority (which range from 1-3). There may be several tables with the same priority numbers. An example of a Priority 3 would be:
> # extraordinarily ugly html omitted
> I need help in understanding the methodology in how to extract these 2 items and then sort the tables in Priority order (all the 1's, 2's and 3's).
It looks like HTML::TableContentParser makes sorting through the structure of the table pretty easy; HTML::Parser could go farther by reducing it down to just the printable text -- some combination of the two may be useful here.
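As a rough sketch of the "printable text" half of that: HTML::Parser can be given a text handler that collects only the text between tags. The printable_text helper and the sample cell below are made up for illustration, and I'm assuming each cell's {data} from HTML::TableContentParser still carries its inner markup, so check that against your real input.

```perl
use strict;
use warnings;
use HTML::Parser;

# Return only the printable text of an HTML fragment.
# (Hypothetical helper name, not part of any module.)
sub printable_text {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands each text chunk to the handler with entities decoded
        text_h => [ sub { $text .= $_[0] }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;
    $text =~ s/\s+/ /g;     # collapse runs of whitespace
    $text =~ s/^ | $//g;    # trim a leading or trailing space
    return $text;
}

# Made-up cell contents in the style of Word-generated HTML
my $cell = '<p class="MsoNormal"><span style="font-size:10pt">'
         . 'Priority: <b>3</b></span></p>';
print printable_text($cell), "\n";
```

Run over each cell's data, something like this should leave you with short plain strings that are much easier to pick apart.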
Once you've stripped out all the junk (all the span tags, the paragraph tags, the "<o:p></o:p>" type debris, etc), you just need to convert the html structure into some kind of populated data structure.
You didn't give enough of the html to suggest what the rest of the table is structured like -- it was really just one big hairy table cell -- so it's hard to guess how the other pieces fit together.
Can you post a simpler example of what the table is built like, e.g.:

    +------------+-------+---------------+----------------+
    | priority 1 | field | another field | some more      |
    +------------+-------+---------------+----------------+
    | priority 3 | field | any data here | other things   |
    +------------+-------+---------------+----------------+
    | priority 2 | field | stuff stuff   | whatever       |
    +------------+-------+---------------+----------------+

Or is it more complicated than that?
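If it really is about that simple, the sorting part could look roughly like the following. This is only a sketch under assumptions: the $tables structure is hand-built to mimic what HTML::TableContentParser returns for the layout above, the cell data is assumed to be plain text already, and priority_of and sort_rows_by_priority are names I made up.

```perl
use strict;
use warnings;

# Hand-built stand-in for the parser's output (assumption: one table,
# priority in the first cell of each row, cell data already plain text).
my $tables = [
    {
        rows => [
            { cells => [ { data => 'priority 1' }, { data => 'field' },
                         { data => 'another field' }, { data => 'some more' } ] },
            { cells => [ { data => 'priority 3' }, { data => 'field' },
                         { data => 'any data here' }, { data => 'other things' } ] },
            { cells => [ { data => 'priority 2' }, { data => 'field' },
                         { data => 'stuff stuff' }, { data => 'whatever' } ] },
        ],
    },
];

# Pull the priority number out of a row's first cell.
# Rows without a recognizable priority get 99 so they sort last.
sub priority_of {
    my ($row) = @_;
    my $first = $row->{cells}[0]{data};
    $first = '' unless defined $first;
    return $first =~ /priority\s*:?\s*([1-3])/i ? $1 : 99;
}

# Flatten all rows, decorate each with [priority, original index, row],
# sort on priority with the index as tiebreak, then undecorate.
sub sort_rows_by_priority {
    my ($tables) = @_;
    my @rows = map { @{ $_->{rows} || [] } } @$tables;
    my $i = 0;
    return map  { $_->[2] }
           sort { $a->[0] <=> $b->[0] or $a->[1] <=> $b->[1] }
           map  { [ priority_of($_), $i++, $_ ] } @rows;
}

for my $row (sort_rows_by_priority($tables)) {
    print join(' | ', map { defined $_->{data} ? $_->{data} : '' }
                      @{ $row->{cells} }), "\n";
}
```

That prints the priority 1 row first, then 2, then 3; rows that share a priority keep their original order because of the index tiebreak. If each table carries its own priority instead, the same decorate-and-sort idea applies to @$tables with a different extractor.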
-- Chris Devers [EMAIL PROTECTED] http://devers.homeip.net:8080/blog/
np: 'Lujon' by Henry Mancini from 'The Best Of Mancini'