Hi all,

I have to parse several thousand HTML files, so I'd like to use a proper
HTML parser rather than my own regexes. The HTML I am parsing is quite
complex, so I need your help. First of all, is HTML::Tree a good and
fast module?

I ask because I am not sure about speed. If I have to search for elements using
if( my $h = $tree->look_down('_tag', 'sometag') ) { }
is that not slow?

When I dumped the tree with Data::Dumper, a 300 KB HTML file produced a
13 MB dump...

OK, now to the problem. The HTML looks like this:

<table width="600%" border="3" align="center" cellspacing="2" cellpadding="2" 
bgcolor='#eeffff'>
 <tr>
   <td align="left" valign="top" width="20%"> <span 
class="tl">TEST:&nbsp;</span></td>
   <td align="left" width="80%"><table width="100%" border="0">
   <tr>
    <td width="67%"> <span class='ra'>  Vysoká </span> <span class='ra'>  9 
</span><br> <span class='ra'>  Bratislava </span> <span class='ra'>  810 00 
</span><br></td>
    <td width="33%" valign='top'>&nbsp; <span class='ra'>something</span></td>
  </tr>
  </table><table width="100%" border="0">
   <tr>
   <td width="67%"> <span class='ro'>  Nám. SNP </span> <span class='ro'>  15 
</span><br> <span class='ro'>  Bratislava </span> <span class='ro'>  810 00 
</span><br></td>
   <td width="33%" valign='top'>&nbsp; <span class='ro'>something</span></td>
  </tr>
  </table><table width="100%" border="0">
   <tr>
   <td width="67%"> <span class='ro'>  Bratislava </span><br></td>
   <td width="33%" valign='top'>&nbsp; <span class='ro'>something</span></td>
  </tr>
  </table></td>
</tr>
</table>

(I hope it displays OK; if not, see http://www.2ge.us/perl/html.txt ).

Nearly the whole HTML is full of tables of this kind. How do I extract
the values from them? I need to check whether a span has class="tl" and
its text matches /TEST:/i; if so, I want all the values up to the end of
the whole table. Would someone be so kind as to give me some help?
Hint: the table always contains one class='ra' entry and zero or more
class='ro' entries.
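
To make the goal concrete, here is a minimal sketch of what I imagine the
extraction could look like. It is not tested against the real files. It
assumes HTML::TreeBuilder (from the HTML::Tree distribution) is installed
and relies on its look_down, look_up and as_text methods. The name
values_for_label is one I made up, and the sample HTML is a cut-down copy
of the table above.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Cut-down copy of the table above (sample data only).
my $html = <<'HTML';
<table>
 <tr>
  <td><span class="tl">TEST:&nbsp;</span></td>
  <td><table><tr>
    <td><span class='ra'> Bratislava </span> <span class='ra'> 810 00 </span><br></td>
    <td>&nbsp; <span class='ra'>something</span></td>
   </tr></table><table><tr>
    <td><span class='ro'> Nam. SNP </span> <span class='ro'> 15 </span><br></td>
    <td>&nbsp; <span class='ro'>other</span></td>
   </tr></table></td>
 </tr>
</table>
HTML

# Hypothetical helper (my own name, not part of HTML::Tree): returns
# [class, text] pairs for every 'ra'/'ro' span in the table that holds
# a class="tl" label matching $label_re.
sub values_for_label {
    my ($tree, $label_re) = @_;

    # In scalar context look_down returns the first matching element.
    my $label = $tree->look_down(
        _tag  => 'span',
        class => 'tl',
        sub { $_[0]->as_text =~ $label_re },
    ) or return;

    # Nearest enclosing table of the label, i.e. the outer table.
    my $table = $label->look_up(_tag => 'table') or return;

    my @found;
    # In list context look_down returns all matches in document order.
    for my $span ($table->look_down(_tag => 'span', class => qr/^r[ao]$/)) {
        my $text = $span->as_text;
        $text =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        push @found, [ $span->attr('class'), $text ];
    }
    return @found;
}

my $tree   = HTML::TreeBuilder->new_from_content($html);
my @values = values_for_label($tree, qr/TEST:/i);
print "$_->[0]: $_->[1]\n" for @values;
$tree->delete;    # free the tree explicitly when done
```

This only handles the first matching label in a file. For the real files I
guess I would need to loop over every class="tl" span instead.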

thanks for any help!

--

 --. ,--  ,-     ICQ: 7552083      \|||/    `//EB: www.2ge.us
,--' |  - |--    IRC: [2ge]        (. .)    ,\\SN: 2ge!2ge_us
`====+==+=+===~  ~=============-o00-(_)-00o-================~
John Tesh might drive (John says ride) a Celica.
 



