On 25/07/2011 21:17, Jeffrey Joh wrote:
> 
> Hello, I'm trying to parse HTML files.  I want to extract values from
> tables (1) and from text fields (2).  (1)<tr><td><img
> src="/image.gif" alt="" width="1" height="1" border="0"></td></tr>
>
> <tr>
>   <td align="right" valign="top"><b>Floor plan:</b></td>
>   <td>
>     Ranch #1</td>
> </tr>   (2)
> <input type="text" name="date_constructed" id="date_constructed" 
> value="04/01/2004" size="10" disabled>  I would want to retrieve the floor 
> plan (Ranch #1) and the date constructed (04/01/2004) from each HTML file 
> (along with many other text boxes).  What is an easy way of doing that? Jeff  
>                                   

Hello Jeff

I am unclear what you want to do. The HTML fragments you have shown are
syntactically incorrect, and in any case are irrelevant out of the
context of a complete HTML document.

However I think I can help a little. The HTML::TreeBuilder module will
build an HTML::Element object for you that you can navigate, modify, and
extract data from. It is very forgiving of incorrect syntax, and will
try to build a complete HTML document from any fragment that you offer it.

The program below seems to do what you want, but without testing against
the complete data that you are dealing with I cannot vouch for its
correctness. In particular you should add checks to verify that the HTML
you are working with looks as you expect it to. I have written a couple
of such checks, but only you can improve on them, since only you have
the complete data.
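
As an illustration of the kind of check I mean, here is a minimal sketch
of a helper that insists on exactly one matching element. The name
find_unique and the inline sample HTML are invented for this example;
only look_down, attr, new_from_content and delete come from the
HTML::TreeBuilder / HTML::Element modules.

```perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical helper (not part of HTML::TreeBuilder): return the single
# element matching the look_down criteria, or die if there are none or
# more than one.
sub find_unique {
  my ($tree, %criteria) = @_;
  my @found = $tree->look_down(%criteria);
  die "Expected exactly one match, found " . scalar(@found) . "\n"
    unless @found == 1;
  return $found[0];
}

# Sample fragment for demonstration only
my $tree = HTML::TreeBuilder->new_from_content(
  '<input type="text" id="date_constructed" value="04/01/2004">'
);

my $input = find_unique($tree, _tag => 'input', id => 'date_constructed');
print $input->attr('value'), "\n";

$tree->delete;
```

Dying early on a duplicate or missing element is usually easier to
diagnose than silently taking the first match from a page whose layout
has changed.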

HTH,

Rob


use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_file(*DATA);

print "Working from HTML:\n\n";
print $tree->as_HTML(undef, '  '), "\n\n";

# Find an <input> element with an 'id' attribute of 'date_constructed'
# (there should be only one). The date required comes from the 'value'
# attribute of that element.
#
my $date_input = $tree->look_down(
  _tag => 'input',
  id   => 'date_constructed',
)
or die "No construction date";
my $plan_date = $date_input->attr('value');

# Now look up the tree to the containing <tr> element, and find its previous
# sibling <tr> which contains the floor plan text in the second <td> child
# element
#
my $date_tr = $date_input->look_up(_tag => 'tr')
  or die "Date field is not inside a table row";
my $plan_tr = $date_tr->left;
die "No floor plan row" unless ref $plan_tr;
my @tds = $plan_tr->look_down(_tag => 'td');
die "Unexpected format" unless @tds == 2;

my $plan_text = $tds[1]->as_trimmed_text;

print "Plan found: $plan_text on $plan_date\n";

__DATA__
<tr>
 <td align="right" valign="top"><b>Floor plan:</b></td>
 <td>
   Ranch #1  </td> 
</tr>
<input type="text" name="date_constructed" id="date_constructed" 
value="04/01/2004" size="10" disabled>

**OUTPUT**

Working from HTML:

<html>
  <head>
  </head>
  <body>
    <table>
      <tr>
        <td align="right" valign="top"><b>Floor plan:</b></td>
        <td> Ranch #1 </td>
      </tr>
      <tr>
        <td><input disabled id="date_constructed" name="date_constructed" 
size="10" type="text" value="04/01/2004" /></td>
      </tr>
    </table>
  </body>
</html>

Plan found: Ranch #1 on 04/01/2004
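
Since you mention many other text boxes, the same look_down call can
collect them all in one pass. This is only a sketch under assumptions:
the helper name text_field_values and the second sample field ("builder")
are made up for illustration, and it matches only inputs that carry an
explicit type="text" attribute.

```perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical helper: map each named <input type="text"> to its value.
# Inputs without a 'name' attribute are skipped; a missing 'value'
# is recorded as an empty string.
sub text_field_values {
  my ($tree) = @_;
  my %value_of;
  for my $input ($tree->look_down(_tag => 'input', type => 'text')) {
    my $name = $input->attr('name');
    next unless defined $name;
    $value_of{$name} = $input->attr('value') // '';
  }
  return \%value_of;
}

# Sample data: the first field is from your post, the second is invented
my $tree = HTML::TreeBuilder->new_from_content(<<'HTML');
<input type="text" name="date_constructed" value="04/01/2004" disabled>
<input type="text" name="builder" value="Acme">
HTML

my $fields = text_field_values($tree);
print "$_ = $fields->{$_}\n" for sort keys %$fields;

$tree->delete;
```

Keying the results by the 'name' attribute means a new text box on the
page needs no change to the code, although you would still want checks
for the fields you depend on.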



