On 25/07/2011 21:17, Jeffrey Joh wrote:
>
> Hello, I'm trying to parse HTML files. I want to extract values from
> tables (1) and from text fields (2).
>
> (1)
> <tr><td><img src="/image.gif" alt="" width="1" height="1" border="0"></td></tr>
>
> <tr>
> <td align="right" valign="top"><b>Floor plan:</b></td>
> <td>
> Ranch #1</td>
> </tr>
>
> (2)
> <input type="text" name="date_constructed" id="date_constructed"
> value="04/01/2004" size="10" disabled>
>
> I would want to retrieve the floor plan (Ranch #1) and the date
> constructed (04/01/2004) from each HTML file (along with many other
> text boxes). What is an easy way of doing that?
>
> Jeff
>

Hello Jeff,

I am unclear about what you want to do. The HTML fragments you have shown are syntactically incorrect, and in any case are irrelevant outside the context of a complete HTML document. However, I think I can help a little.

The HTML::TreeBuilder module will build an HTML::Element object for you that you can navigate, modify, and extract data from. It is very forgiving of incorrect syntax, and will try to build a complete HTML document from any fragment that you offer it.

The program below seems to do what you want, but without testing against the complete data that you are dealing with I cannot vouch for its correctness. In particular, you should add checks to verify that the HTML you are working with looks as you expect it to. I have written a couple of such checks, but only you can improve on those.

HTH,

Rob

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_file(*DATA);

print "Working from HTML:\n\n";
print $tree->as_HTML(undef, ' '), "\n\n";

# Find an <input> element with an 'id' attribute of 'date_constructed'
# (there should be only one). The date required comes from the 'value'
# attribute of that element.
#
my $date_tr = $tree->look_down(
  _tag => 'input',
  id => 'date_constructed',
) or die "No construction date";
my $plan_date = $date_tr->attr('value');

# Now look up the tree to the containing <tr> element, and find its previous
# sibling <tr> which contains the floor plan text in the second <td> child
# element
#
my $plan_tr = $date_tr->look_up(_tag => 'tr')->left;
my @tds = $plan_tr->look_down(_tag => 'td');
die "Unexpected format" unless @tds == 2;
my $plan_text = $tds[1]->as_trimmed_text;

print "Plan found: $plan_text on $plan_date\n";

__DATA__
<tr>
<td align="right" valign="top"><b>Floor plan:</b></td>
<td>
Ranch #1
</td>
</tr>

<input type="text" name="date_constructed" id="date_constructed"
value="04/01/2004" size="10" disabled>

**OUTPUT**

Working from HTML:

<html>
 <head>
 </head>
 <body>
  <table>
   <tr>
    <td align="right" valign="top"><b>Floor plan:</b></td>
    <td> Ranch #1 </td>
   </tr>
   <tr>
    <td><input disabled id="date_constructed" name="date_constructed" size="10" type="text" value="04/01/2004" /></td>
   </tr>
  </table>
 </body>
</html>

Plan found: Ranch #1 on 04/01/2004
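
Since the question mentions "many other text boxes", here is a minimal sketch of a more general variant. It finds the floor plan by its label text rather than by its position relative to the <input>, and gathers every text <input> into a hash keyed on its 'name' attribute. The sample markup, the %field hash and the exact label string 'Floor plan:' are assumptions made for illustration, so adjust them to the real pages.

```perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical sample markup, modelled on the fragments in the question.
my $html = <<'END_HTML';
<html><body>
<table>
<tr>
<td align="right" valign="top"><b>Floor plan:</b></td>
<td>
Ranch #1</td>
</tr>
</table>
<input type="text" name="date_constructed" id="date_constructed"
value="04/01/2004" size="10" disabled>
</body></html>
END_HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Find the <td> whose trimmed text is exactly the label (assumed wording).
my $label_td = $tree->look_down(
  _tag => 'td',
  sub { $_[0]->as_trimmed_text eq 'Floor plan:' },
) or die "No 'Floor plan:' label found";

# In list context right() returns all following siblings; keep the first
# one that is a <td> element (plain text nodes are not references).
my ($value_td) = grep { ref $_ and $_->tag eq 'td' } $label_td->right;
die "No cell after the 'Floor plan:' label" unless $value_td;
my $plan = $value_td->as_trimmed_text;

# Collect every named text <input> as name => value.
my %field;
for my $input ($tree->look_down(_tag => 'input', type => 'text')) {
  my $name = $input->attr('name');
  next unless defined $name;
  $field{$name} = $input->attr('value');
}

print "Floor plan: $plan\n";
print "$_ = ", $field{$_} // '', "\n" for sort keys %field;

$tree->delete;
```

Matching on the label text means the lookup does not depend on where the date field happens to sit in the page, but it will fail if the label wording differs between files, so the die checks are worth keeping.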