On Mon, 2001-11-12 at 16:31, Steve Tattersall wrote:
> For example I want to extract the line: (see the html code below)
>  GB 0152 MSS.126/NUDL
> 
> and also the title which is:
> 
> National Union of Dock, Riverside and General Workers in Great
> Britain and Ireland
> 
> Does anyone know how to go about this, please? I would be extremely grateful.

We've had a regexp answer, but for readability I'd use the
HTML::TokeParser module.  It'd work like this.

use HTML::TokeParser;

# Prep an object.  $html contains the html to parse.
my $p = HTML::TokeParser->new( \$html ) or die "Can't create parser: $!";

# Find an <a> tag, and get everything inside it up to </a>.
$p->get_tag("a");
my $reference = $p->get_trimmed_text("/a");

# From there, find a </b> tag, and snarf everything up to <br>.
# (The return value of get_tag isn't needed, so there's no second
# "my $token", which would mask the first under "use warnings".)
$p->get_tag("/b");
my $title = $p->get_trimmed_text("br");

You'll have some small tidying up to do on both, but it's a /much/ more
readable (and maintainable) way of parsing the HTML.
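
For completeness, here is a rough, self-contained sketch of the same idea
wrapped in a sub. The sample markup in $html and the name extract_record
are assumptions made up for illustration (the real page isn't reproduced
here), so adjust the tag names to whatever your HTML actually looks like.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical markup, invented for this sketch; the real page may differ.
my $html = <<'HTML';
<p><a href="/record/1">GB 0152 MSS.126/NUDL</a><br>
<b>Title:</b> National Union of Dock, Riverside and General Workers in Great Britain and Ireland<br></p>
HTML

# extract_record is an illustrative helper, not part of HTML::TokeParser.
# It returns (reference, title); either may be undef if its tag is missing.
sub extract_record {
    my ($html) = @_;
    my $p = HTML::TokeParser->new( \$html )
        or die "Can't create parser: $!";

    my ( $reference, $title );

    # Text between the first <a> and its </a>.
    if ( $p->get_tag("a") ) {
        $reference = $p->get_trimmed_text("/a");
    }

    # Text between the next </b> and the following <br>.
    if ( $p->get_tag("/b") ) {
        $title = $p->get_trimmed_text("br");
    }

    return ( $reference, $title );
}

my ( $reference, $title ) = extract_record($html);
print "Reference: $reference\n" if defined $reference;
print "Title: $title\n"         if defined $title;
```

Checking the return value of get_tag before reading the text means a page
without the expected tags gives you undef rather than stray text from
somewhere else in the document.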

Hope this helps, (from one Manchester perl bod to another ;-)

~C.

-- 
$a="printf.net"; Chris Ball | chris@void.$a | www.$a | finger: chris@$a
         "In the beginning there was nothing, which exploded."          

