On Sat, 24 Jan 2004, Marcelo wrote:

>  Which regular expression would you use to remove the <title> and 
> </title> from a line like this one:
> 
> <title>Here goes a webpage's title</title>
> 
> Thanks a lot in advance.
> 

Did you want that _exact_ input? I.e. always <title>...</title>? If so, 
that's rather easy.

$line =~ s/<title>(.*)<\/title>/$1/;
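
As a minimal sketch of how that substitution could be used in a script 
(strip_title is a hypothetical helper name, not something from the 
question), assuming at most one <title>...</title> pair per line, since 
.* is greedy and would run to the last </title> on the line:

```perl
use strict;
use warnings;

# strip_title is a hypothetical helper name used only for illustration.
sub strip_title {
    my ($line) = @_;
    # Capture everything between the tags and keep only the capture.
    # Lines without a <title>...</title> pair are returned unchanged.
    $line =~ s/<title>(.*)<\/title>/$1/;
    return $line;
}

print strip_title("<title>Here goes a webpage's title</title>"), "\n";
```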

Now, if you want the more general form of <any_tag>...</any_tag>, that is 
removing paired HTML tags, that's more difficult. Luckily, there is an 
example in "Programming Perl, 3rd Edition" on page 184 which is close.

$line =~ s/<([^>]+)>([^<]*)<\/\1>/$2/;

In sort-of English, this says:

Match a <, then everything up to the next >, calling what is between them 
$1 (or \1). Now, match everything up to the next < and call it $2. Now 
match a < followed by a /, followed by what you matched first (in $1 or 
\1), followed by a >. Now, replace all of that with $2.

A problem with this pattern is that it would not work as you would 
like it to with input such as:

<title><B>Title</B></title>

You'd end up removing the <B> and </B>, but leaving the <title> and 
</title>. Of course, if your desire is to remove all paired HTML tags, 
then put this in a loop until it no longer matches.
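
A rough sketch of that loop, assuming the pattern is written with negated 
character classes so that each pass removes one pair whose contents hold 
no further tag (strip_paired_tags is a hypothetical helper name):

```perl
use strict;
use warnings;

# strip_paired_tags is a hypothetical helper name used only for illustration.
sub strip_paired_tags {
    my ($line) = @_;
    # Each successful substitution removes one <tag>...</tag> pair whose
    # contents contain no "<". Repeat until the pattern no longer matches.
    1 while $line =~ s/<([^>]+)>([^<]*)<\/\1>/$2/;
    return $line;
}

print strip_paired_tags("<title><B>Title</B></title>"), "\n";
```

The first pass removes <B> and </B>, the second removes <title> and 
</title>, and the third finds nothing left to match, so the loop stops.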

HTH,

--
Maranatha!
John McKown


