On Tue, 24 Apr 2001 [EMAIL PROTECTED] wrote:

> Thank you very much for all the great help I received earlier on extracting
> numbers from text.  There is only one thing I forgot about:
> 
> Some of the files have HTML headers and footers.  I don't want any data
> inside HTML brackets.   I tried:
> 
>   s/<*>//g; 
> 
> I don't understand why this doesn't work. 

This regular expression reads, "Zero or more '<' characters followed
by exactly one '>' character".  The '*' means, "Match whatever came
just before this '*' zero or more times", and here that is the '<'.
So your substitution only removes the '>' characters (plus any '<'
sitting directly in front of one).  If you want to strip out HTML
tags, you'll want to do this:

s/<.*?>//g;

There are two differences here.  The first is the addition of the period
("."), which matches any single character except a newline (add the /s
modifier if you want it to match newlines too).  The second is the
question mark after the '*', which means, "Grab the fewest number of
characters I can that still satisfies the match".  This prevents the
substitution from gobbling up everything between the very first < and the
very last >.
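
To see the difference on a concrete string, here is a small sketch.
The sample text and the variable names are made up for illustration;
they are not from your files.

```perl
use strict;
use warnings;

# Hypothetical sample line, invented for this example.
my $html = '<b>42</b> apples and <i>7</i> pears';

# Original attempt: zero or more '<' followed by one '>'.
# Only the '>' characters are removed; the tags stay broken in place.
(my $broken = $html) =~ s/<*>//g;

# Greedy: .* runs from the first '<' to the last '>' on the line.
(my $greedy = $html) =~ s/<.*>//g;

# Non-greedy: .*? stops at the nearest '>', so each tag goes separately.
(my $lazy = $html) =~ s/<.*?>//g;

print "broken: $broken\n";
print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

This should print "<b42</b apples and <i7</i pears" for the original
attempt, " pears" for the greedy version, and "42 apples and 7 pears"
for the non-greedy one.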

Another way to write this is:

s/<[^>]+>//g;

Here, you say, match any string that:

1. Starts with < (This is represented by < in the expression)
2. Continues with one or more characters that are not > (This is
represented by [^>]+ in the expression)
3. Ends with one > (This is represented by > in the expression)

This is the way I'd do it; it's much more straightforward (at least to
me).
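
One practical difference between the two patterns shows up when a tag
is split across lines.  The sketch below uses made-up input and assumes
the whole file has been read into one string:

```perl
use strict;
use warnings;

# Hypothetical input with an opening tag that spans two lines.
my $html = "<a\nhref=\"x.html\">12</a> items";

# '.' does not match "\n" by default, so the opening tag survives.
(my $dot = $html) =~ s/<.*?>//g;

# [^>] matches any character except '>', newline included.
(my $class = $html) =~ s/<[^>]+>//g;

print "dot:   $dot\n";
print "class: $class\n";
```

With the dot version only the closing </a> is removed; with the
character class both tags go and you are left with "12 items".  Neither
pattern copes with a literal > inside a quoted attribute value, so
treat both as good enough for simple headers and footers only.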

- D

<[EMAIL PROTECTED]>
