On Tue, 24 Apr 2001 [EMAIL PROTECTED] wrote:
> Thank you very much for all the great help I received earlier on extracting
> numbers from text. There is only one thing I forgot about:
>
> Some of the files have HTML headers and footers. I don't want any data
> inside HTML brackets. I tried:
>
> s/<*>//g;
>
> I don't understand why this doesn't work.
This regular expression reads, "Zero or more '<' characters followed
by exactly one '>' character". On typical HTML that just deletes each '>'
and leaves the rest of the tag behind. The '*' actually means, "The
character that came before this '*' will be matched zero or more times". If you
want to strip out HTML tags, you'll want to do this:
s/<.*?>//g;
There are two differences here. The first is the addition of the period
("."), which means, match any one character (except a newline, unless you
add the /s modifier). The second
is the question mark after the '*' which means, "Grab the fewest number of
characters I can that still satisfies the match". This prevents the
substitution from gobbling up everything between the very first < and the
very last >.
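
If you want to see the difference for yourself, here's a quick sketch. The
sample line and the strip_with() helper are made up for illustration; they
aren't from your files.

```perl
use strict;
use warnings;

# Hypothetical sample line, not taken from the original poster's data.
my $line = '<b>Total:</b> 42 <i>units</i>';

# Illustrative helper: apply s/PATTERN//g to a copy of the text.
sub strip_with {
    my ($text, $re) = @_;
    $text =~ s/$re//g;
    return $text;
}

my $broken     = strip_with($line, qr/<*>/);    # only removes the '>' characters
my $greedy     = strip_with($line, qr/<.*>/);   # first '<' through the LAST '>'
my $non_greedy = strip_with($line, qr/<.*?>/);  # each tag on its own

printf "%-12s [%s]\n", 'broken:',     $broken;
printf "%-12s [%s]\n", 'greedy:',     $greedy;
printf "%-12s [%s]\n", 'non-greedy:', $non_greedy;
```

On this sample, the first pattern leaves "<bTotal:</b 42 <iunits</i", the
greedy one leaves nothing at all, and the non-greedy one leaves
"Total: 42 units".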
Another way to write this is:
s/<[^>]+>//g;
Here, you say, "Match any string that:"
1. Starts with < (This is represented by < in the expression)
2. Continues with one or more non-">" characters (This is represented by
[^>]+ in the expression)
3. Ends with one > (This is represented by the final > in the expression)
This is the way I'd do it; it's much more straightforward (at least to
me).
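
One practical difference, sketched below with made-up input: [^>] will
happily match a newline, while "." won't unless you add /s. That matters if
a tag is split across lines and you've read the whole file into one string
(which is an assumption on my part about how you're reading your files).

```perl
use strict;
use warnings;

# Hypothetical input: the <a ...> tag is split across two lines.
my $html = "<p>Price: 19</p>\n<a\n href=\"x.html\">Qty: 3</a>\n";

# Copy the string, then strip tags from the copy with each variant.
(my $dot_version   = $html) =~ s/<.*?>//g;    # '.' stops at "\n", split tag survives
(my $dot_s_version = $html) =~ s/<.*?>//gs;   # /s lets '.' match "\n" as well
(my $class_version = $html) =~ s/<[^>]+>//g;  # [^>] matches "\n" with no modifier

# Pull the numbers out of the cleaned-up text.
my @numbers = $class_version =~ /(\d+)/g;
print "@numbers\n";
```

Neither pattern copes with a ">" inside a quoted attribute value, so for
messy HTML a real parser is the safer bet.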
- D
<[EMAIL PROTECTED]>