Hi Som,

It looks like you want a minimal (non-greedy) match, so you can change the substitution like this:
    $line =~ s/(<.*>)?//;   =>   $line =~ s/<.*?>//g;

In your version '.*' is greedy, so it runs to the last '>' on the line. The trailing '?' also makes the whole group optional, so the pattern can match nothing at the start of the line, and without /g only that first match is replaced. That is why a tag is removed only when it sits at the very start of a line.

There is still a problem, though: some of your tags have '<' and '>' on different lines. You can read the whole file into one variable and do the replacement once over all of it (with the /s modifier, so that '.' also matches newlines).

Du Zheng

2012/4/14 Somu <som....@gmail.com>

> *Hi all,
> I was trying to strip off all html tags and the special characters from a
> html file using regex.
> my code is as follows..
> ________________________________________________________________________*
> use strict;
> use warnings;
>
> sub strip_html{
> my $line = shift;
> #something wrong in the following 2 lines.. :(
> $line =~ s/(<.*>)?//;
> $line =~ s/(&.*;)?//;
> print $line;
> }
>
> open my $file, "dict/index.html";
> while(my $line = <$file>){
> chomp $line ;
> strip_html $line;
> }
> close $file;
> *____________________________________
>
> the input file is..
> ______________________________________
>
> *
> <HTML>
> <HEAD>
> <TITLE>The Online Plain Text English Dictionary</TITLE>
> <META NAME="AUTHOR" CONTENT="Ralph S. Sutherland"><META NAME="DESCRIPTION"
> CONTENT="The Online Plain Text English Dictionary, OPTED. v0.01a"><META
> NAME="KEYWORDS" CONTENT="dictionary, words, wordlist, english,
> definitions, plain text"><META NAME="DISTRIBUTION" CONTENT="global">
> </HEAD>
> <BODY BGCOLOR="#FFFFFF">
>
> <H2>The Online Plain Text English Dictionary</H2>
>
> <P>OPTED is a public domain English word list dictionary, based on
> the public domain portion of "The Project Gutenberg Etext of
> Webster's Unabridged Dictionary" which is in turn based on the 1913
> US Webster's Unabridged Dictionary. (See
> <A HREF="http://www.promo.net/pg/">Project Gutenburg</A>)</P>
>
> <P>This version has been extensively stripped down and set out as one
> definition per line. All the Gutenburg EText tags and formatting have
> been removed by computer. Version 0.03 is a new processing of v0.47
> of the websters dictionary and it has considerably fewer errors.
> Also the definition limit of 255 chars has been removed to give full
> justice of some of the more majestic of the originals. Some important
> errors in the parts-of-speech fields have been corrected and a lot of
> inflections/ alternatives and plurals that were missed due to
> software bugs in v0.01 and 0.02 are now included properly.</P>
>
> <P>The dictionary is set as a word list with definitions, using
> minimal HTML markup. The only tags used are <P>, <B> and
> <I> and these serve to delimit the words (between <B>s)
> the part of speech or type (between <I>s) and the definitions
> (The rest of the line). Each entry is between a <P>, </P>
> pair. This will facilitate computer processing. The text was prepared
> on a macintosh, so the few accented and umlauted characters appear
> best if your browser is set to Western MacRoman encoding (this should
> look like an umlauted u : <B>ü</B>). If this causes problems and
> I get enough responses, I'll look into producing an ISO 8859-1 or
> even a Unicode version.</P>
>
> <P>The dictionary can be viewed (with patience) directly online as
> you would a normal printed dictionary, otherwise a user can download
> the pages and process them in some way on their own machine. The only
> usage conditions are that if the material is redistributed, the
> content (not the formatting) remain in the public domain (ie free)
> and that the content be easily accessible in non-encoded plain text
> format at no cost to the end user. The origin of the content should
> also be acknowledged, including OPTED, Project Gutenburg and the 1913
> edition of Webster's Unabridged Dictionary. If the material is to be
> included in commercial products, Project Gutenburg should be
> contacted first.
> There are no restrictions for personal or research
> uses of this material.</P>
>
> <H3>OPTED v0.03 by Letter(size)</H3>
>
> <P>Second computer generated version:</P>
>
> <P><A HREF="v003/wb1913_a.html">A(1.1M)</A> |
> <A HREF="v003/wb1913_b.html">B(1005k)</A> |
> <A HREF="v003/wb1913_c.html">C(1.6M)</A> |
> <A HREF="v003/wb1913_d.html">D(1M)</A> |
> <A HREF="v003/wb1913_e.html">E(809k)</A> |
> <A HREF="v003/wb1913_f.html">F(784k)</A> |
> <A HREF="v003/wb1913_g.html">G(564k)</A> |
> <A HREF="v003/wb1913_h.html">H(686k)</A> |
> <A HREF="v003/wb1913_i.html">I(833k)</A> |
> <A HREF="v003/wb1913_j.html">J(172k)</A> |
> <A HREF="v003/wb1913_k.html">K(172k)</A> |
> <A HREF="v003/wb1913_l.html">L(637k)</A> |
> <A HREF="v003/wb1913_m.html">M(931k)</A> |
> <A HREF="v003/wb1913_n.html">N(343k)</A> |
> <A HREF="v003/wb1913_o.html">O(466k)</A> |
> <A HREF="v003/wb1913_p.html">P(1.5M)</A> |
> <A HREF="v003/wb1913_q.html">Q(147k)</A> |
> <A HREF="v003/wb1913_r.html">R(931k)</A> |
> <A HREF="v003/wb1913_s.html">S(2.1M)</A> |
> <A HREF="v003/wb1913_t.html">T(1005k)</A> |
> <A HREF="v003/wb1913_u.html">U(343k)</A> |
> <A HREF="v003/wb1913_v.html">V(343k)</A> |
> <A HREF="v003/wb1913_w.html">W(490k)</A> |
> <A HREF="v003/wb1913_x.html">X(49k)</A> |
> <A HREF="v003/wb1913_y.html">Y(74k)</A> |
> <A HREF="v003/wb1913_z.html">Z(74k)</A></P>
>
> <H3>OPTED v0.03 by Archive</H3>
>
> <P>OPTED v0.03can also be downloaded as a
> <A HREF="optedv003.hqx">large stuffit/binhex encoded archive</A>.
> (7.1MB/ 19.1M unpacked)</P>
>
>
> </BODY>
> </HTML>
> *____________________________
>
> And the output..
>
> ___________________________
>
> *
> <TITLE>The Online Plain Text English Dictionary</TITLE>OPTED is a public
> domain English word list dictionary, based onthe public domain portion of
> "The Project Gutenberg Etext ofWebster's Unabridged Dictionary" which is in
> turn based on the 1913US Webster's Unabridged Dictionary.
> (SeeThis version
> has been extensively stripped down and set out as onedefinition per line.
> All the Gutenburg EText tags and formatting havebeen removed by computer.
> Version 0.03 is a new processing of v0.47of the websters dictionary and it
> has considerably fewer errors. Alsothe definition limit of 255 chars has
> been removed to give fulljustice of some of the more majestic of the
> originals. Some importanterrors in the parts-of-speech fields have been
> corrected and a lot ofinflections/ alternatives and plurals that were
> missed due tosoftware bugs in v0.01 and 0.02 are now included
> properly.</P>The dictionary is set as a word list with definitions,
> usingminimal HTML markup. The only tags used are <P>, <B>
> ands)the part of speech or type (between <I>s) and the
> definitions(The rest of the line). Each entry is between a <P>,
> </P>pair. This will facilitate computer processing. The text was
> preparedon a macintosh, so the few accented and umlauted characters
> appearbest if your browser is set to Western MacRoman encoding (this
> shouldlook like an umlauted u : <B>ü</B>). If this causes problems
> andI get enough responses, I'll look into producing an ISO 8859-1 oreven a
> Unicode version.</P>The dictionary can be viewed (with patience) directly
> online asyou would a normal printed dictionary, otherwise a user can
> downloadthe pages and process them in some way on their own machine. The
> onlyusage conditions are that if the material is redistributed, thecontent
> (not the formatting) remain in the public domain (ie free)and that the
> content be easily accessible in non-encoded plain textformat at no cost to
> the end user. The origin of the content shouldalso be acknowledged,
> including OPTED, Project Gutenburg and the 1913edition of Webster's
> Unabridged Dictionary. If the material is to beincluded in commercial
> products, Project Gutenburg should becontacted first.
> There are no restrictions for personal or researchuses of this material.</P> | | | | | |
> | | | | | | | | | | | | | | | | | | |OPTED v0.03can also be downloaded as
> a.(7.1MB/ 19.1M unpacked)</P>
>
>
> *________________________*
> *Why is there no uniformity in matching?? What is wrong here??*
> *_____________________________________*
> *
> *
> *
> *
> *Thanks in advance,*
> *Somu.*
>
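
P.S. In case it helps, here is a rough sketch of the slurp-and-replace idea from the top of this reply. It is only one way to do it and rests on a few assumptions of mine: the /s modifier is added so that '.' can cross line breaks inside a tag, the entity pattern &#?\w+; is my guess at what "special characters" means here, and the sample string is made up (modelled on your input file) so that the script runs without dict/index.html.

```perl
use strict;
use warnings;

# Remove tags and character entities from a string holding a whole document.
sub strip_html {
    my ($html) = @_;
    # Non-greedy: stop at the first '>' after each '<'.
    # /g removes every tag; /s lets '.' match newlines, so a tag that is
    # split across two lines is removed as well.
    $html =~ s/<.*?>//gs;
    # Assumption: entities look like &name; or &#number; and are dropped.
    $html =~ s/&#?\w+;//g;
    return $html;
}

# Hypothetical sample, modelled on the input file quoted above.
my $sample = <<'END';
<P>OPTED is a <B>public
domain</B> word list &amp; dictionary. <A
HREF="http://www.promo.net/pg/">Project Gutenburg</A></P>
END

print strip_html($sample);

# For the real file, slurp it in one go instead of reading line by line:
#   open my $fh, '<', 'dict/index.html' or die "Cannot open file: $!";
#   my $html = do { local $/; <$fh> };   # undef $/ reads the whole file
#   close $fh;
#   print strip_html($html);
```

Note that this drops entities rather than decoding them, and a regex like this can still be fooled by real-world HTML (comments, or a '>' inside an attribute value). For anything beyond a simple file like yours, a parser module from CPAN such as HTML::Parser is probably the safer route.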