Hi Som,

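It looks like you want a minimal (non-greedy) match, so try changing the
substitution like this:

$line =~ s/(<.*>)?//;
=>
$line =~ s/<.*?>//g;

The original pattern has three problems. First, .* is greedy, so it runs to
the last '>' on the line. Second, the trailing '?' makes the whole group
optional, so on a line that does not start with '<' the pattern matches the
empty string at position 0 and removes nothing. Third, without /g only the
first match is replaced. The same applies to your s/(&.*;)?// line, which is
why the output looks inconsistent.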
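
To see the difference between the two substitutions, here is a small sketch.
The sub names and the two sample lines are made up for illustration; only the
two patterns come from this thread.

```perl
use strict;
use warnings;

# Original pattern: the group is optional and greedy, and there is no /g.
sub strip_original {
    my ($line) = @_;
    $line =~ s/(<.*>)?//;
    return $line;
}

# Suggested pattern: non-greedy, applied to every tag on the line.
sub strip_lazy {
    my ($line) = @_;
    $line =~ s/<.*?>//g;
    return $line;
}

# Made-up sample lines, modelled on the quoted input file.
for my $line ('<P>OPTED is a <B>public</B> word list</P>',
              'are now included properly.</P>') {
    print "original : [", strip_original($line), "]\n";
    print "suggested: [", strip_lazy($line), "]\n";
}
```

For the first sample the original pattern deletes the whole line, because the
greedy match runs from the first '<' to the last '>'. For the second sample it
deletes nothing, because the line does not start with '<'. The suggested
pattern removes only the tags in both cases.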

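But there is still a problem: some of your tags have '<' and '>' on
different lines (the META tags, for example), so a line-by-line substitution
cannot match them. You can read the whole file into one variable and do the
replacement in a single pass. In that case add the /s modifier so that '.'
also matches newlines.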
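
A minimal sketch of that approach, assuming the file fits in memory. The sub
names, the entity pattern, and taking the file name from the command line are
my own choices for illustration, not part of your script.

```perl
use strict;
use warnings;

# Read a whole file into one string ("slurp" it).
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;    # undefine the record separator so <$fh> returns everything
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Remove tags and character entities from a string.
sub strip_html {
    my ($text) = @_;
    $text =~ s/<.*?>//gs;      # /s lets . match newlines, so split tags go too
    $text =~ s/&#?\w+;//g;     # drops entities such as &lt; or &uuml;
    return $text;
}

# Only runs when a file name is given, e.g. perl strip.pl dict/index.html
print strip_html(slurp_file($ARGV[0])) if @ARGV;
```

This is still a regex-based shortcut: a '>' inside an attribute value would
end the match early, and the entities are deleted rather than decoded. If you
want the characters themselves, decode_entities from HTML::Entities may be a
better fit.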

Du Zheng

2012/4/14 Somu <som....@gmail.com>

> Hi all,
> I was trying to strip off all HTML tags and the special characters from an
> HTML file using regex.
> My code is as follows:
> ________________________________________________________________________
> use strict;
> use warnings;
>
> sub strip_html{
> my $line = shift;
> #something wrong in the following 2 lines.. :(
> $line =~ s/(<.*>)?//;
> $line =~ s/(&.*;)?//;
> print $line;
> }
>
> open my $file, "dict/index.html";
> while(my $line = <$file>){
> chomp $line ;
> strip_html $line;
> }
> close $file;
> ____________________________________
>
> The input file is:
> ______________________________________
>
> <HTML>
> <HEAD>
>   <TITLE>The Online Plain Text English Dictionary</TITLE>
> <META NAME="AUTHOR" CONTENT="Ralph S. Sutherland"><META NAME="DESCRIPTION"
> CONTENT="The Online Plain Text English Dictionary, OPTED. v0.01a"><META
> NAME="KEYWORDS" CONTENT="dictionary, words,  wordlist, english,
> definitions, plain text"><META NAME="DISTRIBUTION" CONTENT="global">
> </HEAD>
> <BODY BGCOLOR="#FFFFFF">
>
> <H2>The Online Plain Text English Dictionary</H2>
>
> <P>OPTED is a public domain English word list dictionary, based on
> the public domain portion of "The Project Gutenberg Etext of
> Webster's Unabridged Dictionary" which is in turn based on the 1913
> US Webster's Unabridged Dictionary. (See
> <A HREF="http://www.promo.net/pg/">Project Gutenburg</A>)</P>
>
> <P>This version has been extensively stripped down and set out as one
> definition per line. All the Gutenburg EText tags and formatting have
> been removed by computer. Version 0.03 is a new processing of v0.47
> of the websters dictionary and it has considerably fewer errors. Also
> the definition limit of 255 chars has been removed to give full
> justice of some of the more majestic of the originals. Some important
> errors in the parts-of-speech fields have been corrected and a lot of
> inflections/ alternatives and plurals that were missed due to
> software bugs in v0.01 and 0.02 are now included properly.</P>
>
> <P>The dictionary is set as a word list with definitions, using
> minimal HTML markup. The only tags used are &lt;P&gt;, &lt;B&gt; and
> &lt;I&gt; and these serve to delimit the words (between &lt;B&gt;s)
> the part of speech or type (between &lt;I&gt;s) and the definitions
> (The rest of the line). Each entry is between a &lt;P&gt;, &lt;/P&gt;
> pair. This will facilitate computer processing. The text was prepared
> on a macintosh, so the few accented and umlauted characters appear
> best if your browser is set to Western MacRoman encoding (this should
> look like an umlauted u : <B>&uuml;</B>). If this causes problems and
> I get enough responses, I'll look into producing an ISO 8859-1 or
> even a Unicode version.</P>
>
> <P>The dictionary can be viewed (with patience) directly online as
> you would a normal printed dictionary, otherwise a user can download
> the pages and process them in some way on their own machine. The only
> usage conditions are that if the material is redistributed, the
> content (not the formatting) remain in the public domain (ie free)
> and that the content be easily accessible in non-encoded plain text
> format at no cost to the end user. The origin of the content should
> also be acknowledged, including OPTED, Project Gutenburg and the 1913
> edition of Webster's Unabridged Dictionary. If the material is to be
> included in commercial products, Project Gutenburg should be
> contacted first. There are no restrictions for personal or research
> uses of this material.</P>
>
> <H3>OPTED v0.03 by Letter(size)</H3>
>
> <P>Second computer generated version:</P>
>
> <P><A HREF="v003/wb1913_a.html">A(1.1M)</A> |
> <A HREF="v003/wb1913_b.html">B(1005k)</A> |
> <A HREF="v003/wb1913_c.html">C(1.6M)</A> |
> <A HREF="v003/wb1913_d.html">D(1M)</A> |
> <A HREF="v003/wb1913_e.html">E(809k)</A> |
> <A HREF="v003/wb1913_f.html">F(784k)</A> |
> <A HREF="v003/wb1913_g.html">G(564k)</A> |
> <A HREF="v003/wb1913_h.html">H(686k)</A> |
> <A HREF="v003/wb1913_i.html">I(833k)</A> |
> <A HREF="v003/wb1913_j.html">J(172k)</A> |
> <A HREF="v003/wb1913_k.html">K(172k)</A> |
> <A HREF="v003/wb1913_l.html">L(637k)</A> |
> <A HREF="v003/wb1913_m.html">M(931k)</A> |
> <A HREF="v003/wb1913_n.html">N(343k)</A> |
> <A HREF="v003/wb1913_o.html">O(466k)</A> |
> <A HREF="v003/wb1913_p.html">P(1.5M)</A> |
> <A HREF="v003/wb1913_q.html">Q(147k)</A> |
> <A HREF="v003/wb1913_r.html">R(931k)</A> |
> <A HREF="v003/wb1913_s.html">S(2.1M)</A> |
> <A HREF="v003/wb1913_t.html">T(1005k)</A> |
> <A HREF="v003/wb1913_u.html">U(343k)</A> |
> <A HREF="v003/wb1913_v.html">V(343k)</A> |
> <A HREF="v003/wb1913_w.html">W(490k)</A> |
> <A HREF="v003/wb1913_x.html">X(49k)</A> |
> <A HREF="v003/wb1913_y.html">Y(74k)</A> |
> <A HREF="v003/wb1913_z.html">Z(74k)</A></P>
>
> <H3>OPTED v0.03 by Archive</H3>
>
> <P>OPTED v0.03can also be downloaded as a
> <A HREF="optedv003.hqx">large stuffit/binhex encoded archive</A>.
> (7.1MB/ 19.1M unpacked)</P>
>
>
> </BODY>
> </HTML>
> ____________________________
>
> And the output:
>
> ___________________________
>
>   <TITLE>The Online Plain Text English Dictionary</TITLE>OPTED is a public
> domain English word list dictionary, based onthe public domain portion of
> "The Project Gutenberg Etext ofWebster's Unabridged Dictionary" which is in
> turn based on the 1913US Webster's Unabridged Dictionary. (SeeThis version
> has been extensively stripped down and set out as onedefinition per line.
> All the Gutenburg EText tags and formatting havebeen removed by computer.
> Version 0.03 is a new processing of v0.47of the websters dictionary and it
> has considerably fewer errors. Alsothe definition limit of 255 chars has
> been removed to give fulljustice of some of the more majestic of the
> originals. Some importanterrors in the parts-of-speech fields have been
> corrected and a lot ofinflections/ alternatives and plurals that were
> missed due tosoftware bugs in v0.01 and 0.02 are now included
> properly.</P>The dictionary is set as a word list with definitions,
> usingminimal HTML markup. The only tags used are &lt;P&gt;, &lt;B&gt;
> ands)the part of speech or type (between &lt;I&gt;s) and the
> definitions(The rest of the line). Each entry is between a &lt;P&gt;,
> &lt;/P&gt;pair. This will facilitate computer processing. The text was
> preparedon a macintosh, so the few accented and umlauted characters
> appearbest if your browser is set to Western MacRoman encoding (this
> shouldlook like an umlauted u : <B>&uuml;</B>). If this causes problems
> andI get enough responses, I'll look into producing an ISO 8859-1 oreven a
> Unicode version.</P>The dictionary can be viewed (with patience) directly
> online asyou would a normal printed dictionary, otherwise a user can
> downloadthe pages and process them in some way on their own machine. The
> onlyusage conditions are that if the material is redistributed, thecontent
> (not the formatting) remain in the public domain (ie free)and that the
> content be easily accessible in non-encoded plain textformat at no cost to
> the end user. The origin of the content shouldalso be acknowledged,
> including OPTED, Project Gutenburg and the 1913edition of Webster's
> Unabridged Dictionary. If the material is to beincluded in commercial
> products, Project Gutenburg should becontacted first. There are no
> restrictions for personal or researchuses of this material.</P> | | | | | |
> | | | | | | | | | | | | | | | | | | |OPTED v0.03can also be downloaded as
> a.(7.1MB/ 19.1M unpacked)</P>
>
>
> ________________________
> Why is there no uniformity in matching? What is wrong here?
> _____________________________________
>
> Thanks in advance,
> Somu.
>
