Hi all,

I am trying to strip all HTML tags and special characters from an HTML file using regexes. My code is as follows:
________________________________________________________________________

use strict;
use warnings;
sub strip_html {
    my $line = shift;
    # something wrong in the following 2 lines.. :(
    $line =~ s/(<.*>)?//;
    $line =~ s/(&.*;)?//;
    print $line;
}

open my $file, "dict/index.html";
while (my $line = <$file>) {
    chomp $line;
    strip_html $line;
}
close $file;
____________________________________
The input file is:
______________________________________

<HTML>
<HEAD>
<TITLE>The Online Plain Text English Dictionary</TITLE>
<META NAME="AUTHOR" CONTENT="Ralph S. Sutherland"><META NAME="DESCRIPTION" CONTENT="The Online Plain Text English Dictionary, OPTED. v0.01a"><META NAME="KEYWORDS" CONTENT="dictionary, words, wordlist, english, definitions, plain text"><META NAME="DISTRIBUTION" CONTENT="global">
</HEAD>
<BODY BGCOLOR="#FFFFFF">
<H2>The Online Plain Text English Dictionary</H2>
<P>OPTED is a public domain English word list dictionary, based on
the public domain portion of "The Project Gutenberg Etext of
Webster's Unabridged Dictionary" which is in turn based on the 1913
US Webster's Unabridged Dictionary. (See
<A HREF="http://www.promo.net/pg/">Project Gutenburg</A>)</P>
<P>This version has been extensively stripped down and set out as one
definition per line. All the Gutenburg EText tags and formatting have
been removed by computer. Version 0.03 is a new processing of v0.47
of the websters dictionary and it has considerably fewer errors. Also
the definition limit of 255 chars has been removed to give full
justice of some of the more majestic of the originals. Some important
errors in the parts-of-speech fields have been corrected and a lot of
inflections/ alternatives and plurals that were missed due to
software bugs in v0.01 and 0.02 are now included properly.</P>
<P>The dictionary is set as a word list with definitions, using
minimal HTML markup. The only tags used are <P>, <B> and
<I> and these serve to delimit the words (between <B>s)
the part of speech or type (between <I>s) and the definitions
(The rest of the line). Each entry is between a <P>, </P>
pair.
This will facilitate computer processing. The text was prepared
on a macintosh, so the few accented and umlauted characters appear
best if your browser is set to Western MacRoman encoding (this should
look like an umlauted u : <B>ü</B>). If this causes problems and
I get enough responses, I'll look into producing an ISO 8859-1 or
even a Unicode version.</P>
<P>The dictionary can be viewed (with patience) directly online as
you would a normal printed dictionary, otherwise a user can download
the pages and process them in some way on their own machine. The only
usage conditions are that if the material is redistributed, the
content (not the formatting) remain in the public domain (ie free)
and that the content be easily accessible in non-encoded plain text
format at no cost to the end user. The origin of the content should
also be acknowledged, including OPTED, Project Gutenburg and the 1913
edition of Webster's Unabridged Dictionary. If the material is to be
included in commercial products, Project Gutenburg should be
contacted first.
There are no restrictions for personal or research
uses of this material.</P>
<H3>OPTED v0.03 by Letter(size)</H3>
<P>Second computer generated version:</P>
<P><A HREF="v003/wb1913_a.html">A(1.1M)</A> |
<A HREF="v003/wb1913_b.html">B(1005k)</A> |
<A HREF="v003/wb1913_c.html">C(1.6M)</A> |
<A HREF="v003/wb1913_d.html">D(1M)</A> |
<A HREF="v003/wb1913_e.html">E(809k)</A> |
<A HREF="v003/wb1913_f.html">F(784k)</A> |
<A HREF="v003/wb1913_g.html">G(564k)</A> |
<A HREF="v003/wb1913_h.html">H(686k)</A> |
<A HREF="v003/wb1913_i.html">I(833k)</A> |
<A HREF="v003/wb1913_j.html">J(172k)</A> |
<A HREF="v003/wb1913_k.html">K(172k)</A> |
<A HREF="v003/wb1913_l.html">L(637k)</A> |
<A HREF="v003/wb1913_m.html">M(931k)</A> |
<A HREF="v003/wb1913_n.html">N(343k)</A> |
<A HREF="v003/wb1913_o.html">O(466k)</A> |
<A HREF="v003/wb1913_p.html">P(1.5M)</A> |
<A HREF="v003/wb1913_q.html">Q(147k)</A> |
<A HREF="v003/wb1913_r.html">R(931k)</A> |
<A HREF="v003/wb1913_s.html">S(2.1M)</A> |
<A HREF="v003/wb1913_t.html">T(1005k)</A> |
<A HREF="v003/wb1913_u.html">U(343k)</A> |
<A HREF="v003/wb1913_v.html">V(343k)</A> |
<A HREF="v003/wb1913_w.html">W(490k)</A> |
<A HREF="v003/wb1913_x.html">X(49k)</A> |
<A HREF="v003/wb1913_y.html">Y(74k)</A> |
<A HREF="v003/wb1913_z.html">Z(74k)</A></P>
<H3>OPTED v0.03 by Archive</H3>
<P>OPTED v0.03can also be downloaded as a
<A HREF="optedv003.hqx">large stuffit/binhex encoded archive</A>.
(7.1MB/ 19.1M unpacked)</P>
</BODY>
</HTML>
____________________________
And the output:
___________________________

<TITLE>The Online Plain Text English Dictionary</TITLE>OPTED is a public domain English word list dictionary, based onthe public domain portion of "The Project Gutenberg Etext ofWebster's Unabridged Dictionary" which is in turn based on the 1913US Webster's Unabridged Dictionary. (SeeThis version has been extensively stripped down and set out as onedefinition per line. All the Gutenburg EText tags and formatting havebeen removed by computer. 
Version 0.03 is a new processing of v0.47of the websters dictionary and it has considerably fewer errors. Alsothe definition limit of 255 chars has been removed to give fulljustice of some of the more majestic of the originals. Some importanterrors in the parts-of-speech fields have been corrected and a lot ofinflections/ alternatives and plurals that were missed due tosoftware bugs in v0.01 and 0.02 are now included properly.</P>The dictionary is set as a word list with definitions, usingminimal HTML markup. The only tags used are <P>, <B> ands)the part of speech or type (between <I>s) and the definitions(The rest of the line). Each entry is between a <P>, </P>pair. This will facilitate computer processing. The text was preparedon a macintosh, so the few accented and umlauted characters appearbest if your browser is set to Western MacRoman encoding (this shouldlook like an umlauted u : <B>ü</B>). If this causes problems andI get enough responses, I'll look into producing an ISO 8859-1 oreven a Unicode version.</P>The dictionary can be viewed (with patience) directly online asyou would a normal printed dictionary, otherwise a user can downloadthe pages and process them in some way on their own machine. The onlyusage conditions are that if the material is redistributed, thecontent (not the formatting) remain in the public domain (ie free)and that the content be easily accessible in non-encoded plain textformat at no cost to the end user. The origin of the content shouldalso be acknowledged, including OPTED, Project Gutenburg and the 1913edition of Webster's Unabridged Dictionary. If the material is to beincluded in commercial products, Project Gutenburg should becontacted first. There are no restrictions for personal or researchuses of this material.</P> | | | | | | | | | | | | | | | | | | | | | | | | |OPTED v0.03can also be downloaded as a.(7.1MB/ 19.1M unpacked)</P>
________________________

Why is the matching not uniform? Some tags are removed and others are left in place. What is wrong here?
_____________________________________

Thanks in advance,
Somu.
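
A possible explanation and fix, offered as a sketch under assumptions rather than a definitive answer. Three things in the two substitutions seem to combine to give the uneven result: there is no /g flag, so at most one match is replaced per line; the whole pattern is optional, so on a line that does not begin with "<" (or "&") it matches the empty string at the very start and removes nothing; and .* is greedy, so on a line that does begin with "<" it runs to the last ">" on that line and takes the text in between with it. The version below uses negated character classes with /g instead. The entity pattern and the choice to return the text instead of printing it are assumptions made for illustration, and a line-by-line regex will still miss tags that span several lines, so this is not a general HTML parser.

```perl
use strict;
use warnings;

# Hypothetical rewrite of strip_html: returns the cleaned text
# instead of printing it (an assumption, for easier reuse).
sub strip_html {
    my ($line) = @_;
    # Remove every tag on the line: '<', any run of non-'>' characters, '>'.
    $line =~ s/<[^>]*>//g;
    # Remove entities such as &amp; or &#252; (assumed form: '&', word chars or '#', ';').
    $line =~ s/&[#\w]+;//g;
    return $line;
}

# Small demonstration on an in-memory string instead of dict/index.html.
my $sample = 'word list (See <A HREF="http://www.promo.net/pg/">Project Gutenburg</A>)</P>';
print strip_html($sample), "\n";
```

In the original loop this would be called as print strip_html($line), "\n"; so that the line breaks removed by chomp are put back and words at line ends no longer run together.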