It doesn't work because you look for the opening <head> and the closing </head> tag within a single line, but you have split the text into separate lines, so <head> is probably on one line and </head> on another.
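
A minimal sketch of the difference, assuming a made-up page held in $html (the sample text is hypothetical, not the real Yahoo page). It applies the same pattern as the original script, once per line and once to the unsplit string:

```perl
use strict;
use warnings;

# Hypothetical sample page whose <head> block spans several lines.
my $html = "<html>\n<head>\n<title>This is Test File</title>\n</head>\n"
         . "<body>\nblah\n</body>\n</html>\n";

# Line by line: no single line holds both <head> and </head>,
# so the substitution never matches.
my @lines = split /\n/, $html;
my $per_line_hits = 0;
for my $line (@lines) {
    $per_line_hits++ if $line =~ s|<head>.*?</head>||s;
}

# Whole string: the /s modifier lets "." cross the newlines between the tags.
(my $whole = $html) =~ s|<head>.*?</head>||s;

print "per-line substitutions: $per_line_hits\n";
print $whole;
```

The /s modifier only helps when both tags are in the string being matched, which is why it has no effect inside the foreach loop.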
You can use something like this:

$html =~ s|<head[^>]*>.*?</head[^>]*>||si;

That way you don't need to check each line separately.

It works for the head tag, but it doesn't work for every tag, because you might have a page like this:

<table>
<tr><td>a1</td> <td>a2</td> <td>a3</td></tr>
<tr><td>b1</td> <td> <table> ... </table> </td> <td>b3</td></tr>
</table>

If you want to delete everything from <table> to its corresponding </table>, it is more complicated: that simple regular expression deletes only up to the first </table>, not up to the second one.

----- Original Message -----
From: "Sara" <[EMAIL PROTECTED]>
To: "Sara" <[EMAIL PROTECTED]>; "Hanson, Rob" <[EMAIL PROTECTED]>; "'Wiggins d'Anconia'" <[EMAIL PROTECTED]>
Cc: "beginperl" <[EMAIL PROTECTED]>
Sent: Wednesday, September 03, 2003 4:08 PM
Subject: [Addendum] Stripping HTML from a text file.

> Okay, when I finally implemented this, again it's not working ... any ideas?
> simple regex like s/<head>/<blah>/g; is working but not this one?
>
> ----------------------------------------------------
> #!/usr/bin/perl
>
> use LWP::Simple;
>
> print "Content-type: text/html\n\n";
>
> $url = 'http://yahoo.com';
>
> $html = get($url);
>
> @html = split(/\n/,$html);
>
> foreach $line(@html) {
>     chomp ($line);
>     $line =~ s|<head>.*?<\/head>||s;
>     print "$line\n";
> }
> #########################################
>
> Thanks for any input.
>
> Sara.
>
>
> ----- Original Message -----
> From: "Sara" <[EMAIL PROTECTED]>
> To: "Hanson, Rob" <[EMAIL PROTECTED]>; "'Wiggins d'Anconia'" <[EMAIL PROTECTED]>
> Cc: "beginperl" <[EMAIL PROTECTED]>
> Sent: Wednesday, September 03, 2003 4:34 PM
> Subject: Re: Stripping HTML from a text file.
>
>
> : Thanks a lot Hanson,
> :
> : It worked for me.
> :
> : Yep, you are right "The regex way is good for quick and dirty HTML work."
> :
> : and especially for the newbies like me :))
> :
> : Sara.
> : > : > : ----- Original Message ----- > : From: "Hanson, Rob" <[EMAIL PROTECTED]> > : To: "'Wiggins d'Anconia'" <[EMAIL PROTECTED]>; "'Sara'" > : <[EMAIL PROTECTED]> > : Cc: "beginperl" <[EMAIL PROTECTED]> > : Sent: Friday, September 05, 2003 5:55 AM > : Subject: RE: Stripping HTML from a text file. > : > : > : : > Or maybe I misunderstood the question > : : > : : Or maybe I did :) > : : > : : > HTML::TokeParser::Simple > : : > : : I agree... but only if you are looking for a strong permanant solution. > : The > : : regex way is good for quick and dirty HTML work. > : : > : : Sara, if you need to keep the <head> tags, then you could use this > : modified > : : version... > : : > : : # untested > : : $text = "..."; > : : $text =~ s|(<head>).*?(</head>)|$1$2|s; > : : > : : ...Or if you wanted to keep the <title> tag... > : : > : : # untested > : : $text = "..."; > : : $text =~ s|(<head>).*?<title>.*?</title>.*?(</head>)|$1$2$3|s; > : : > : : Rob > : : > : : -----Original Message----- > : : From: Wiggins d'Anconia [mailto:[EMAIL PROTECTED] > : : Sent: Thursday, September 04, 2003 8:48 PM > : : To: 'Sara' > : : Cc: beginperl > : : Subject: Re: Stripping HTML from a text file. > : : > : : > : : Won't this remove *everything* between the given tags? Or maybe I > : : misunderstood the question, I thought she wanted to remove the "code" > : : from all of the contents between two tags? > : : > : : Because of the complexity and variety of HTML code, the number of > : : different tags, etc. I would suggest using an HTML parsing module for > : : this task. HTML::TokeParser::Simple has worked very well for me in the > : : past. There are a number of examples available. If this is what you > : : want and you get stuck on the module then come back with questions. > : : There are also the base modules such as HTML::Parser, etc. that the one > : : previously mentioned builds on, among others check CPAN. 
> : :
> : : http://danconia.org
> : :
> : : Hanson, Rob wrote:
> : : > A simple regex will do the trick...
> : : >
> : : > # untested
> : : > $text = "...";
> : : > $text =~ s|<head>.*?</head>||s;
> : : >
> : : > Or something more generic...
> : : >
> : : > # untested
> : : > $tag = "head";
> : : > $text =~ s|<$tag[^>]*?>.*?</$tag>||s;
> : : >
> : : > This second one also allows for possible attributes in the start tag.
> : : > You may need more than this if the HTML isn't well formed, or if
> : : > there are extra spaces in your tags.
> : : >
> : : > If you want something for the command line you could do this...
> : : >
> : : > (Note: for *nix, needs modification for Win [untested])
> : : > perl -e '$x=join("",<>);$x=~s|<head>.*?</head>||s;print $x' myfile.html > newfile.html
> : : >
> : : > Rob
> : : >
> : : >
> : : > -----Original Message-----
> : : > From: Sara [mailto:[EMAIL PROTECTED]
> : : > Sent: Wednesday, September 03, 2003 6:32 AM
> : : > To: beginperl
> : : > Subject: Stripping HTML from a text file.
> : : >
> : : >
> : : > I have a couple of text files with html code in them.. e.g.
> : : >
> : : > ---------- Text File --------------
> : : > <html>
> : : > <head>
> : : > <title>This is Test File</title>
> : : > </head>
> : : > <body>
> : : > <font size=2 face=arial>This is the test file contents<br>
> : : > <p>
> : : > blah blah blah.........
> : : > </body>
> : : > </html>
> : : >
> : : > -----------------------------------------
> : : >
> : : > What I want to do is to remove/delete HTML code from the text file
> : : > from a certain tag up to a certain tag.
> : : >
> : : > For example; I want to delete the code completely that comes in between
> : : > <head> and </head> (including any style tags and embedded javascripts etc)
> : : >
> : : > Any ideas?
> : : >
> : : > Thanks in advance.
> : : >
> : : > Sara.
> : : >
> : :
> : : --
> : : To unsubscribe, e-mail: [EMAIL PROTECTED]
> : : For additional commands, e-mail: [EMAIL PROTECTED]
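
Coming back to the nested <table> case at the top of this message: one way to find the corresponding </table> is to count opening and closing tags instead of relying on a single non-greedy match. The following is only a sketch under assumptions. The helper name strip_first_element and the sample $page are made up for illustration, and it assumes the tag name never appears inside comments, scripts, or attribute values. For real-world HTML a parser such as HTML::TokeParser::Simple, as suggested earlier in the thread, is the safer choice.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): remove the first <$tag>...</$tag>
# element, counting nested opening and closing tags to find the matching close.
sub strip_first_element {
    my ($html, $tag) = @_;
    return $html unless $html =~ m{<\Q$tag\E\b[^>]*>}i;   # tag not present
    my $start = $-[0];        # offset where the outer opening tag begins
    pos($html) = $+[0];       # continue scanning just after that opening tag
    my $depth = 1;
    while ($html =~ m{<(/?)\Q$tag\E\b[^>]*>}gi) {
        $depth += $1 ? -1 : 1;   # a closing tag lowers the depth, an opening tag raises it
        if ($depth == 0) {
            substr($html, $start, pos($html) - $start) = '';
            return $html;
        }
    }
    return $html;             # unbalanced markup: leave the text untouched
}

# Made-up sample with one table nested inside another.
my $page = '<p>before</p><table><tr><td>a1</td><td><table><tr><td>inner</td>'
         . '</tr></table></td></tr></table><p>after</p>';

# The simple non-greedy regex stops at the first </table>.
(my $simple = $page) =~ s|<table[^>]*>.*?</table[^>]*>||si;

# The counting version removes the whole outer table.
my $balanced = strip_first_element($page, 'table');

print "simple:   $simple\n";
print "balanced: $balanced\n";
```

With this sample, the simple regex leaves the tail of the outer table behind, while the counting version removes the outer table as a whole.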