Okay, when I finally implemented this, it's again not working. Any ideas? A simple regex like s/<head>/<blah>/g; works, but this one does not:
----------------------------------------------------
#!/usr/bin/perl
use LWP::Simple;

print "Content-type: text/html\n\n";

$url = 'http://yahoo.com';
$html = get($url);
@html = split(/\n/,$html);
foreach $line(@html) {
    chomp ($line);
    $line =~ s|<head>.*?<\/head>||s;
    print "$line\n";
}
#########################################

Thanks for any input.

Sara.

----- Original Message -----
From: "Sara" <[EMAIL PROTECTED]>
To: "Hanson, Rob" <[EMAIL PROTECTED]>; "'Wiggins d'Anconia'" <[EMAIL PROTECTED]>
Cc: "beginperl" <[EMAIL PROTECTED]>
Sent: Wednesday, September 03, 2003 4:34 PM
Subject: Re: Stripping HTML from a text file.

: Thanks a lot Hanson,
:
: It worked for me.
:
: Yep, you are right "The regex way is good for quick and dirty HTML work."
:
: and especially for the newbies like me :))
:
: Sara.
:
:
: ----- Original Message -----
: From: "Hanson, Rob" <[EMAIL PROTECTED]>
: To: "'Wiggins d'Anconia'" <[EMAIL PROTECTED]>; "'Sara'" <[EMAIL PROTECTED]>
: Cc: "beginperl" <[EMAIL PROTECTED]>
: Sent: Friday, September 05, 2003 5:55 AM
: Subject: RE: Stripping HTML from a text file.
:
:
: : > Or maybe I misunderstood the question
: :
: : Or maybe I did :)
: :
: : > HTML::TokeParser::Simple
: :
: : I agree... but only if you are looking for a strong permanent solution. The
: : regex way is good for quick and dirty HTML work.
: :
: : Sara, if you need to keep the <head> tags, then you could use this modified
: : version...
: :
: : # untested
: : $text = "...";
: : $text =~ s|(<head>).*?(</head>)|$1$2|s;
: :
: : ...Or if you wanted to keep the <title> tag...
: :
: : # untested
: : $text = "...";
: : $text =~ s|(<head>).*?(<title>.*?</title>).*?(</head>)|$1$2$3|s;
: :
: : Rob
: :
: : -----Original Message-----
: : From: Wiggins d'Anconia [mailto:[EMAIL PROTECTED]
: : Sent: Thursday, September 04, 2003 8:48 PM
: : To: 'Sara'
: : Cc: beginperl
: : Subject: Re: Stripping HTML from a text file.
: :
: :
: : Won't this remove *everything* between the given tags? Or maybe I
: : misunderstood the question, I thought she wanted to remove the "code"
: : from all of the contents between two tags?
: :
: : Because of the complexity and variety of HTML code, the number of
: : different tags, etc. I would suggest using an HTML parsing module for
: : this task. HTML::TokeParser::Simple has worked very well for me in the
: : past. There are a number of examples available. If this is what you
: : want and you get stuck on the module then come back with questions.
: : There are also the base modules such as HTML::Parser, etc. that the one
: : previously mentioned builds on; among others, check CPAN.
: :
: : http://danconia.org
: :
: : Hanson, Rob wrote:
: : > A simple regex will do the trick...
: : >
: : > # untested
: : > $text = "...";
: : > $text =~ s|<head>.*?</head>||s;
: : >
: : > Or something more generic...
: : >
: : > # untested
: : > $tag = "head";
: : > $text =~ s|<$tag[^>]*?>.*?</$tag>||s;
: : >
: : > This second one also allows for possible attributes in the start tag. You
: : > may need more than this if the HTML isn't well formed, or if there are extra
: : > spaces in your tags.
: : >
: : > If you want something for the command line you could do this...
: : >
: : > (Note: for *nix, needs modification for Win [untested])
: : > perl -e '$x=join("",<>);$x=~s|<head>.*?</head>||s;print $x' myfile.html > newfile.html
: : >
: : > Rob
: : >
: : >
: : > -----Original Message-----
: : > From: Sara [mailto:[EMAIL PROTECTED]
: : > Sent: Wednesday, September 03, 2003 6:32 AM
: : > To: beginperl
: : > Subject: Stripping HTML from a text file.
: : >
: : >
: : > I have a couple of text files with html code in them.. e.g.
: : >
: : > ---------- Text File --------------
: : > <html>
: : > <head>
: : > <title>This is Test File</title>
: : > </head>
: : > <body>
: : > <font size=2 face=arial>This is the test file contents<br>
: : > <p>
: : > blah blah blah.........
: : > </body>
: : > </html>
: : >
: : > -----------------------------------------
: : >
: : > What I want to do is to remove/delete HTML code from the text file from a
: : > certain tag up to a certain tag.
: : >
: : > For example; I want to delete the code completely that comes in between
: : > <head> and </head> (including any style tags and embedded javascripts etc)
: : >
: : > Any ideas?
: : >
: : > Thanks in advance.
: : >
: : > Sara.
: : >
: :
: :
: : --
: : To unsubscribe, e-mail: [EMAIL PROTECTED]
: : For additional commands, e-mail: [EMAIL PROTECTED]
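
A likely reason the script at the top of this message has no effect: it splits the page on newlines and applies the substitution to one line at a time, so <head> and </head> are rarely in the same string and the /s modifier has nothing to span. Below is a minimal sketch of the whole-string approach suggested earlier in the thread. The helper name strip_element and the inline sample page are assumptions made for illustration (the sample stands in for what get($url) would return), and like any regex approach it is only suitable for quick and dirty work on reasonably well-formed HTML.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): remove the first
# <$tag ...>...</$tag> element, including everything between the tags.
sub strip_element {
    my ($html, $tag) = @_;
    # /s lets "." match newlines, /i tolerates <HEAD>, and \b keeps
    # "head" from matching "<header>".
    $html =~ s{<\Q$tag\E\b[^>]*>.*?</\Q$tag\E\s*>}{}si;
    return $html;
}

# Stand-in for the page that get($url) returns as ONE string.
my $html = join "\n",
    '<html>',
    '<head>',
    '<title>This is Test File</title>',
    '</head>',
    '<body>',
    'This is the test file contents',
    '</body>',
    '</html>';

# Apply the substitution to the whole document, not line by line.
print strip_element($html, 'head'), "\n";
```

In the original script this would mean dropping the split and the foreach loop and running the substitution once on $html before printing it.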