Thanks a lot, Hanson. It worked for me.
Yep, you are right: "The regex way is good for quick and dirty HTML work,"
especially for newbies like me :))

Sara.

----- Original Message -----
From: "Hanson, Rob" <[EMAIL PROTECTED]>
To: "'Wiggins d'Anconia'" <[EMAIL PROTECTED]>; "'Sara'" <[EMAIL PROTECTED]>
Cc: "beginperl" <[EMAIL PROTECTED]>
Sent: Friday, September 05, 2003 5:55 AM
Subject: RE: Stripping HTML from a text file.

: > Or maybe I misunderstood the question
:
: Or maybe I did :)
:
: > HTML::TokeParser::Simple
:
: I agree... but only if you are looking for a strong permanent solution. The
: regex way is good for quick and dirty HTML work.
:
: Sara, if you need to keep the <head> tags, then you could use this modified
: version...
:
: # untested
: $text = "...";
: $text =~ s|(<head>).*?(</head>)|$1$2|s;
:
: ...Or if you wanted to keep the <title> tag...
:
: # untested
: $text = "...";
: $text =~ s|(<head>).*?(<title>.*?</title>).*?(</head>)|$1$2$3|s;
:
: Rob
:
: -----Original Message-----
: From: Wiggins d'Anconia [mailto:[EMAIL PROTECTED]
: Sent: Thursday, September 04, 2003 8:48 PM
: To: 'Sara'
: Cc: beginperl
: Subject: Re: Stripping HTML from a text file.
:
:
: Won't this remove *everything* between the given tags? Or maybe I
: misunderstood the question; I thought she wanted to remove the "code"
: from all of the contents between two tags?
:
: Because of the complexity and variety of HTML code, the number of
: different tags, etc., I would suggest using an HTML parsing module for
: this task. HTML::TokeParser::Simple has worked very well for me in the
: past. There are a number of examples available. If this is what you
: want and you get stuck on the module, then come back with questions.
: There are also base modules such as HTML::Parser that the one
: previously mentioned builds on; check CPAN for others.
:
: http://danconia.org
:
: Hanson, Rob wrote:
: > A simple regex will do the trick...
: >
: > # untested
: > $text = "...";
: > $text =~ s|<head>.*?</head>||s;
: >
: > Or something more generic...
: >
: > # untested
: > $tag = "head";
: > $text =~ s|<$tag[^>]*?>.*?</$tag>||s;
: >
: > This second one also allows for possible attributes in the start tag. You
: > may need more than this if the HTML isn't well formed, or if there are extra
: > spaces in your tags.
: >
: > If you want something for the command line you could do this...
: >
: > (Note: for *nix, needs modification for Win [untested])
: > perl -e '$x=join("",<>);$x=~s|<head>.*?</head>||s;print $x' myfile.html > newfile.html
: >
: > Rob
: >
: >
: > -----Original Message-----
: > From: Sara [mailto:[EMAIL PROTECTED]
: > Sent: Wednesday, September 03, 2003 6:32 AM
: > To: beginperl
: > Subject: Stripping HTML from a text file.
: >
: >
: > I have a couple of text files with HTML code in them, e.g.
: >
: > ---------- Text File --------------
: > <html>
: > <head>
: > <title>This is Test File</title>
: > </head>
: > <body>
: > <font size=2 face=arial>This is the test file contents<br>
: > <p>
: > blah blah blah.........
: > </body>
: > </html>
: >
: > -----------------------------------------
: >
: > What I want to do is to remove/delete HTML code from the text file from a
: > certain tag up to a certain tag.
: >
: > For example, I want to completely delete the code that comes in between
: > <head> and </head> (including any style tags and embedded JavaScript, etc.)
: >
: > Any ideas?
: >
: > Thanks in advance.
: >
: > Sara.
: >
:
:
: --
: To unsubscribe, e-mail: [EMAIL PROTECTED]
: For additional commands, e-mail: [EMAIL PROTECTED]
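
For anyone finding this thread later, here is a minimal sketch of the two regex ideas discussed above, wrapped in subroutines so they can be tried on a sample file. The names strip_tag_block and keep_title_only are hypothetical (they are not from the thread), and the sample input is Sara's file with a <style> line added for illustration. The \b, /i and /g additions are assumptions about what is usually wanted, not part of Rob's original suggestions, and the approach is still only suitable for reasonably well-formed HTML.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): remove every
# <$tag ...>...</$tag> block from $text, including the tags themselves.
sub strip_tag_block {
    my ($text, $tag) = @_;
    my $t = quotemeta $tag;    # treat the tag name as literal text
    # \b stops "head" from matching <header>; /i ignores case,
    # /s lets . match newlines, /g removes every occurrence.
    $text =~ s{<$t\b[^>]*>.*?</$t\s*>}{}gis;
    return $text;
}

# Hypothetical helper: empty the first <head> block but keep its
# <title> element, using three capture groups for $1, $2 and $3.
sub keep_title_only {
    my ($text) = @_;
    $text =~ s{(<head\b[^>]*>).*?(<title\b[^>]*>.*?</title\s*>).*?(</head\s*>)}{$1$2$3}is;
    return $text;
}

# Sample input modelled on the file in the original question.
my $html = <<'END';
<html>
<head>
<title>This is Test File</title>
<style>p { color: red }</style>
</head>
<body>
<p>blah blah blah</p>
</body>
</html>
END

print strip_tag_block($html, 'head');
print keep_title_only($html);
```

The first print shows the document with the whole <head> block gone; the second shows <head> reduced to just the <title> element. For HTML that is not well formed, a parsing module such as HTML::TokeParser::Simple, as Wiggins suggested, is the safer route.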