RE: Stripping HTML from a text file.

Hanson, Rob Thu, 04 Sep 2003 17:57:21 -0700

> Or maybe I misunderstood the question

Or maybe I did :)

> HTML::TokeParser::Simple

I agree... but only if you are looking for a robust, permanent solution.  The
regex way is good for quick and dirty HTML work.

Sara, if you need to keep the <head> tags, then you could use this modified
version...

# untested
$text = "...";
$text =~ s|(<head>).*?(</head>)|$1$2|s;

...Or if you wanted to keep the <title> tag...

# untested
$text = "...";
$text =~ s|(<head>).*?(<title>.*?</title>).*?(</head>)|$1$2$3|s;
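
Here is a slightly fuller sketch of both substitutions, run against a sample modelled on the file from the original question. The helper names empty_head() and keep_title() and the extra <style> line are only assumptions for illustration. The keep-the-title version needs three capture groups (opening tag, title element, closing tag) so that $1$2$3 puts all three back. It also assumes the <title> really is inside the <head>, and like any regex approach it can trip on badly formed HTML.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample input modelled on the file from the original question;
# the <style> line is added here only to show that it gets removed.
my $html = <<'END_HTML';
<html>
    <head>
        <title>This is Test File</title>
        <style>body { color: black; }</style>
    </head>
<body>
<p>blah blah blah</p>
</body>
</html>
END_HTML

# Hypothetical helper: drop everything between <head> and </head>,
# keeping the tags themselves. \b stops "<head" from matching "<header".
sub empty_head {
    my ($text) = @_;
    $text =~ s|(<head\b[^>]*>).*?(</head>)|$1$2|si;
    return $text;
}

# Hypothetical helper: drop everything inside <head> except <title>.
# Group 1 = opening tag, group 2 = title element, group 3 = closing tag.
sub keep_title {
    my ($text) = @_;
    $text =~ s|(<head\b[^>]*>).*?(<title>.*?</title>).*?(</head>)|$1$2$3|si;
    return $text;
}

print empty_head($html);
print keep_title($html);
```

The /s modifier lets "." match newlines so the pattern can span the whole head block, and /i tolerates <HEAD> or <Head>.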

Rob

-----Original Message-----
From: Wiggins d'Anconia [mailto:[EMAIL PROTECTED]
Sent: Thursday, September 04, 2003 8:48 PM
To: 'Sara'
Cc: beginperl
Subject: Re: Stripping HTML from a text file.

Won't this remove *everything* between the given tags? Or maybe I 
misunderstood the question, I thought she wanted to remove the "code" 
from all of the contents between two tags?

Because of the complexity and variety of HTML code, the number of 
different tags, etc. I would suggest using an HTML parsing module for 
this task. HTML::TokeParser::Simple has worked very well for me in the 
past.  There are a number of examples available. If this is what you 
want and you get stuck on the module then come back with questions. 
There are also the base modules, such as HTML::Parser, that the module 
mentioned above builds on; check CPAN for others.

http://danconia.org

Hanson, Rob wrote:
> A simple regex will do the trick...
> 
> # untested
> $text = "...";
> $text =~ s|<head>.*?</head>||s;
> 
> Or something more generic...
> 
> # untested
> $tag = "head";
> $text =~ s|<$tag[^>]*?>.*?</$tag>||s;
> 
> This second one also allows for possible attributes in the start tag.  You
> may need more than this if the HTML isn't well formed, or if there are
> extra spaces in your tags.
> 
> If you want something for the command line you could do this...
> 
> (Note: for *nix, needs modification for Win [untested])
> perl -e '$x=join("",<>);$x=~s|<head>.*?</head>||s;print $x' myfile.html >
> newfile.html
> 
> Rob
> 
> 
> -----Original Message-----
> From: Sara [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 03, 2003 6:32 AM
> To: beginperl
> Subject: Stripping HTML from a text file.
> 
> 
> I have a couple of text files with html code in them.. e.g.
> 
> ---------- Text File --------------
> <html>
>     <head>
>         <title>This is Test File</title>
>     </head>
> <body>
> <font size=2 face=arial>This is the test file contents<br>
> <p>
> blah blah blah.........
> </body>
> </html>
> 
> -----------------------------------------
> 
> What I want to do is to remove/delete HTML code from the text file, from
> one certain tag up to another.
> 
> For example; I want to delete the code completely that comes in between
> <head> and </head> (including any style tags and embedded javascripts etc)
> 
> Any ideas?
> 
> Thanks in advance.
> 
> Sara.
> 

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
