Re: Stripping HTML from a text file.

Sara Thu, 04 Sep 2003 18:33:15 -0700

Thanks a lot Hanson,

It worked for me.


Yep, you are right "The regex way is good for quick and dirty HTML work."

and especially for the newbies like me :))

Sara.


----- Original Message -----
From: "Hanson, Rob" <[EMAIL PROTECTED]>
To: "'Wiggins d'Anconia'" <[EMAIL PROTECTED]>; "'Sara'"
<[EMAIL PROTECTED]>
Cc: "beginperl" <[EMAIL PROTECTED]>
Sent: Friday, September 05, 2003 5:55 AM
Subject: RE: Stripping HTML from a text file.


: > Or maybe I misunderstood the question
:
: Or maybe I did :)
:
: > HTML::TokeParser::Simple
:
: I agree... but only if you are looking for a strong permanent solution. The
: regex way is good for quick and dirty HTML work.
:
: Sara, if you need to keep the <head> tags, then you could use this modified
: version...
:
: # untested
: $text = "...";
: $text =~ s|(<head>).*?(</head>)|$1$2|s;
:
: ...Or if you wanted to keep the <title> tag...
:
: # untested
: $text = "...";
: $text =~ s|(<head>).*?(<title>.*?</title>).*?(</head>)|$1$2$3|s;
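
For anyone trying the title-keeping substitution, here is a minimal sketch on a sample modeled on the file from the original question. The <style> line and the /i flag are additions for illustration, not part of the posted snippet. The <title> element gets its own capture group so it can be put back along with the two <head> tags.

```perl
use strict;
use warnings;

# Sample input modeled on the file in the original question;
# the <style> line is an assumption added to show extra head content.
my $text = <<'HTML';
<html>
    <head>
        <title>This is Test File</title>
        <style>body { color: black }</style>
    </head>
<body>
blah blah blah.........
</body>
</html>
HTML

# Three capture groups: opening tag, the whole <title> element, closing tag.
# /s lets "." match newlines; /i tolerates <HEAD> or <Title>.
$text =~ s|(<head>).*?(<title>.*?</title>).*?(</head>)|$1$2$3|si;

print $text;
```

Everything else inside the head is dropped. This assumes the title is present and that the tags carry no attributes; if there is no <title> before the first </head>, the pattern may either fail to match or run on to a later </head>.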
:
: Rob
:
: -----Original Message-----
: From: Wiggins d'Anconia [mailto:[EMAIL PROTECTED]
: Sent: Thursday, September 04, 2003 8:48 PM
: To: 'Sara'
: Cc: beginperl
: Subject: Re: Stripping HTML from a text file.
:
:
: Won't this remove *everything* between the given tags? Or maybe I
: misunderstood the question, I thought she wanted to remove the "code"
: from all of the contents between two tags?
:
: Because of the complexity and variety of HTML code, the number of
: different tags, etc. I would suggest using an HTML parsing module for
: this task. HTML::TokeParser::Simple has worked very well for me in the
: past.  There are a number of examples available. If this is what you
: want and you get stuck on the module then come back with questions.
: There are also the base modules, such as HTML::Parser, that the one
: previously mentioned builds on; check CPAN for others.
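
A rough sketch of the parser route, assuming HTML::TokeParser::Simple is installed from CPAN (it is not part of core Perl). The variable names and the sample text are made up for illustration. It copies every token through unchanged except those from <head> to </head>, and it drops the two head tags as well.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;   # CPAN module, not in core Perl

# Sample input modeled on the file in the original question.
my $html = <<'HTML';
<html>
<head>
<title>This is Test File</title>
</head>
<body>
<p>blah blah blah.........
</body>
</html>
HTML

# Passing a reference to a string parses that string rather than a file.
my $parser = HTML::TokeParser::Simple->new(\$html);

my $in_head = 0;    # true while we are between <head> and </head>
my $out     = '';

while (my $token = $parser->get_token) {
    if ($token->is_start_tag('head')) { $in_head = 1; next; }
    if ($token->is_end_tag('head'))   { $in_head = 0; next; }
    # as_is returns the token's original text, so kept markup is untouched.
    $out .= $token->as_is unless $in_head;
}

print $out;
```

Because the parser does the tokenizing, attributes, mixed case and odd spacing in the tags should not need special handling here.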
:
: http://danconia.org
:
: Hanson, Rob wrote:
: > A simple regex will do the trick...
: >
: > # untested
: > $text = "...";
: > $text =~ s|<head>.*?</head>||s;
: >
: > Or something more generic...
: >
: > # untested
: > $tag = "head";
: > $text =~ s|<$tag[^>]*?>.*?</$tag>||s;
: >
: > This second one also allows for possible attributes in the start tag. You
: > may need more than this if the HTML isn't well formed, or if there are extra
: > spaces in your tags.
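
One possible way to loosen the generic version for the cases mentioned above, as a sketch only. strip_tag is a hypothetical helper name, not something from a module. It tolerates attributes, upper or mixed case, and whitespace before the closing ">", and it removes every occurrence rather than just the first.

```perl
use strict;
use warnings;

# strip_tag is a hypothetical helper, written for this example only.
sub strip_tag {
    my ($html, $tag) = @_;
    # \Q...\E quotes the tag name; \b stops "head" from matching <header>.
    # [^>]* allows attributes; \s* allows "</head >".
    # /g removes every occurrence, /s lets "." span lines, /i ignores case.
    $html =~ s|<\Q$tag\E\b[^>]*>.*?</\Q$tag\E\s*>||gsi;
    return $html;
}

my $messy = qq{<HEAD profile="x">\n<title>T</title>\n</head >\n<header>keep me</header>\n};
my $clean = strip_tag($messy, 'head');
print $clean;
```

It is still a regex, so an attribute value containing ">" or nested tags of the same name will confuse it.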
: >
: > If you want something for the command line you could do this...
: >
: > (Note: for *nix, needs modification for Win [untested])
: > perl -e '$x=join("",<>);$x=~s|<head>.*?</head>||s;print $x' myfile.html > newfile.html
: >
: > Rob
: >
: >
: > -----Original Message-----
: > From: Sara [mailto:[EMAIL PROTECTED]
: > Sent: Wednesday, September 03, 2003 6:32 AM
: > To: beginperl
: > Subject: Stripping HTML from a text file.
: >
: >
: > I have a couple of text files with html code in them.. e.g.
: >
: > ---------- Text File --------------
: > <html>
: >     <head>
: >         <title>This is Test File</title>
: >     </head>
: > <body>
: > <font size=2 face=arial>This is the test file contents<br>
: > <p>
: > blah blah blah.........
: > </body>
: > </html>
: >
: > -----------------------------------------
: >
: > What I want to do is to remove/delete HTML code from the text file from a
: > certain tag up to a certain tag.
: >
: > For example; I want to delete the code completely that comes in between
: > <head> and </head> (including any style tags and embedded javascripts etc)
: >
: > Any ideas?
: >
: > Thanks in advance.
: >
: > Sara.
: >
:
:
: --
: To unsubscribe, e-mail: [EMAIL PROTECTED]
: For additional commands, e-mail: [EMAIL PROTECTED]


