[Addendum] Stripping HTML from a text file.

Sara Thu, 04 Sep 2003 20:11:03 -0700

Okay, when I finally implemented this, it's still not working ... any ideas?
A simple regex like s/<head>/<blah>/g; works, but this one doesn't.


----------------------------------------------------
#!/usr/bin/perl

use LWP::Simple;

print "Content-type: text/html\n\n";

$url = 'http://yahoo.com';

$html = get($url);

@html = split(/\n/,$html);

foreach $line(@html) {

chomp ($line);

$line =~ s|<head>.*?<\/head>||s;

print "$line\n";

}
#########################################



Thanks for any input.

Sara.
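
A likely reason the script above leaves the page unchanged: split(/\n/, $html) breaks the page into separate lines, so <head> and </head> never end up in the same $line, and the /s modifier only lets . cross newlines inside the one string being matched. Below is a minimal sketch that runs the substitution on the whole page at once. The heredoc is a made-up stand-in for what get($url) returns, and $stripped is a name introduced here for illustration; the \b, [^>]* and /i additions are assumptions meant to tolerate attributes and upper-case tags.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample standing in for the page returned by get($url).
my $html = <<'END_HTML';
<html>
    <head>
        <title>This is Test File</title>
    </head>
<body>
<p>blah blah blah</p>
</body>
</html>
END_HTML

# Work on the whole document in one string, not line by line, so that
# .*? (with /s) can span the newlines between <head> and </head>.
(my $stripped = $html) =~ s|<head\b[^>]*>.*?</head>||si;

print $stripped;
```

In the original script the same idea would mean dropping the split and the foreach loop and applying the substitution directly to $html before printing it.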



----- Original Message -----
From: "Sara" <[EMAIL PROTECTED]>
To: "Hanson, Rob" <[EMAIL PROTECTED]>; "'Wiggins d'Anconia'"
<[EMAIL PROTECTED]>
Cc: "beginperl" <[EMAIL PROTECTED]>
Sent: Wednesday, September 03, 2003 4:34 PM
Subject: Re: Stripping HTML from a text file.


: Thanks a lot Hanson,
:
: It worked for me.
:
: Yep, you are right "The regex way is good for quick and dirty HTML work."
:
: and especially for the newbies like me :))
:
: Sara.
:
:
: ----- Original Message -----
: From: "Hanson, Rob" <[EMAIL PROTECTED]>
: To: "'Wiggins d'Anconia'" <[EMAIL PROTECTED]>; "'Sara'"
: <[EMAIL PROTECTED]>
: Cc: "beginperl" <[EMAIL PROTECTED]>
: Sent: Friday, September 05, 2003 5:55 AM
: Subject: RE: Stripping HTML from a text file.
:
:
: : > Or maybe I misunderstood the question
: :
: : Or maybe I did :)
: :
: : > HTML::TokeParser::Simple
: :
: : I agree... but only if you are looking for a strong permanent solution.
: : The regex way is good for quick and dirty HTML work.
: :
: : Sara, if you need to keep the <head> tags, then you could use this
: : modified version...
: :
: : # untested
: : $text = "...";
: : $text =~ s|(<head>).*?(</head>)|$1$2|s;
: :
: : ...Or if you wanted to keep the <title> tag...
: :
: : # untested
: : $text = "...";
: : $text =~ s|(<head>).*?(<title>.*?</title>).*?(</head>)|$1$2$3|s;
: :
: : Rob
: :
: : -----Original Message-----
: : From: Wiggins d'Anconia [mailto:[EMAIL PROTECTED]
: : Sent: Thursday, September 04, 2003 8:48 PM
: : To: 'Sara'
: : Cc: beginperl
: : Subject: Re: Stripping HTML from a text file.
: :
: :
: : Won't this remove *everything* between the given tags? Or maybe I
: : misunderstood the question, I thought she wanted to remove the "code"
: : from all of the contents between two tags?
: :
: : Because of the complexity and variety of HTML code, the number of
: : different tags, etc. I would suggest using an HTML parsing module for
: : this task. HTML::TokeParser::Simple has worked very well for me in the
: : past.  There are a number of examples available. If this is what you
: : want and you get stuck on the module then come back with questions.
: : There are also the base modules that it builds on, such as HTML::Parser;
: : check CPAN for others.
: :
: : http://danconia.org
: :
: : Hanson, Rob wrote:
: : > A simple regex will do the trick...
: : >
: : > # untested
: : > $text = "...";
: : > $text =~ s|<head>.*?</head>||s;
: : >
: : > Or something more generic...
: : >
: : > # untested
: : > $tag = "head";
: : > $text =~ s|<${tag}\b[^>]*>.*?</${tag}>||s;
: : >
: : > This second one also allows for possible attributes in the start tag.
: : > You may need more than this if the HTML isn't well formed, or if there
: : > are extra spaces in your tags.
: : >
: : > If you want something for the command line you could do this...
: : >
: : > (Note: for *nix, needs modification for Win [untested])
: : > perl -e '$x=join("",<>);$x=~s|<head>.*?</head>||s;print $x' myfile.html >
: : > newfile.html
: : >
: : > Rob
: : >
: : >
: : > -----Original Message-----
: : > From: Sara [mailto:[EMAIL PROTECTED]
: : > Sent: Wednesday, September 03, 2003 6:32 AM
: : > To: beginperl
: : > Subject: Stripping HTML from a text file.
: : >
: : >
: : > I have a couple of text files with html code in them.. e.g.
: : >
: : > ---------- Text File --------------
: : > <html>
: : >     <head>
: : >         <title>This is Test File</title>
: : >     </head>
: : > <body>
: : > <font size=2 face=arial>This is the test file contents<br>
: : > <p>
: : > blah blah blah.........
: : > </body>
: : > </html>
: : >
: : > -----------------------------------------
: : >
: : > What I want to do is to remove/delete HTML code from the text file,
: : > from a certain tag up to a certain tag.
: : >
: : > For example, I want to completely delete the code that comes between
: : > <head> and </head> (including any style tags, embedded JavaScript, etc.).
: : >
: : > Any ideas?
: : >
: : > Thanks in advance.
: : >
: : > Sara.
: : >
: :
: :
: : --
: : To unsubscribe, e-mail: [EMAIL PROTECTED]
: : For additional commands, e-mail: [EMAIL PROTECTED]
:
:
