Re: [Addendum] Stripping HTML from a text file.

Octavian Rasnita Fri, 05 Sep 2003 06:46:18 -0700

It doesn't work because you check for the opening and closing <head> tags
in the same line, but you have split the text into separate lines, so the
<head> tag is probably on one line and the </head> tag on another.



You can use something like this:

$html =~ s|<head[^>]*>.*?</head[^>]*>||si;

So you don't need to check each line separately.
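
To make the difference visible, here is a minimal sketch that runs the same
substitution both ways. The sample page is made up for the example (it stands
in for whatever get($url) returns), and the variable names are only
illustrative:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample document standing in for the fetched page (an assumption for this example).
my $html = <<'END';
<html>
<HEAD profile="x">
<title>This is Test File</title>
</head>
<body>
<p>blah blah blah</p>
</body>
</html>
END

# Line-by-line attempt: no single line holds both tags, so nothing is removed.
my @lines = split /\n/, $html;
s|<head[^>]*>.*?</head[^>]*>||si for @lines;
my $per_line = join("\n", @lines) . "\n";

# Whole-string attempt: /s lets "." cross newlines, /i ignores case.
(my $whole = $html) =~ s|<head[^>]*>.*?</head[^>]*>||si;

print "line by line changed anything? ", ($per_line eq $html ? "no" : "yes"), "\n";
print $whole;
```

Note that <head[^>]*> would also match a tag whose name merely starts with
"head", so this is still a quick and dirty approach rather than real parsing.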

It works for the head tag, but it doesn't work for all tags, because you
might have a page like this:

<table>
<tr><td>a1</td>
<td>a2</td>
<td>a3</td></tr>
<tr><td>b1</td>
<td>
<table>
...
</table>
</td>
<td>b3</td></tr>
</table>

If you want to delete everything from <table> up to its corresponding
</table>, it is more complicated: that simple regular expression will
delete up to the first </table>, not up to the second one.
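
One possible way around this, short of a real HTML parser, is to count the
open and close tags and cut only when the count returns to zero. The sketch
below is my own illustration, not something from a module:
strip_outer_element is a hypothetical helper name, and it assumes the tags
are reasonably well formed (no <table> text hidden in comments or scripts).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove the first <$tag> element together with everything
# up to its matching close tag, counting nested open tags along the way.
sub strip_outer_element {
    my ($html, $tag) = @_;
    my ($depth, $start) = (0, undef);

    # Visit every open or close tag of this name, in document order.
    while ($html =~ m{<(/?)\Q$tag\E\b[^>]*>}gi) {
        if ($1 eq '') {                       # open tag
            $start = $-[0] if $depth == 0;    # where the outermost element begins
            $depth++;
        }
        elsif ($depth > 0) {                  # close tag matching an earlier open tag
            $depth--;
            if ($depth == 0) {                # the outermost element just closed
                substr($html, $start, $+[0] - $start) = '';
                return $html;
            }
        }
        # a stray close tag seen at depth 0 is ignored
    }
    return $html;    # unbalanced or absent: leave the text alone
}

my $page = "before<table><tr><td><table>inner</table></td></tr></table>after";
print strip_outer_element($page, 'table'), "\n";    # prints "beforeafter"

# For comparison, the simple non-greedy regex stops at the first </table>:
(my $naive = $page) =~ s|<table[^>]*>.*?</table[^>]*>||si;
print "$naive\n";
```

The second print shows the leftover </td></tr></table> that the simple regex
leaves behind. For anything more irregular than this, a parsing module is
probably the safer choice.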


----- Original Message ----- 
From: "Sara" <[EMAIL PROTECTED]>
To: "Sara" <[EMAIL PROTECTED]>; "Hanson, Rob" <[EMAIL PROTECTED]>;
"'Wiggins d'Anconia'" <[EMAIL PROTECTED]>
Cc: "beginperl" <[EMAIL PROTECTED]>
Sent: Wednesday, September 03, 2003 4:08 PM
Subject: [Addendum] Stripping HTML from a text file.


> Okay, when I finally implemented this, again it's not working ... any ideas?
> A simple regex like s/<head>/<blah>/g; is working but not this one?
>
> ----------------------------------------------------
> #!/usr/bin/perl
>
> use LWP::Simple;
>
> print "Content-type: text/html\n\n";
>
> $url = 'http://yahoo.com';
>
> $html = get($url);
>
> @html = split(/\n/,$html);
>
> foreach $line(@html) {
>
> chomp ($line);
>
> $line =~ s|<head>.*?<\/head>||s;
>
> print "$line\n";
>
> }
> #########################################
>
>
>
> Thanks for any input.
>
> Sara.
>
>
>
> ----- Original Message -----
> From: "Sara" <[EMAIL PROTECTED]>
> To: "Hanson, Rob" <[EMAIL PROTECTED]>; "'Wiggins d'Anconia'"
> <[EMAIL PROTECTED]>
> Cc: "beginperl" <[EMAIL PROTECTED]>
> Sent: Wednesday, September 03, 2003 4:34 PM
> Subject: Re: Stripping HTML from a text file.
>
>
> : Thanks a lot Hanson,
> :
> : It worked for me.
> :
> : Yep, you are right "The regex way is good for quick and dirty HTML
> : work."
> :
> : and especially for the newbies like me :))
> :
> : Sara.
> :
> :
> : ----- Original Message -----
> : From: "Hanson, Rob" <[EMAIL PROTECTED]>
> : To: "'Wiggins d'Anconia'" <[EMAIL PROTECTED]>; "'Sara'"
> : <[EMAIL PROTECTED]>
> : Cc: "beginperl" <[EMAIL PROTECTED]>
> : Sent: Friday, September 05, 2003 5:55 AM
> : Subject: RE: Stripping HTML from a text file.
> :
> :
> : : > Or maybe I misunderstood the question
> : :
> : : Or maybe I did :)
> : :
> : : > HTML::TokeParser::Simple
> : :
> : : I agree... but only if you are looking for a strong permanent solution.
> : : The regex way is good for quick and dirty HTML work.
> : :
> : : Sara, if you need to keep the <head> tags, then you could use this
> : : modified version...
> : :
> : : # untested
> : : $text = "...";
> : : $text =~ s|(<head>).*?(</head>)|$1$2|s;
> : :
> : : ...Or if you wanted to keep the <title> tag...
> : :
> : : # untested
> : : $text = "...";
> : : $text =~ s|(<head>).*?(<title>.*?</title>).*?(</head>)|$1$2$3|s;
> : :
> : : Rob
> : :
> : : -----Original Message-----
> : : From: Wiggins d'Anconia [mailto:[EMAIL PROTECTED]
> : : Sent: Thursday, September 04, 2003 8:48 PM
> : : To: 'Sara'
> : : Cc: beginperl
> : : Subject: Re: Stripping HTML from a text file.
> : :
> : :
> : : Won't this remove *everything* between the given tags? Or maybe I
> : : misunderstood the question, I thought she wanted to remove the "code"
> : : from all of the contents between two tags?
> : :
> : : Because of the complexity and variety of HTML code, the number of
> : : different tags, etc. I would suggest using an HTML parsing module for
> : : this task. HTML::TokeParser::Simple has worked very well for me in the
> : : past.  There are a number of examples available. If this is what you
> : : want and you get stuck on the module then come back with questions.
> : : There are also the base modules such as HTML::Parser, etc. that the one
> : : previously mentioned builds on, among others check CPAN.
> : :
> : : http://danconia.org
> : :
> : : Hanson, Rob wrote:
> : : > A simple regex will do the trick...
> : : >
> : : > # untested
> : : > $text = "...";
> : : > $text =~ s|<head>.*?</head>||s;
> : : >
> : : > Or something more generic...
> : : >
> : : > # untested
> : : > $tag = "head";
> : : > $text =~ s|<$tag[^>]*?>.*?</$tag>||s;
> : : >
> : : > This second one also allows for possible attributes in the start tag.
> : : > You may need more than this if the HTML isn't well formed, or if there
> : : > are extra spaces in your tags.
> : : >
> : : > If you want something for the command line you could do this...
> : : >
> : : > (Note: for *nix, needs modification for Win [untested])
> : : > perl -e '$x=join("",<>);$x=~s|<head>.*?</head>||s;print $x' myfile.html >
> : : > newfile.html
> : : >
> : : > Rob
> : : >
> : : >
> : : > -----Original Message-----
> : : > From: Sara [mailto:[EMAIL PROTECTED]
> : : > Sent: Wednesday, September 03, 2003 6:32 AM
> : : > To: beginperl
> : : > Subject: Stripping HTML from a text file.
> : : >
> : : >
> : : > I have a couple of text files with html code in them.. e.g.
> : : >
> : : > ---------- Text File --------------
> : : > <html>
> : : >     <head>
> : : >         <title>This is Test File</title>
> : : >     </head>
> : : > <body>
> : : > <font size=2 face=arial>This is the test file contents<br>
> : : > <p>
> : : > blah blah blah.........
> : : > </body>
> : : > </html>
> : : >
> : : > -----------------------------------------
> : : >
> : : > What I want to do is to remove/delete HTML code from the text file
> : : > from a certain tag up to a certain tag.
> : : >
> : : > For example; I want to delete the code completely that comes in
> : : > between <head> and </head> (including any style tags and embedded
> : : > javascripts etc)
> : : >
> : : > Any ideas?
> : : >
> : : > Thanks in advance.
> : : >
> : : > Sara.
> : : >
> : :
> : :
> : : --
> : : To unsubscribe, e-mail: [EMAIL PROTECTED]
> : : For additional commands, e-mail: [EMAIL PROTECTED]