Re: Stripping HTML from a text file.

drieux Thu, 04 Sep 2003 17:54:53 -0700


On Wednesday, Sep 3, 2003, at 03:32 US/Pacific, Sara wrote:
[..]

What I want to do is remove HTML code from the text file, from one given tag up to another.

For example, I want to completely delete the code that comes between <head> and </head> (including any style tags, embedded JavaScript, etc.).

Any ideas?


I would recommend that you look into HTML-Tree
<http://search.cpan.org/author/SBURKE/HTML-Tree-3.17/>
since I have found it a lovely way to do most anything
you might want to do when taking apart the tree structure
of an HTML document.
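
For what it's worth, a minimal sketch of the HTML-Tree route might look like the following. It assumes the HTML::TreeBuilder module from that distribution is installed; strip_head() and the sample markup are hypothetical names made up for illustration, not anything HTML-Tree provides.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # ships with the HTML-Tree distribution on CPAN

# strip_head() is a hypothetical helper name, not part of HTML-Tree.
sub strip_head {
    my ($html) = @_;

    # Parse the whole document into a tree of HTML::Element objects.
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Delete every <head> element along with its children
    # (title, style, script and so on).
    $_->delete for $tree->find_by_tag_name('head');

    my $out = $tree->as_HTML;
    $tree->delete;        # release the tree explicitly
    return $out;
}

# Illustrative input only.
print strip_head(
    '<html><head><title>t</title><style>p{}</style></head>'
  . '<body><p>hello</p></body></html>'
), "\n";
```

Because the parser works on the document structure rather than on lines, it should not matter where the line breaks fall or how the tags are capitalised. Note that as_HTML re-serialises the document, so whitespace and optional end tags may differ from the input.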

I'm not too sure you really want to blitz EVERYTHING in
the 'head' section...

but you might try, say:

        while ( my $line = <INFO1> ) {
                if ( $line =~ /<head>/ ) {
                        #
                        # remove everything after it. print if we have something
                        #
                        $line =~ s/<head>.*//;
                        print $line unless ($line =~ /^\s*$/);
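                        #
                        # spin until we see the closing tag - assumes well formedness,
                        # but stop at end of file so a missing </head> cannot loop forever
                        #
                        do { $line = <INFO1>; } until ( !defined($line) or $line =~ /<\/head>/ ) ;
                        last unless defined $line;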
                        #
                        # strip everything before the closing tag
                        #
                        $line =~ s/.*<\/head>//;
                        next if ($line =~ /^\s*$/); # get new line if blank.
                }
                print $line ;
        }
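
but this assumes that the start and stop tags are on different
lines. Other markup sharing a line with one of the tags is fine, eg

        </head><body text="#000000" bgcolor="#FFFFFF">....

but a one-line head such as

        <head><title>foo</title></head>

will not be handled, since the loop only looks for the closing
tag on later lines. The match is also case sensitive and expects
a bare <head> with no attributes.
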
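If the line-by-line assumptions are a problem, one alternative is to read the whole file into a single string and remove the head in one substitution. This is only a sketch under stated assumptions: remove_head() and the sample string are hypothetical, and a regex can still be fooled by, say, a literal "</head>" inside a script or a comment.

```perl
use strict;
use warnings;

# remove_head() is a hypothetical helper name used only for this sketch.
sub remove_head {
    my ($html) = @_;

    # /s lets . match newlines, /i ignores case, and .*? stops at the
    # first closing tag; [^>]* allows attributes in the opening tag.
    $html =~ s{<head\b[^>]*>.*?</head\s*>}{}gis;
    return $html;
}

# Illustrative input only: upper-case tag, attribute, head on one line.
my $sample = qq{<html><HEAD profile="x"><title>t</title></head>}
           . qq{<body text="#000000">\nhi\n</body></html>\n};

print remove_head($sample);
```

To use it on the INFO1 filehandle from the loop above, slurp the file first with something like `my $html = do { local $/; <INFO1> };` and pass $html in.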
ciao
drieux
