Hi All.

I'm trying to either strip everything outside the <body> tags or match
everything inside them for three nearly identical web pages. I need the
content of the pages minus header/footer.

Ultimately I need to glue them all together into one valid HTML document,
but that's for later.

In general, should I be regexing for the stuff I want or the stuff I don't
want? Does it matter which? If I "strip" out the html I don't want, do I
save a step when it comes to the next phase, operating on the remaining
content?
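
For a page with a single <body> element the two approaches should end up with the same text, so the choice is mostly about which reads more clearly. A minimal sketch on an in-memory sample (the sample page and variable names are made up for illustration, and a regex is only a rough tool for HTML):

```perl
use strict;
use warnings;

# Hypothetical sample standing in for one of the saved pages.
my $html = <<'HTML';
<html>
<head><title>Sample</title></head>
<body>
<div>content I want</div>
</body>
</html>
HTML

# Approach 1: extract. Capture what sits between <body ...> and </body>.
# /s lets . match newlines, /i ignores the case of the tags.
my ($extracted) = $html =~ m{<body[^>]*>(.*?)</body>}si;

# Approach 2: strip. Delete everything up to and including <body ...>,
# then everything from </body> to the end of the string.
(my $stripped = $html) =~ s{\A.*?<body[^>]*>}{}si;
$stripped =~ s{</body>.*\z}{}si;

print "same result\n" if $extracted eq $stripped;
```

Either way the result is a plain string holding the body content, so the later gluing step starts from the same place.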

--------------------- stripping method (not working) ----------------------

my @urls = qw(
        'http://colomanager/saved/1358381265931.htm'
        'http://colomanager/saved/1375561135115.htm'
        'http://colomanager/saved/1388446037003.htm'
);

foreach my $u (@urls) {

        get "$u\n";

        while ($u =~ m/(.*?)<html>.*<body>(.*)/ /(.*?)<\/body>.*<\/html>(.*)/);

        print "$1\n";
-------------------------- end stripping method ------------------------
# The above does not compile.
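
A few things in the stripping snippet look likely to cause trouble. Two patterns sit side by side with no operator between them, and the foreach block is never closed. qw() does not interpret quotes, so the single quotes become part of each URL. The return value of get is thrown away, so the match runs against the URL in $u rather than the page. A minimal sketch of one way to write it, assuming get comes from LWP::Simple and that each page has exactly one <body> element (strip_outside_body and fetch_page are hypothetical helper names):

```perl
use strict;
use warnings;

# Remove everything outside <body>...</body> and return what is left.
# Assumes one <body> element per page; input without one is returned unchanged.
sub strip_outside_body {
    my ($html) = @_;
    $html =~ s{\A.*?<body[^>]*>}{}si;   # drop the header up to <body ...>
    $html =~ s{</body>.*\z}{}si;        # drop </body> and the footer
    return $html;
}

# Fetch one page with LWP::Simple (assumed installed; it is not a core module).
sub fetch_page {
    my ($url) = @_;
    require LWP::Simple;                # loaded only when a fetch is requested
    return LWP::Simple::get($url);      # undef if the request fails
}

# URLs are taken from the command line, so nothing is fetched without them.
for my $url (@ARGV) {
    my $html = fetch_page($url);
    unless (defined $html) {
        warn "Could not fetch $url\n";
        next;
    }
    print strip_outside_body($html), "\n";
}
```

Run it with the three URLs as arguments. If the list stays in the script instead, it would be written as qw(http://colomanager/saved/1358381265931.htm ...) with no quotes around the entries.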
-------------------------- extracting (it compiles) --------------------

my @urls = qw(
        'http://colomanager/saved/1358381265931.htm'
        'http://colomanager/saved/1375561135115.htm'
        'http://colomanager/saved/1388446037003.htm'
);

foreach my $u (@urls) {

        get "$u\n";

        $u =~ /^<div .*<\/div>$/;

        # What next? Where does the extracted stuff go? How do I access it?

}
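
On where the extracted text goes: a pattern only hands text back if it has capture parentheses, and the snippet above has none. With parentheses, the captured text is available in $1 after a successful match, or as the return value of the match in list context. Without /s the dot does not match newlines, so a pattern spanning several lines also needs that modifier. A minimal sketch, using made-up in-memory pages in place of the fetched ones (extract_body, combine_pages and the title are hypothetical), which also shows one way to glue the pieces into a single document:

```perl
use strict;
use warnings;

# Return the inner HTML of <body>, or undef when no <body> element is found.
sub extract_body {
    my ($html) = @_;
    # In list context a successful match returns its captures, so the text
    # lands directly in $body (the same text is also available in $1).
    my ($body) = $html =~ m{<body[^>]*>(.*?)</body>}si;
    return $body;
}

# Wrap the collected fragments in one minimal HTML skeleton.
# The title is a placeholder, not something taken from the original pages.
sub combine_pages {
    my (@bodies) = @_;
    return join "\n",
        '<html>',
        '<head><title>Combined pages</title></head>',
        '<body>',
        @bodies,
        '</body>',
        '</html>';
}

# Hypothetical stand-ins for the three fetched pages.
my @pages = (
    '<html><head></head><body><div>page 1</div></body></html>',
    '<html><head></head><body><div>page 2</div></body></html>',
    '<html><head></head><body><div>page 3</div></body></html>',
);

# Keep only the pages where a body was actually found.
my @bodies = grep { defined } map { extract_body($_) } @pages;
print combine_pages(@bodies), "\n";
```

The skeleton is deliberately bare. Whether the result counts as valid HTML depends on what the page bodies contain, so it is worth running the output through a validator.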


-- 
Monkeys are superior to men in this:
when a monkey looks into a mirror, he sees a monkey.

-- Malcolm De Chazal


