Re: to clean and copy

John W. Krahn Mon, 24 Mar 2003 15:28:50 -0800

Adriano Allora wrote:
> 
> Hi to all,

Hello,


> my input = a directory in which I stored a certain amount of text files
> downloaded by some newsgroups;
> my desired_process = to clean all the files and copy them all in
> another directory;
> my desired_output = another directory in which there are all cleaned
> files;
> 
> my problem = my script doesn't run (I suppose: there's no output and
> top command line says that perl does not use the processor);
> my request = where I mistake?
> 
> ~~~~~~~~~~~~THE SCRIPT~~~~~~~~~~~~~~~~~~~~
> 
> #!/usr/bin/perl -w
> use strict;
> my $testo;

my $txt;

> @ARGV = <original_ones/*.txt>;
> while(<>){
>     tr/\015\012/\n/s;
>     tr/\=\"\*\^\_\-\+\' //s;

The only character you need to backslash is '-':

    tr/="*^_\-+' //s;

Or not, if it is at the beginning or end of the list:

    tr/-="*^_+' //s;

That is the same as:

    tr/-="*^_+' /-="*^_+' /s;

Which replaces multiple contiguous '-' with a single '-' AND multiple contiguous '='
with a single '=' AND multiple contiguous '"' with a single '"' AND etc.  Is that
what you want?
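To make the difference concrete, here is a small sketch comparing the squeeze form with a delete form. The sample string is made up for illustration, and the /d variant is only one possible alternative, not necessarily what the original script intended.

```perl
use strict;
use warnings;

# Hypothetical sample line, not taken from the original files.
my $sample = q{foo ---- bar  ==  "" baz};

# Squeeze: each run of one listed character collapses to a single copy of it.
( my $squeezed = $sample ) =~ tr/-="*^_+' //s;

# Delete: every listed character is removed (the space is left out of the list).
( my $deleted = $sample ) =~ tr/-="*^_+'//d;

print "squeezed: $squeezed\n";
print "deleted:  $deleted\n";
```

With /s the punctuation stays in the text, only shortened. If the goal is to get rid of it, /d (or a replacement list) is probably closer to what is wanted.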


>     s/Newsgoups: it\..+|Subject: .+|Date: .+|Message-ID: .+|References: .+|Date: 
> .+//g;

Your regular expression is removing everything up to but not including the newline
at the end.  Do you want to keep the blank line or remove the whole line?  Also, the
regular expression is not anchored so the literal strings will match if they are
anywhere in the line.

Remove whole line:

    $_ = '' if /^(?:Newsgroups: it\.|(?:Subject:|Date:|Message-ID:|References:) )/;

Leave the newline at the end:

    $_ = "\n" if /^(?:Newsgroups: it\.|(?:Subject:|Date:|Message-ID:|References:) )/;
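A quick way to see what the anchor buys you is to test a header line and a body line that merely mentions a header name. The helper name is_header and the sample lines are made up for this sketch, and it uses only a subset of the header names.

```perl
use strict;
use warnings;

# Hypothetical helper: true only when the line starts with one of the header names.
sub is_header {
    my ($line) = @_;
    return $line =~ /^(?:Subject:|Date:|Message-ID:|References:) / ? 1 : 0;
}

# Hypothetical sample lines.
my @lines = (
    "Subject: test message\n",
    "He changed the Subject: line again\n",
);

for my $line (@lines) {
    printf "%-7s %s", ( is_header($line) ? 'header' : 'body' ), $line;
}
```

Without the ^ the second line would be treated as a header as well.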


>     s/(\w\')/$1 /g;
>     $txt .=$_;
>     if ( eof ) {
>         $txt =~ 
> s/(http:\/\/)?(\w){3,}(\.(\w(\-)?)+)+\.(\w){2,3}((\/)(\w)+((\.)(\w){1,5})*((\?)(\w){1,32}(=)(\w){1,32})*)*/
>  URL /g;

You are using a LOT of capturing parentheses, which means the regex is doing extra work.
Use non-capturing parentheses instead.

        $txt =~ s!(?:http://)?\w{3,}(?:\.(?:\w-?)+)+\.\w{2,3}(?:/\w+(?:\.\w{1,5})*(?:\?\w{1,32}=\w{1,32})*)*! URL !g;
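As a rough check that the non-capturing version still matches the same kind of text, the pattern can be wrapped in a small sub and run on a sample. The name mask_urls and the sample sentence are assumptions made for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the non-capturing pattern.
sub mask_urls {
    my ($text) = @_;
    $text =~ s!(?:http://)?\w{3,}(?:\.(?:\w-?)+)+\.\w{2,3}(?:/\w+(?:\.\w{1,5})*(?:\?\w{1,32}=\w{1,32})*)*! URL !g;
    return $text;
}

# Hypothetical sample sentence.
my $masked = mask_urls('see http://www.example.com/index.html for details');
print "$masked\n";
```

The replacement keeps a space on each side of URL, so the surrounding words end up separated by two spaces until the later whitespace squeeze runs.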


>         $txt =~ s/[0-9]+/ /g;

Perl provides the \d character class, which for plain ASCII text like this matches the
same characters as [0-9].

        $txt =~ s/\d+/ /g;

Or you could use tr///:

        $txt =~ tr/0-9/ /s;
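If you want to convince yourself that the two forms agree, a sketch like this compares them on the same input. The sub names and the sample text are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helpers: the same cleanup done with s/// and with tr///.
sub digits_via_regex { my ($s) = @_; $s =~ s/\d+/ /g;  return $s }
sub digits_via_tr    { my ($s) = @_; $s =~ tr/0-9/ /s; return $s }

my $sample = 'posted 24 Mar 2003 at 15:28';    # hypothetical sample text

print digits_via_regex($sample), "\n";
print digits_via_tr($sample),    "\n";
```

Note that tr///s only squeezes the characters it has just translated, so spaces that were already in the text are left alone, the same as with the s/// version.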


>         $txt =~ tr/\n\r\f\t //s;

That is the same as:

        $txt =~ tr/\n\r\f\t /\n\r\f\t /s;

Which replaces multiple contiguous "\n" with a single "\n" AND multiple contiguous "\r"
with a single "\r" AND multiple contiguous "\f" with a single "\f" AND etc.  Is that
what you want?
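In case the intent was to turn any run of mixed whitespace into one space, here is a sketch of both behaviours side by side. The sub names and the sample string are assumptions for illustration only.

```perl
use strict;
use warnings;

# Squeeze each whitespace character separately (what the original line does).
sub squeeze_each { my ($s) = @_; $s =~ tr/\n\r\f\t //s;  return $s }

# Map every whitespace character to a space, then squeeze (one possible alternative).
sub collapse_all { my ($s) = @_; $s =~ tr/\n\r\f\t / /s; return $s }

my $sample = "a\n\n\tb  \r\n c";    # hypothetical sample text

print squeeze_each($sample), "\n";
print collapse_all($sample), "\n";
```

The first form keeps line breaks and tabs in the text, the second flattens everything onto one line, so which one fits depends on whether the cleaned files should keep their line structure.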


> # here start problems, I suppose

Yes, this is it.


>         open (ACTUAL, ">-" . "cleaned_ones/$ARGV") || die "I can't open $ARGV 
> because $!\n";
                        ^^^^
The ">" is the open mode and everything after it is the file name, so you are trying to
open the file "-cleaned_ones/$ARGV" but the directory "-cleaned_ones" does not exist.
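One way to avoid this kind of accident is the three-argument form of open, which keeps the mode apart from the file name. The following is only a sketch: write_cleaned is a made-up helper, the file name is invented, and a temporary directory stands in for cleaned_ones.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: write $text into $out_dir under the base name of $source.
sub write_cleaned {
    my ( $out_dir, $source, $text ) = @_;
    my $target = File::Spec->catfile( $out_dir, basename($source) );

    # Three-argument open: the mode can never be confused with the file name.
    open my $out, '>', $target or die "I can't open $target because $!";
    print {$out} $text;    # name the handle, otherwise the text goes to STDOUT
    close $out or die "I can't close $target because $!";
    return $target;
}

my $dir  = tempdir( CLEANUP => 1 );    # scratch directory for the demonstration
my $path = write_cleaned( $dir, 'original_ones/post1.txt', "cleaned text\n" );
print "wrote $path\n";
```

The helper also takes the base name of the source path, since a name coming from a glob such as original_ones/*.txt still has the directory in front of it.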


>         print $txt;
>         close ACTUAL;
>         $txt = "";
>     }
> }
> print "DONE.\n";


This will do what you want:

#!/usr/bin/perl -w
use strict;

local $/;  # slurp mode
@ARGV = <original_ones/*.txt>;
while ( <> ) {
    tr/\015\012/\n/s;
    tr/-="*^_+' //s;
    s/^(?:Newsgroups: it\.|(?:Subject:|Date:|Message-ID:|References:) ).+//gm;
    s/(\w')/$1 /g;
    s[(?: http:// )?
      \w{3,}
      (?:\.  (?:\w-?)+  )+
      \.\w{2,3}
      (?:/\w+  (?:\.\w{1,5})*  (?:\?\w{1,32}=\w{1,32})*  )*  ]
     [ URL ]xg;
    s/\d+/ /g;
    tr/\n\r\f\t //s;
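    if ( eof ) {
        # $ARGV still carries the "original_ones/" prefix, so strip the directory part
        ( my $name = $ARGV ) =~ s{.*/}{};
        close ARGV;
        open ACTUAL, ">cleaned_ones/$name" or die "I can't open cleaned_ones/$name because $!";
        print ACTUAL;
        close ACTUAL;
        }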
    }

__END__
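Because the script reads each file in one gulp, the header removal depends on the /m modifier. Here is a small sketch of that step on its own, with a made-up message and a hypothetical helper name, using a subset of the header names.

```perl
use strict;
use warnings;

# Hypothetical helper: strip header lines from a whole message held in one string.
sub strip_headers {
    my ($message) = @_;

    # /m makes ^ match after every newline, not only at the start of the string.
    $message =~ s/^(?:Subject:|Date:|Message-ID:|References:) .+//gm;

    # Squeeze the emptied lines that the substitution leaves behind.
    $message =~ tr/\n//s;
    return $message;
}

# Hypothetical sample message.
my $message = "Subject: hello\nDate: Mon, 24 Mar 2003\n\nBody text\n";
print strip_headers($message);
```

Since . does not match a newline here, each header is removed up to the end of its line and the newline itself is left for the later squeeze to deal with.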



John
-- 
use Perl;
program
fulfillment
