Re: pattern substitution

Rob Dixon Sat, 18 Nov 2006 07:18:59 -0800

Adriano Allora wrote:


hi to all,


Hi Adriano. Read my comments in-line and my solution at the end.

I've got a list of tagged words, like this one (only a bit longer):

<tLn nr=11>
e       CON     e
le      DET:def il
ha      VER:pres        avere|riavere
detto   VER:pper        dire
<       NOM     <unknown>
CORR    VER:infi        corre
>       NOM     <unknown>
e       CON     e
a       PRE     a


Posting a short example is fine, as long as the full data doesn't contain any
records significantly different from any of those in your sample. See later.

I need to transform the list above into this (in which the CORR tag isn't tagged):

<tLn nr=11>
e       CON     e
le      DET:def il
ha      VER:pres        avere|riavere
detto   VER:pper        dire
<CORR>
e       CON     e
a       PRE     a

So I tried to write this awful script:


Don't think I'm not going to tell you it's awful just because you've said so
already :)

#!/usr/bin/perl -w

use strict;


and always

use warnings;

as well. It didn't show any problems in this case but it's a useful thing to
have around.

$^I = '';


Better not to use this at all while you're testing, as you'll keep overwriting
the input data and have to restore it.
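
Once the script behaves, in-place editing can be switched back on more safely by giving $^I a non-empty value, which makes perl keep the original file under that extension. A minimal sketch, assuming a scratch file and a made-up helper name (edit_in_place_with_backup); the upper-casing is only a stand-in for the real transformation:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: rewrite $path in place, keeping the original
# as "$path.bak". Upper-casing stands in for the real transformation.
sub edit_in_place_with_backup {
    my ($path) = @_;
    local @ARGV = ($path);   # <> reads the files named in @ARGV
    local $^I   = '.bak';    # non-empty value: keep a backup with this suffix
    while (<>) {
        print uc $_;         # with $^I set, print writes to the new file
    }
    return;
}

# Try it on a scratch file so no real data is at risk.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/sample.txt";
open my $out, '>', $path or die "Cannot write $path: $!";
print {$out} "detto   VER:pper        dire\n";
close $out or die "Cannot close $path: $!";

edit_in_place_with_backup($path);
print "backup kept\n" if -e "$path.bak";
```

With the empty string, as in the original script, no backup is made at all, which is why testing with it is risky.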

my $tic = 0;
my  $toc = 0;


You're getting confused by having two flags when you need only one: you're
either inside a tag or you're not.

while(<>)
    {
    if(/^<       NOM     <unknown>.*/i)


No need for the .* at the end of the regex. And isn't it unnecessarily long?
Unless there are other records that look similar in your source data that you
don't want to match here I'd say

 if (/^<\s/) {

was adequate. (Starts with a left angle bracket followed by white space.)
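
A quick way to check that the shorter pattern is enough is to run it over the sample records. This is only a sketch: the helper name opens_tag is made up for the illustration, and the records are copied from the sample data above.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the record is the lone '<' that opens a tag.
sub opens_tag {
    my ($line) = @_;
    return $line =~ /^<\s/ ? 1 : 0;
}

# Records taken from the sample data in the question.
my @records = (
    "<tLn nr=11>",
    "<       NOM     <unknown>",
    "CORR    VER:infi        corre",
    ">       NOM     <unknown>",
);

for my $record (@records) {
    printf "%-30s %s\n", $record, opens_tag($record) ? 'opens a tag' : 'no match';
}
```

Only the second record matches. The '<tLn nr=11>' line is left alone because its '<' is followed by a letter rather than white space.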

        {
        $tic = 1;
        next;
        }
    next if /^>       NOM     <unknown>.*/i;
    next if $toc == 1;


Here is your problem. $toc is set to 1 inside the following if statement, and
from then on this 'next' fires on every line, so the '$toc = 0' just below it is
never reached. The loop then cycles unproductively until the end of the input.

    $toc = 0;
    if($tic==1)
        {
        s/^(\/?\w+).+/$1/gi;


There's no need for the /i modifier as you don't have any literal letters to
match. And there's no need for the /g as you have no /m so ^ can match only
once. (The line has only one beginning!) It's also clearer to use different
delimiters when you're matching slashes instead of escaping them, like this:

 s|^(/?\w+).+|$1|;

It looks like your records can also begin with a slash. You didn't show us any
data like that. This is the sort of thing I was talking about earlier - your
example data can be a subset of the real thing but it needs to be fully
representative.

And in the end all you've done is strip off the tail of the line after the first
word. As far as I can tell

 s/\s.*//;

is adequate. (Remove everything from the first white space to the end of the
line.)
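
To see the shorter substitution at work, here is a small sketch. The helper name first_word is invented for the example, and the record with a leading slash is a guess at what such data might look like, since none was posted.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the first whitespace-delimited word.
sub first_word {
    my ($line) = @_;
    chomp $line;          # drop the newline first
    $line =~ s/\s.*//;    # remove the first white space and the rest of the line
    return $line;
}

print first_word("CORR    VER:infi        corre\n"), "\n";   # prints "CORR"
print first_word("/CORR   VER:infi        corre\n"), "\n";   # prints "/CORR"
```

Because the pattern only looks for the first white space, it doesn't care whether the word starts with a slash or not.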

        chomp();


A bit late to be doing the chomp here, isn't it? Put it straight after the while
statement and then we know where we are for the entirety of the loop. Note that
the substitution on the line before doesn't remove the trailing newline for you:
'.' never matches a newline without the /s modifier, so the chomp is still
needed somewhere.

By the way, you don't need the parentheses.

        $_ = "<$_>";


It's a misconception to try to turn the actual input record into what the output
should be; after all, you didn't say $_='' above when the output should be
blank, you simply called next. Simply build the output record that you want and
print it. You'll see this in my solution below.
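
As a rough illustration of building the output instead of rewriting $_, the sketch below returns the text to print and leaves the record alone. The helper name tagged_output is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text to print for a record inside a tag,
# leaving the caller's copy of the record untouched.
sub tagged_output {
    my ($record) = @_;
    my ($word) = $record =~ /^(\S+)/;   # first run of non-whitespace
    return defined $word ? "<$word>\n" : '';
}

my $record = "CORR    VER:infi        corre";
print tagged_output($record);   # prints "<CORR>"
print "$record\n";              # the input record is still intact
```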

        $toc = 1;
        $tic = 0;
        }
    s/<>//g;


What's this for? If it's to fix a bug in your code that's generating '<>' when
it shouldn't then you should fix the bug instead. Or if there are records in
your source data that need this stripping out then you haven't shown any.
Otherwise I don't understand its purpose.

    print;
    }

It doesn't return errors, but it stops printing the output after the
first correction. Can someone explain why, and possibly suggest how
to correct the corrector?


Yep, awful script. Mainly because of unclear thinking I believe. If I was to ask
you what states $tic and $toc represented you couldn't easily tell me, and
because you don't really know what they do (they are yours after all and you
should know what they're for!) you've tried setting and clearing them at odd
places and failed to get it right.

Thanks to all,


alladr

PS: another strange thing: if I declare at the beginning of the script:
my($tic,$toc); it returns me an error...


It would have been nice if you had told us the message, but my guess is that
it's the warning '"my" variable $tic masks earlier declaration in same scope',
because you will then have declared the variables twice.
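
That guess is easy to reproduce. The sketch below compiles a snippet that declares the same lexicals twice and collects whatever perl complains about; the snippet is an assumption about what the changed script looked like, not the actual code. Note that the result is a warning rather than an error, so the script still runs.

```perl
use strict;
use warnings;

# Compile a snippet that declares the same lexicals twice and collect
# whatever warnings perl emits while doing so.
my @warnings;
{
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    eval q{
        my ($tic, $toc);
        my $tic = 0;
        my $toc = 0;
        1;
    } or die $@;
}

print for @warnings;
```

Dropping the second 'my' from each assignment, or dropping the combined declaration, makes the warning go away.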

Here's the fix. It has a single flag $tag, and it means the script is processing
'inside' a tag. It's set when a line beginning with a lone '<' is found and
cleared when the corresponding '>' appears.

HTH,

Rob


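use strict;
use warnings;

#$^I = '';   # leave in-place editing off until the output looks right

my $tag;     # true while we are between a lone '<' line and its '>' line

while (<>) {
  chomp;

  if (/^[<>]\s/) {
    # A lone '<' opens a tag and a lone '>' closes it; print neither
    $tag = /^</;
  }
  elsif ($tag) {
    # Inside a tag: print just the first word, wrapped in angle brackets
    print "<$1>\n" if /^(\S+)/;
  }
  else {
    print "$_\n";
  }
}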
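
For anyone who wants to check the logic without a data file, the same single-flag approach can be wrapped in a function and fed a few records from the question. This is a sketch only; the function name untag and the choice to return lines instead of printing them are mine.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the single-flag logic: takes the input
# records and returns the output lines instead of printing them.
sub untag {
    my @lines = @_;
    my $tag = 0;    # true while inside a '<' ... '>' pair
    my @out;
    for my $line (@lines) {
        chomp $line;
        if ($line =~ /^[<>]\s/) {
            $tag = $line =~ /^</ ? 1 : 0;
        }
        elsif ($tag) {
            push @out, "<$1>" if $line =~ /^(\S+)/;
        }
        else {
            push @out, $line;
        }
    }
    return @out;
}

my @sample = (
    "<tLn nr=11>",
    "detto   VER:pper        dire",
    "<       NOM     <unknown>",
    "CORR    VER:infi        corre",
    ">       NOM     <unknown>",
    "e       CON     e",
);

print "$_\n" for untag(@sample);
```

The three records around CORR collapse into a single '<CORR>' line, and everything else passes through unchanged.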

