Re: Processing Tagged Files (Real Newbie)

david Mon, 09 Sep 2002 10:45:11 -0700

Alexandre Enkerli wrote:

> Hello all,
> This one is probably very easy for most of you and it would help me a
> great deal if someone could tell me how to do it. I know there's a
> bunch of tutorials, perldocs and manuals out there, but I'm getting
> confused.
> 
> I receive files with the following line format:
> <tr>
> <td> &nbsp;ENKA10577207
> </td>
> <td> &nbsp;p1234567
> </td>
> <td> &nbsp;Enkerli-Smith Tremblay, Alexandra Jean-Sébastien
> </td>
> <td> &nbsp; alexandre.jean-Sé[EMAIL PROTECTED]
> </td>
> </tr>
> 
> These are:
> Permanent code
> Access code
> Names, First names
> Email address (usually of the [EMAIL PROTECTED], but not always)
> 
> The "Permanent code" is made up of:
> Last name's first three letters plus first name initial
> day (01-31)
> month+sex (month (01-12) for males, month+50 for females)
> two-digit year (00-99)
> extra digits (I don't know what they mean)
> 
> What I'd like to get is a tab-delimited file with the following
> Permanent code
> Access code
> Names
> First names
> Sex
> Age (or, at least, formatted birthdate)
> Email
> 
> And then do calculations by age and sex.
> Now, I've been doing this semi-manually, but I'm sure this is trivial
> to do in Perl and it looks like an ideal learning opportunity. What
> I've tried so far (with F[n]s, unpack, regexp...) doesn't really work.
> A complete script (likely a one-liner) would be wonderful.
> 
> Thanks in advance for your help.
> 
> Alexandre Enkerli
> Ph.D. Candidate
> Department of Folklore and Ethnomusicology
> Indiana University


Given how regularly formatted your HTML is, you can try the following:

open(HTML,'html.file') || die $!;
while(<HTML>){
        #-- found a row
        if(/<tr>/){

                #-- read the next 7 lines. bad...bad... see below for reason
                push(@html,scalar <HTML>) foreach(1..7);
                chomp(@html);

                #-- get day/month/year from permanent code
                #-- assume the first 6 digits will be it
                my($day,$month,$year) = $html[0] =~ /(\d{2})(\d{2})(\d{2})/;

                my $sex = $month > 50 ? 'female' : 'male';
                $month -= 50 if $month > 50; #-- only female codes carry the +50 offset
                $year += 1900; #-- assumes every birth year is 19xx

                $html[0] =~ s/^.+;\s*//; #-- leave only permanent code
                $html[2] =~ s/^.+;\s*//; #-- leave only access code

                #-- name and first name
                my($name,$fname) = $html[4] =~ /^.+;\s*(.+),\s*(.+)$/;

                #-- email
                $html[6] =~ s/^.+;\s*//;

                #-- print them, tab-delimited; $html[1] is a </td> line, the access code is $html[2]
                print "$html[0]\t$html[2]\t$name\t$fname\t$sex\t$month/$day/$year\t$html[6]\n";
                
                #-- get ready for next round
                @html = ();
        }
}
close(HTML);

__END__
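
As an aside, the date arithmetic on the permanent code can be checked on its own, away from the file reading. This is only a minimal sketch: decode_permanent_code is a hypothetical helper name, and it assumes (as the script does) that the first six digits of the code are day, month and year and that every birth year is 19xx.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a permanent code into (sex, birth date).
# Assumptions: the first six digits are day, month (+50 for females)
# and a two-digit year, and every year belongs to the 1900s.
sub decode_permanent_code {
    my ($code) = @_;

    my ($day, $month, $year) = $code =~ /(\d{2})(\d{2})(\d{2})/
        or return;    # no six-digit run: give back an empty list

    my $sex = $month > 50 ? 'female' : 'male';
    $month -= 50 if $month > 50;    # only female codes carry the offset

    # month/day/year, the same order the script prints
    return ($sex, sprintf('%02d/%02d/%04d', $month, $day, 1900 + $year));
}

my ($sex, $birthdate) = decode_permanent_code('ENKA10577207');
print "$sex\t$birthdate\n";
```

With the sample code from the question, the month field is 57, so the helper reports a female born in month 07. Working out an age from there needs today's date as well, which is left out here.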

The above is totally untested. Try it and see what happens. It's not very 
reliable; consider what happens if you all of a sudden have:

<tr>


<td>
something
</td>
<!-- more <td> ... </td> -->
</tr>

Notice that the <td> ... </td> pair spans 3 lines instead of 2, so the above 
will not work. The empty lines between <tr> and the first <td> also 
break the above code. Those cases are pretty easy to cope with, because you can 
simply change the code to read each <td> ... </td> pair instead of blindly 
assuming the next 7 lines will have everything. For example:

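if(/<td>(.*)/){
        #-- keep any text that follows <td> on the same line
        push(@td,$1) if length $1;
        while(<HTML>){
                #-- stop at the closing tag, otherwise collect the line
                /<\/td>/ ? last : push(@td,$_);
        }
}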

or something similar. But even if you code that into your script, it still 
fails if you have something like:

<tr>
<!-- nested table -->
<td>
<table><tr><td></td></tr></table>
</td>
</tr>

For those, you will want an HTML parser. Go to CPAN and you can find some. 
Hope this gets you started.
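
For the simple case without nested tables, the idea of reading each <td> ... </td> pair (instead of counting lines) could be sketched roughly like this. It is a different route from the line-by-line loop: it matches whole rows and cells with regular expressions. cells_of_row is a hypothetical helper name, the sample rows are adapted from the examples in this thread, and none of this copes with nested tables.

```perl
use strict;
use warnings;

# Hypothetical helper: return the cleaned text of every <td> ... </td>
# pair in one row, however the pair is spread across lines.
# Assumption: there are no nested tables inside the row.
sub cells_of_row {
    my ($row) = @_;
    my @cells = $row =~ m{<td[^>]*>(.*?)</td>}gis;
    for (@cells) {
        s/&nbsp;/ /g;      # replace the non-breaking-space entity
        s/^\s+|\s+$//g;    # trim whitespace, including newlines
    }
    return @cells;
}

# Sample input: the second row has blank lines after <tr> and a cell
# spread over three lines, the layout that breaks line counting.
my $html = <<'END_HTML';
<tr>
<td> &nbsp;ENKA10577207
</td>
<td> &nbsp;p1234567
</td>
</tr>
<tr>


<td>
something
</td>
</tr>
END_HTML

# One tab-delimited line per row
for my $row ($html =~ m{<tr[^>]*>(.*?)</tr>}gis) {
    print join("\t", cells_of_row($row)), "\n";
}
```

For a real file you would read the whole thing into $html first. The non-greedy matches stop at the first closing tag they see, which is exactly why a nested table still needs a proper parser.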

david
