Alexandre Enkerli wrote:

> Hello all,
> This one is probably very easy for most of you and it would help me a
> great deal if someone could tell me how to do it. I know there's a
> bunch of tutorials, perldocs and manuals out there, but I'm getting
> confused.
>
> I receive files with the following line format:
> <tr>
> <td> ENKA10577207
> </td>
> <td> p1234567
> </td>
> <td> Enkerli-Smith Tremblay, Alexandra Jean-Sébastien
> </td>
> <td> alexandre.jean-Sé[EMAIL PROTECTED]
> </td>
> </tr>
>
> These are:
> Permanent code
> Access code
> Names, First names
> Email address (usually of the [EMAIL PROTECTED], but not always)
>
> The "Permanent code" is made up of:
> Last name's first three letters plus first name initial
> day (01-31)
> month+sex (month (01-12) for males, month+50 for females)
> two-digit year (00-99)
> extra digits (I don't know what they mean)
>
> What I'd like to get is a tab-delimited file with the following
> Permanent code
> Access code
> Names
> First names
> Sex
> Age (or, at least, formatted birthdate)
> Email
>
> And then do calculations by age and sex.
> Now, I've been doing this semi-manually, but I'm sure this is trivial
> to do in Perl and it looks like an ideal learning opportunity. What
> I've tried so far (with F[n]s, unpack, regexp...) doesn't really work.
> A complete script (likely a one-liner) would be wonderful.
>
> Thanks in advance for your help.
>
> Alexandre Enkerli
> Ph.D. Candidate
> Department of Folklore and Ethnomusicology
> Indiana University
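
The permanent-code layout described above can be decoded on its own, before any HTML handling. What follows is only a minimal sketch under stated assumptions: the helper name decode_code is made up for illustration, the prefix is taken to be four ASCII capitals, and every two-digit year is assumed to fall in 19xx, which the code alone cannot confirm.

```perl
use strict;
use warnings;

# Decode a permanent code such as ENKA10577207 into its parts.
# decode_code is a hypothetical helper name, not part of any module.
sub decode_code {
    my ($code) = @_;

    # 4 letters, then day, month(+50 for females), two-digit year
    my ($prefix, $day, $month, $yy) =
        $code =~ /^([A-Z]{4})(\d{2})(\d{2})(\d{2})/
        or return;    # not a recognizable permanent code

    my $sex = $month > 50 ? 'F' : 'M';
    $month -= 50 if $month > 50;    # only females carry the +50 offset

    return {
        prefix => $prefix,
        sex    => $sex,
        # ASSUMPTION: all birth years are 19xx
        birth  => sprintf('%04d-%02d-%02d', 1900 + $yy, $month, $day),
    };
}

my $info = decode_code('ENKA10577207');
print join("\t", @{$info}{qw(prefix sex birth)}), "\n";
```

With the sample code from the question this gives day 10, month 57 (so female, July) and year 72. An ISO-style date is used here because it sorts correctly as plain text, which may help with the later calculations by age.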

Given how regular your HTML is, you can try the following:

    open(HTML, '<', 'html.file') or die $!;
    my @html;
    while(<HTML>){
        #-- found a row
        if(/<tr>/){
            #-- read the next 7 lines. bad...bad... see below for reason
            push(@html,scalar <HTML>) foreach(1..7);
            chomp(@html);

            #-- get day/month/year from permanent code
            #-- assume the first 6 digits will be it
            my($day,$month,$year) = $html[0] =~ /(\d{2})(\d{2})(\d{2})/;
            my $sex = $month > 50 ? 'female' : 'male';
            $month -= 50 if $month > 50;  #-- only females carry the +50
            $year += 1900;                #-- assumes everyone was born in 19xx

            #-- strip <td> and any leading &nbsp; or spaces
            $html[0] =~ s/^\s*<td>(?:&nbsp;|\s)*//; #-- leave only permanent code
            $html[2] =~ s/^\s*<td>(?:&nbsp;|\s)*//; #-- leave only access code

            #-- name and first name
            my($name,$fname) = $html[4] =~ /^\s*<td>(?:&nbsp;|\s)*(.+),\s*(.+)$/;

            #-- email
            $html[6] =~ s/^\s*<td>(?:&nbsp;|\s)*//;

            #-- print them, tab-delimited as you asked
            print join("\t", $html[0], $html[2], $name, $fname, $sex,
                       "$month/$day/$year", $html[6]), "\n";

            #-- get ready for next round
            @html = ();
        }
    }
    close(HTML);

    __END__

The above is totally untested. Try it and see what happens. It's not very reliable. Consider what happens if you suddenly have:

    <tr>
    <td>
    something
    </td>
    <!-- more <td> ... </td> -->
    </tr>

Notice that the <td> ... </td> pair spans 3 lines instead of 2, so the above will not work. Empty lines between <tr> and the first <td> also break it. Both are pretty easy to cope with: change the code to read each <td> ... </td> pair instead of blindly assuming the next 7 lines hold everything. For example:

    if(/<td>/){
        push(@td,$_);  #-- the <td> line itself may carry the value
        while(<HTML>){
            /<\/td>/ ? last : push(@td,$_);
        }
    }

or something similar. But even with that in your script, it still fails if you have something like:

    <tr>
    <!-- nested table -->
    <td>
    <table><tr><td></td></tr></table>
    </td>
    </tr>

For those, you will want an HTML parser. Go to CPAN and you can find some.

Hope this gets you started.

david
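
P.S. To make the "read each <td> ... </td> pair" idea concrete, here is a minimal sketch rather than a finished script. The helper name read_rows is made up, the sample row is adapted from the question with a placeholder email address, and it assumes no nested tables and no tags hidden inside comments, so those cases would still need a real HTML parser.

```perl
use strict;
use warnings;

# Return one array ref of cell texts per <tr> row.
# read_rows is a hypothetical helper; it does NOT handle nested tables.
sub read_rows {
    my ($fh) = @_;
    my (@rows, @cells, $cell);
    while (my $line = <$fh>) {
        my $row_ends = $line =~ m{</tr>};        # remember before editing $line
        $cell = '' if $line =~ s/^.*?<td>//;     # cell opens: keep text after <td>
        if (defined $cell) {
            my $cell_ends = $line =~ s{</td>.*}{}s;  # cell closes: keep text before </td>
            $cell .= $line;
            if ($cell_ends) {
                $cell =~ s/&nbsp;/ /g;           # treat &nbsp; as a plain space
                $cell =~ s/\s+/ /g;              # collapse newlines and runs of spaces
                $cell =~ s/^ | $//g;             # trim both ends
                push @cells, $cell;
                undef $cell;
            }
        }
        if ($row_ends) {
            push @rows, [@cells] if @cells;
            @cells = ();
        }
    }
    return @rows;
}

# Sample input: a blank line after <tr> and a cell spread over three lines.
my $html = <<'HTML';
<tr>

<td> ENKA10577207
</td>
<td> p1234567
</td>
<td>
Enkerli-Smith Tremblay, Alexandra
</td>
<td> someone@example.org
</td>
</tr>
HTML

open my $fh, '<', \$html or die $!;
for my $row (read_rows($fh)) {
    my ($code, $access, $names, $email) = @$row;
    my ($last, $first) = split /,\s*/, $names, 2;
    print join("\t", $code, $access, $last, $first, $email), "\n";
}
```

Because the cells are collected by tag rather than by counting lines, the blank line and the three-line cell in the sample no longer matter. For a real file, open it by name instead of reading from the in-memory string.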