On 03/27/2007 03:34 AM, Beginner wrote:
Hi,

I am trying to extract the ISO code and country name from a three-column table (taken from en.wikipedia.org) and have noticed a problem with accented characters such as Ô.

Below is my script and a sample of the data I am using. When I run the script, the line beginning CI for Côte d'Ivoire returns the string

"CI\tC" whereas I had hoped for "CI\tCôte d'Ivoire"

Does anyone know why \w+ does not match all of Côte d'Ivoire and how I can get around it in future?

TIA,
Dp.


==== extract.pl ========
#!/usr/bin/perl

use strict;
use warnings;

my $file = 'iso-alpha2.txt';

open(FH,$file) or die "Can't open $file: $!\n";
while (<FH>) {
        chomp;
        next if ($_ !~ /^\w{2}\s+/);
my ($code,$name) = ($_ =~ /^(\w{2})\s+(\w+\s\w+\s\w+s\w+|\w+\s\w+\s\w+|\w+\s\w+|\w+)/);
        print "$code\t$name\n";
}
===============

======== sample data ========
...snip
BY      Belarus         Previously named "Byelorussian S.S.R."
BZ      Belize  
CA      Canada  
CC      Cocos (Keeling) Islands         
CD Congo, the Democratic Republic of the Previously named "Zaire" ZR
CF      Central African Republic        
CG      Congo   
CH Switzerland Code taken from "Confoederatio Helvetica", its official Latin name
CI      Côte d'Ivoire   
CK      Cook Islands    
CL      Chile   
CM      Cameroon
===========


It's partly the encoding. Put «use encoding "iso-8859-1";» at the top of your program, and there will be a little improvement. However, that only gets you as far as "Côte d"; I doubt there is any encoding where apostrophe is in \w.
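
A minimal sketch of the encoding effect, assuming the data really is ISO-8859-1 (where ô is the single byte 0xF4). Instead of the encoding pragma it uses decode() from the core Encode module to turn the bytes into characters. The variable names are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

binmode STDOUT, ':encoding(UTF-8)';   # assumes a UTF-8 terminal

# The bytes of "Côte" as they would sit in an ISO-8859-1 file.
my $bytes = "C\xF4te";

# Decode the bytes into a Perl character string.
my $chars = decode('iso-8859-1', $bytes);

# Without decoding, \w stops at the first non-ASCII byte.
my ($raw)     = $bytes =~ /^(\w+)/;
# After decoding, ô counts as a word character.
my ($decoded) = $chars =~ /^(\w+)/;

print "raw:     ", length($raw),     " character(s) matched\n";
print "decoded: ", length($decoded), " character(s) matched ($decoded)\n";
```

When reading from a file, opening it with an I/O layer, for example open(my $fh, '<:encoding(iso-8859-1)', $file), is a common way to get the same decoding on every line.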

It's probably best to write a character class that lists every character you expect in a country name. In this case that means accented letters and the apostrophe as well as the plain ASCII letters.
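
One way such a class might look, as a sketch rather than a drop-in fix. It assumes the columns are separated by a tab or by two or more spaces, as in most of the sample lines, and that the strings are already decoded to characters. Here that comes from "use utf8" on in-line samples. The parse_line helper and the exact set of punctuation in the class are my own choices, not something from your script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # this source file is assumed to be saved as UTF-8

binmode STDOUT, ':encoding(UTF-8)';    # assumes a UTF-8 terminal

# Hypothetical helper: returns (code, name), or an empty list if the line does not match.
sub parse_line {
    my ($line) = @_;
    # A name is one or more "words" separated by single spaces. Each word may
    # contain any Unicode letter (\p{L}) plus apostrophe, parentheses, comma,
    # full stop and hyphen. A tab or a run of spaces ends the name.
    return $line =~ /^([A-Z]{2})\s+([\p{L}'(),.-]+(?: [\p{L}'(),.-]+)*)/;
}

my @samples = (
    "CI      Côte d'Ivoire   ",
    "CC      Cocos (Keeling) Islands         ",
    "BY      Belarus         Previously named ...",
);

for my $line (@samples) {
    my ($code, $name) = parse_line($line) or next;
    print "$code\t$name\n";
}
```

On a line where the separators have collapsed to single spaces, this pattern cannot tell the name from the remarks column, so splitting on the tab character may be the more robust choice if your file has tabs.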

Also, I advise you to use a programmer's editor that supports syntax highlighting. My VIM shows me that you missed the backslash that is supposed to be on the fourth "\s" in your regular expression.


