Hi -

I am new to character encodings and to how Perl handles them.
After a day of reading, I'm asking for help.

I am downloading data from an international (French) web site. The
HTTP headers show that the pages I am downloading are encoded
in iso-8859-1. Most characters (all the accented letters) are fine,
but some (e.g. the trademark sign) are incorrect. Here is a working
sample script:

#!/usr/bin/perl

use strict;
use warnings FATAL => 'all';

use LWP::Simple;
use Encode;

binmode STDOUT, ":utf8";

my $content = get( "http://www.formula1.com/race/circuitdetail/773.html" ) or
    die "get failed.\n";
my( $name ) = $content =~ /<td class="articleTitle">(.+?)<\/td>/s;
print "name w/o decode:\n";
print $name, "\n";

my $name1 = decode( 'iso-8859-1', $name );
print "name w/decode:\n";
print $name1, "\n";

$name =~ s/\x{99}/\x{2122}/g;
print "name manually converted:\n";
print $name, "\n";

The output is:

name w/o decode:
FORMULA 1 Gran Premio de España Telefónica 2007
name w/decode:
FORMULA 1 Gran Premio de España Telefónica 2007
name manually converted:
FORMULA 1™ Gran Premio de España Telefónica 2007

How do I get a proper conversion from iso-8859-1 to Perl's internal
UTF-8 representation? Is there a way to ask LWP to do this based on the
character encoding specified in the HTTP headers?
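
One possible explanation, offered as an assumption rather than something
the headers confirm: byte 0x99 is a C1 control character in iso-8859-1 but
the trademark sign (U+2122) in Windows-1252, so the server may be sending
Windows-1252 data under an iso-8859-1 label. The sketch below uses a
hard-coded byte string as a hypothetical stand-in for the downloaded title
and compares the two decodings with Encode:

```perl
#!/usr/bin/perl

use strict;
use warnings;

use Encode qw(decode);

binmode STDOUT, ":utf8";

# Hypothetical stand-in for the bytes returned by get(): 0x99 follows
# "FORMULA 1", 0xF1 is n-tilde and 0xF3 is o-acute.
my $bytes = "FORMULA 1\x99 Gran Premio de Espa\xF1a Telef\xF3nica 2007";

# Decode the same bytes under both encodings.
my $latin1 = decode( 'iso-8859-1', $bytes );
my $cp1252 = decode( 'cp1252',     $bytes );

# Show the code point that byte 0x99 (offset 9) maps to in each case.
printf "iso-8859-1: U+%04X\n", ord substr( $latin1, 9, 1 );
printf "cp1252:     U+%04X\n", ord substr( $cp1252, 9, 1 );

print $cp1252, "\n";
```

If that assumption holds, decoding with 'cp1252' instead of 'iso-8859-1'
would make the manual s/\x{99}/\x{2122}/ substitution unnecessary, since
the two encodings differ only in the 0x80-0x9F range.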

I am using:
This is perl, v5.8.8 built for x86_64-linux-gnu-thread-multi

on Debian unstable:
Linux hanako 2.6.18-4-amd64 #1 SMP Mon Mar 26 11:36:53 CEST 2007 x86_64 GNU/Linux



