I am working with an application running on Solaris in which I extract a couple of hundred UTF-8 MARC records from an ILS, pull out some basic citation data, and write it to a UTF-8 HTML file. I am using MARC::Record 2.0 and Encode 2.12, and I open my output file thus:

  open(OUTP, ">:encoding(UTF-8)", $outFileName)
    or die "Cannot open new output file $outFileName: $!\n";

I am extracting the data in a perfectly straightforward way, something like this:

  my @flds = $mr->field('245');
  foreach my $fld_occ (@flds) {
    my $val = $fld_occ->subfield('a');   # may be undef if $a is absent
    $cite .= $val if defined $val;
  }
  print OUTP "$cite\n";

When I do this I get a number of error messages such as:
"\x{00ce}" does not map to utf8 at myprogram.pl line xxx.
and in the output file a hex escape appears instead of the correct character. This happens with Greek characters but also with perfectly ordinary Latin ones.
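
(If it would help to pin down which records are involved, I can trap the warnings at the point of the print, along the lines of the sketch below. Note the explicit flush: the encoding layer buffers output, so the complaint may only surface when the buffer is actually written.)

  use IO::Handle;    # provides the flush method on the bareword handle
  {
    # sketch only: tag each warning with the record's control number
    local $SIG{__WARN__} = sub {
      my $id = $mr->field('001') ? $mr->field('001')->data : '(no 001)';
      warn "record $id: $_[0]";
    };
    print OUTP "$cite\n";
    OUTP->flush;   # make the encoding layer run while the handler is in scope
  }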

There is nothing wrong with the UTF-8 encoding in the input data. The data displays fine in the ILS, and when I hand-check the encoding in the MARC record, it's correct.
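
(For reference, a programmatic cross-check with the same Encode module, along these lines, tests whether a record's raw bytes are well-formed UTF-8; just a sketch, with $mr a MARC::Record object as above:)

  use Encode qw(decode FB_CROAK);
  my $raw = $mr->as_usmarc();   # the record's raw transmission-format bytes
  eval { decode('UTF-8', $raw, FB_CROAK) };   # FB_CROAK dies at the first bad byte sequence
  print STDERR "Bad UTF-8 in record: $@" if $@;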

What's more, if I take a _smaller_ subset of records (say about 50 records) for my input file, all the data gets printed with no error messages and with the correct characters in the output. Then again, if I take a slightly larger subset, I get errors again, but not necessarily the same ones with the same records.

Does anyone have any ideas about what's going on here? I have various data files and outputs if anybody wants to take a closer look.

Thanks very much.

Ron

Ron Davies
Av. Baden-Powell 1  Bte 2, 1200 Brussels, Belgium
Email:  ron(at)rondavies.be
Tel:    +32 (0)2 770 33 51
GSM:    +32 (0)484 502 393 
