I am working with an application running on Solaris in which I extract a couple of hundred UTF-8 MARC records from an ILS, pull out some basic citation data, and write it to a UTF-8 HTML file. I am using MARC::Record 2.0 and Encode 2.12, and I open my output file thus:

  open(OUTP, ">:encoding(UTF-8)", $outFileName)
    or die "Cannot open new output file $outFileName: $!\n";

I am extracting the data in a perfectly straightforward way, something like this:

  my @flds = $mr->field('245');
  foreach my $fld_occ (@flds) {
    my $val = $fld_occ->subfield('a');   # may be undef if $a is absent
    $cite .= $val if defined $val;
  }
  print OUTP "$cite\n";

When I do this I get a number of error messages such as:
"\x{00ce}" does not map to utf8 at myprogram.pl line xxx.
and in the output file a hex escape appears instead of the correct character. This happens with Greek characters but also with perfectly ordinary Latin ones.
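
(If it would help to pin down which records are involved, I can trap the warnings at the point of the print, along the lines of the sketch below. Note the explicit flush: the encoding layer buffers output, so the complaint may only surface when the buffer is actually written.)

  use IO::Handle;    # provides the flush method on the bareword handle
  {
    # sketch only: tag each warning with the record's control number
    local $SIG{__WARN__} = sub {
      my $id = $mr->field('001') ? $mr->field('001')->data : '(no 001)';
      warn "record $id: $_[0]";
    };
    print OUTP "$cite\n";
    OUTP->flush;   # make the encoding layer run while the handler is in scope
  }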

There is nothing wrong with the UTF-8 encoding in the input data. The data displays fine in the ILS, and when I hand-check the encoding in the MARC record, it's correct.
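
(For reference, a programmatic cross-check with the same Encode module, along these lines, tests whether a record's raw bytes are well-formed UTF-8; just a sketch, with $mr a MARC::Record object as above:)

  use Encode qw(decode FB_CROAK);
  my $raw = $mr->as_usmarc();   # the record's raw transmission-format bytes
  eval { decode('UTF-8', $raw, FB_CROAK) };   # FB_CROAK dies at the first bad byte sequence
  print STDERR "Bad UTF-8 in record: $@" if $@;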

What's more, if I take a _smaller_ subset of records (say about 50 records) for my input file, all the data gets printed with no error messages and with the correct characters in the output. Then again, if I take a slightly larger subset, I get errors again, but not necessarily the same ones with the same records.

Does anyone have any ideas about what's going on here? I have various data files and outputs if anybody wants to take a closer look.

Thanks very much.

Ron

Ron Davies
Av. Baden-Powell 1  Bte 2, 1200 Brussels, Belgium
Email:  ron(at)rondavies.be
Tel:    +32 (0)2 770 33 51
GSM:    +32 (0)484 502 393 
