I am working with an application running on Solaris where I am extracting a
couple of hundred UTF-8 MARC records from an ILS, pulling out some basic
citation data and writing it to a UTF-8 HTML file. I am using
MARC::Record 2.0 and Encode 2.12, and I open my output file with an
encoding layer like this:
open(my $outp, '>:encoding(UTF-8)', $outFileName)
    or die "Cannot open new output file $outFileName: $!\n";
I am extracting the data in a perfectly straightforward way, something like
this:
my @flds = $mr->field('245');
foreach my $fld_occ (@flds) {
    # subfield() returns undef when there is no $a, so guard the append
    my $val = $fld_occ->subfield('a');
    $cite .= $val if defined $val;
}
print $outp "$cite\n";
When I do this I get a number of error messages such as:
"\x{00ce}" does not map to utf8 at myprogram.pl line xxx.
and in the output file, instead of the correct character, there is a
hexadecimal escape sequence like the one in the message. This happens with
Greek characters but also with perfectly ordinary Latin characters.
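To poke at this, I can at least check whether the values coming back from
subfield() are already decoded characters or still raw bytes; this only
inspects Perl's internal UTF8 flag, so it is a hint rather than proof:

use Encode ();

# Is the UTF8 flag set on the value from subfield()?
warn Encode::is_utf8($val) ? "decoded characters\n" : "raw bytes\n";
# Dump the code points actually held in the string
warn join(' ', map { sprintf '%04X', ord } split //, $val), "\n";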
There is nothing wrong with the UTF-8 encoding in the input data: the data
displays fine in the ILS, and when I hand-check the encoding in the MARC
record, it is correct.
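If it helps, the raw input can also be checked mechanically by decoding it
strictly, which throws an exception at the first malformed byte sequence
($rawRecord here stands for the undecoded bytes of one record):

use Encode qw(decode);

my $bytes = $rawRecord;   # decode() may modify its input when CHECK is set
# FB_CROAK makes decode() die on the first invalid UTF-8 sequence
my $chars = decode('UTF-8', $bytes, Encode::FB_CROAK);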
What's more, if I take a _smaller_ subset of records (say about 50) for my
input file, all the data is printed with no error messages and with the
correct characters in the output. Then again, if I take a slightly larger
subset, I get errors again, but not necessarily the same errors on the
same records.
Does anyone have any ideas about what's going on here? I have various data
files and outputs if anybody wants to take a closer look.
Thanks very much.
Ron
Ron Davies
Av. Baden-Powell 1 Bte 2, 1200 Brussels, Belgium
Email: ron(at)rondavies.be
Tel: +32 (0)2 770 33 51
GSM: +32 (0)484 502 393