----- "Leif Andersson" <[EMAIL PROTECTED]> wrote:
> But what is the best way to deal with all those broken UTF8 encodings
> we encounter over and over again when importing MARC records from
> outer space?
>
> As it is now the application dies with something like
> 'utf8 "\xXX" does not map to Unicode at C:/Perl/lib/Encode.pm line
> 166.'
>
> The problem seems to lie in MARC::File::Encode
>
> sub marc_to_utf8 {
> # if there is invalid utf8 data then this will throw an exception
> # let's just hope it's valid :-)
> return decode( 'UTF-8', $_[0], 1 );
> }
>
> Is it possible to introduce a "sloppy mode" switch?
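
On the "sloppy mode" question itself: the third argument to Encode's
decode() is its CHECK value, and the 1 in the quoted code is FB_CROAK,
which is why the import dies. A minimal sketch of a switchable variant
follows. The function name marc_to_utf8_lenient and its $sloppy flag are
hypothetical and not part of MARC::File::Encode; only decode(), FB_CROAK
and FB_DEFAULT come from Encode.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical variant of marc_to_utf8 with an opt-in sloppy mode.
sub marc_to_utf8_lenient {
    # Copy the arguments: with a CHECK value set, decode() may modify
    # its source argument in place.
    my ( $octets, $sloppy ) = @_;

    # FB_CROAK dies on malformed input (the current behaviour);
    # FB_DEFAULT replaces malformed bytes with U+FFFD instead.
    my $check = $sloppy ? Encode::FB_DEFAULT() : Encode::FB_CROAK();
    return decode( 'UTF-8', $octets, $check );
}

# "\xE9" is a Latin-1 e-acute, which is not valid UTF-8 on its own.
my $bad = "caf\xE9 au lait";
my $strict = eval { marc_to_utf8_lenient( $bad, 0 ) };
print "strict mode died\n" unless defined $strict;
my $loose = marc_to_utf8_lenient( $bad, 1 );
print "sloppy mode returned ", length($loose), " characters\n";
```

Sloppy mode keeps the import running, but the substituted characters
are lost data, so it is probably worth logging which records needed it.
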
I think we need to introduce some additional unit tests. A while ago
I encountered some bugs in SAX parsers that cause pretty serious
corruption of records in certain contexts (see:
http://www.mail-archive.com/[email protected]/msg01006.html).
In response I started writing a few such unit tests but never had a
chance to finish them. Here's what I started; maybe someone else has
time to work on it a bit more:
http://kados.org/stuff/marc_tests.tgz
Even if the code is crap, the records are a pretty good sampling of
what exists in real systems and it would be useful to run them
through a roundtrip process as part of standard testing of the
module.
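
To make the roundtrip idea concrete, here is a rough sketch of such a
check. It is not the code in the tarball above. The helper name
roundtrip_failures is hypothetical, and the Encode-based parse and
serialize callbacks are only stand-ins so that the example runs with
core modules; a real test would pass in the MARC parser and writer
under test (for example MARC::Record->new_from_usmarc and as_usmarc,
if that is the path being tested).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: returns the indexes of the records that fail to
# parse or do not come back byte-for-byte after parse -> serialize.
sub roundtrip_failures {
    my ( $parse, $serialize, @raw_records ) = @_;
    my @failed;
    for my $i ( 0 .. $#raw_records ) {
        my $raw = $raw_records[$i];
        # eval traps exceptions so one bad record cannot stop the run.
        my $out = eval { $serialize->( $parse->($raw) ) };
        push @failed, $i unless defined $out && $out eq $raw;
    }
    return @failed;
}

# Stand-in codec built on Encode only (an assumption for illustration).
my $parse = sub {
    my ($octets) = @_;    # copy: decode() may modify its source
    return decode( 'UTF-8', $octets, Encode::FB_CROAK() );
};
my $serialize = sub {
    my ($string) = @_;
    return encode( 'UTF-8', $string );
};

# One valid UTF-8 record and one with a stray Latin-1 byte.
my @failed = roundtrip_failures( $parse, $serialize,
    "caf\xC3\xA9", "caf\xE9 au lait" );
print "records failing roundtrip: @failed\n";
```

Reporting indexes rather than dying on the first failure means the
whole sample set gets exercised in a single run.
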
Cheers,
--
Joshua Ferraro
President, Technology
LibLime
SUPPORT FOR OPEN-SOURCE SOFTWARE: migration, training, maintenance, support
Featuring Koha Open-Source ILS
[EMAIL PROTECTED] | Full Demos at http://liblime.com/koha | 1(888)KohaILS