----- "Leif Andersson" <[EMAIL PROTECTED]> wrote:
> But what is the best way to deal with all those broken UTF8 encodings
> we encounter over and over again when importing MARC records from
> outer space?
>
> As it is now the application dies with something like
> 'utf8 "\xXX" does not map to Unicode at C:/Perl/lib/Encode.pm line
> 166.'
>
> The problem seems to lie in MARC::File::Encode
>
> sub marc_to_utf8 {
>     # if there is invalid utf8 data then this will throw an exception
>     # let's just hope it's valid :-)
>     return decode( 'UTF-8', $_[0], 1 );
> }
>
> Is it possible to introduce a "sloppy mode" switch?

I think we need to introduce some additional unit tests. A while ago I encountered some bugs in SAX parsers that cause some pretty serious corruption of records in certain contexts (see: http://www.mail-archive.com/perl4lib@perl.org/msg01006.html).
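On the "sloppy mode" question itself, here is a minimal sketch, assuming the only thing that needs to change is the CHECK argument passed to Encode's decode(). The names $SLOPPY and marc_to_utf8_sketch are hypothetical and not part of MARC::File::Encode. The 1 in the quoted code is Encode::FB_CROAK, which dies on malformed input; Encode::FB_DEFAULT substitutes U+FFFD instead.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical switch (an assumption for illustration only);
# nothing like this exists in MARC::File::Encode.
our $SLOPPY = 0;

# Hypothetical stand-in for marc_to_utf8.
sub marc_to_utf8_sketch {
    # Work on a copy: with FB_CROAK, decode() may modify its input in place.
    my ($octets) = @_;

    # FB_CROAK   : die on malformed UTF-8 (same as CHECK = 1 in the quoted code)
    # FB_DEFAULT : replace each malformed sequence with U+FFFD
    my $check = $SLOPPY ? Encode::FB_DEFAULT() : Encode::FB_CROAK();

    return decode( 'UTF-8', $octets, $check );
}
```

Substituting U+FFFD silently loses the original bytes, so it may be worth logging which records were affected; whether that is acceptable for a given import is a judgment call.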
In response to that I started creating a few such unit tests but never had a chance to finish. Here's what I started, maybe someone else has time to work on it a bit more:

http://kados.org/stuff/marc_tests.tgz

Even if the code is crap, the records are a pretty good sampling of what exists in real systems and it would be useful to run them through a roundtrip process as part of standard testing of the module.

Cheers,

-- 
Joshua Ferraro                       SUPPORT FOR OPEN-SOURCE SOFTWARE
President, Technology       migration, training, maintenance, support
LibLime                                Featuring Koha Open-Source ILS
[EMAIL PROTECTED] |Full Demos at http://liblime.com/koha |1(888)KohaILS