Re: MARC::Record and importing broken UTF8

Joshua M. Ferraro Fri, 02 Mar 2007 07:25:53 -0800

----- "Leif Andersson" <[EMAIL PROTECTED]> wrote:
> But what is the best way to deal with all those broken UTF8 encodings
> we encounter over and over again when importing MARC records from
> outer space?
> 
> As it is now the application dies with something like
> 'utf8 "\xXX" does not map to Unicode at C:/Perl/lib/Encode.pm line
> 166.'
> 
> The problem seems to lie in MARC::File::Encode
> 
> sub marc_to_utf8 {
>     # if there is invalid utf8 data then this will throw an exception
>     # let's just hope it's valid :-)
>     return decode( 'UTF-8', $_[0], 1 );
> }
> 
> Is it possible to introduce a "sloppy mode" switch?
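
For illustration, a "sloppy mode" along those lines could be sketched roughly as below. This is only a sketch under assumptions: the $SLOPPY switch and the function name marc_to_utf8_sloppy are hypothetical and not part of MARC::File::Encode. It relies on the Encode CHECK argument: the literal 1 in the quoted code is Encode::FB_CROAK, which dies on malformed input, while Encode::FB_DEFAULT substitutes U+FFFD for each malformed sequence.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical package-level switch; NOT part of MARC::File::Encode.
# 0 = strict (die on malformed UTF-8), 1 = sloppy (substitute U+FFFD).
our $SLOPPY = 0;

# Hypothetical variant of marc_to_utf8 that honours the switch above.
sub marc_to_utf8_sloppy {
    # Work on a copy: decode() may modify its input when CHECK is set.
    my ($octets) = @_;

    my $check = $SLOPPY
        ? Encode::FB_DEFAULT()    # replace malformed bytes with U+FFFD
        : Encode::FB_CROAK();     # same as the literal 1: die on error

    return Encode::decode( 'UTF-8', $octets, $check );
}
```

With $SLOPPY set to a true value, input containing a stray Latin-1 byte such as "caf\xE9" decodes with U+FFFD in place of the bad byte instead of killing the import. Silent substitution loses data, so an importer would probably also want to log which records were affected.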

I think we need to introduce some additional unit tests. A while ago
I encountered some bugs in SAX parsers that caused some pretty
serious corruption of records in certain contexts (see:
http://www.mail-archive.com/perl4lib@perl.org/msg01006.html).


In response to that I started creating a few such unit tests but
never had a chance to finish. Here's what I started; maybe someone
else has time to work on it a bit more:

http://kados.org/stuff/marc_tests.tgz

Even if the code is crap, the records are a pretty good sampling of
what exists in real systems and it would be useful to run them
through a roundtrip process as part of standard testing of the 
module.

Cheers,

-- 
Joshua Ferraro                       SUPPORT FOR OPEN-SOURCE SOFTWARE
President, Technology       migration, training, maintenance, support
LibLime                                Featuring Koha Open-Source ILS
[EMAIL PROTECTED] |Full Demos at http://liblime.com/koha |1(888)KohaILS
