Re: UTF-8 encoding errors

Mike Rylander Wed, 07 Mar 2007 16:58:51 -0800

On 3/7/07, Bryan Baldus <[EMAIL PROTECTED]> wrote:

On Wednesday, March 07, 2007 2:34 PM, Ron Davies wrote:
>When I do this I get a number of error messages such as:
>"\x{00ce}" does not map to utf8 at myprogram.pl line xxx.
>and in the output file instead of the correct character there is a hex
>encoding. This happens with Greek but also perfectly ordinary Latin
>characters.


I can't offer any advice, but I am experiencing what may be similar
difficulties. I finally had a chance to get MARC::Charset and
MARC::File::XML installed and working, so I could try out xml2marc and
marc2xml. After creating a test record containing a field with diacritics, I
tried using marc2xml followed by xml2marc, hoping to end up with records
matching the original. marc2xml appears to have successfully translated the
raw MARC into MARCXML (it left the leader unchanged--no update to the record
length, though it did set byte 9 to 'a' for Unicode). Unfortunately,
attempting to use xml2marc on any of the .xml files I have results in an
empty file. In some cases I get a message:


Two things here: 1) there will be a new version of MARC::Charset out
soon-ish which is more forgiving and has mechanisms for dealing with
random (identifiable) encodings and 2) I'm not sure that the leader's
record-length field means anything in the context of MARCXML ... but
if anyone can think of some semantics for that I'll gladly implement
it.
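
For anyone following along, the two leader positions discussed above can be inspected directly. The sketch below is Python rather than Perl and is not part of MARC::Record or MARC::File::XML; parse_leader, LeaderInfo, and the sample leader string are made up for illustration. It assumes the MARC 21 layout: a 24-character leader with the record length in positions 00-04, the character coding scheme in position 09 ('a' for UCS/Unicode, blank for MARC-8), and the base address of data in positions 12-16.

```python
from dataclasses import dataclass

LEADER_LENGTH = 24  # a MARC 21 leader is always 24 characters


@dataclass
class LeaderInfo:
    """Hypothetical container for the leader fields discussed in this thread."""
    record_length: int   # positions 00-04
    coding_scheme: str   # position 09: 'a' = UCS/Unicode, ' ' = MARC-8
    base_address: int    # positions 12-16


def parse_leader(leader: str) -> LeaderInfo:
    """Extract record length, coding scheme, and base address from a leader."""
    if len(leader) != LEADER_LENGTH:
        raise ValueError(f"leader must be {LEADER_LENGTH} characters, got {len(leader)}")
    return LeaderInfo(
        record_length=int(leader[0:5]),
        coding_scheme=leader[9],
        base_address=int(leader[12:17]),
    )


# Made-up sample leader, not taken from any record in this thread.
SAMPLE_LEADER = "00714cam a2200205 a 4500"

info = parse_leader(SAMPLE_LEADER)
print(info.record_length, repr(info.coding_scheme), info.base_address)
```

In MARCXML there is no byte stream for the record length or base address to describe, which is presumably why a converter can leave them untouched; position 09 is the one that still carries information there.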


"Cannot decode string with wide characters at C:/Perl/lib/Encode.pm line
184, <GEN1> line 1."

In other cases, I get no error messages, but still have an empty file. I
have tried a number of variations in the starting file: marc8.mrc->utf8.xml;
utf8.mrc->utf8.xml; MarcEdit-produced .xml->Perl-produced .mrc.

My system: Windows XP; ActivePerl v5.8.2 built for MSWin32-x86-multi-thread
(Binary build 808)
MARC::Record: 2.0
Encode: 1.9801

Are these problems related to the age of my Perl or Encode?


This is almost certainly related to the issue that Josh has seen with,
um, sub-par SAX parsers.  He may be able to shed a little more light
on that, as I use the LibXML parser exclusively (and I've never had
issues getting utf-8 out...).

Josh? (he's currently on a plane, so it may be tomorrow...)


(If I remember correctly, before switching to MARC::Record 2.0, using
MARC::Record 1.39_1 and xml2marc resulted in records being output but the
field containing diacritics was mangled/deleted/replaced with bad data.)

Thank you for your assistance,

Bryan Baldus
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://home.inwave.com/eija



--
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org
