Re: Character sets - kind of solved?

Mike Rylander Sat, 04 Dec 2004 17:25:31 -0800

I've run into some record encoding issues myself, though not the
problem from below.  In any case, this got me thinking about the
current state of MARC::File::XML, specifically that it could not
handle MARC8 encoded records.


I submitted a patch a while back to hack around this, but that just
lets us get the MARC records into well formed XML.  Basically, it just
lets you set the encoding on the XML to something that has embedded
8-bit characters, like ISO-8859-1, aka LATIN1.

But that is far from optimal, since the data is being misinterpreted. 
So I took a look at using MARC::Charset inside MARC::File::XML, and
I've got a working patch that correctly transcodes records from
USMARC(MARC-8) to MARC21slim(UTF8) and back again.
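
To make the "misinterpreted" point concrete, here is a rough sketch in Python (not Perl, and not the code in the patch). It takes the name bytes from the quoted thread below and reads them first as LATIN1, then through a toy MARC-8 mapping. The table holds only the two combining codes named in the thread (E5 macron, F2 dot below); MARC8_COMBINING and toy_marc8_to_unicode are made-up names for illustration, and MARC::Charset covers far more than this.

```python
import unicodedata

# Only the two MARC-8 combining codes mentioned in the thread.
# A real converter such as MARC::Charset handles the full repertoire.
MARC8_COMBINING = {
    0xE5: "\u0304",  # combining macron
    0xF2: "\u0323",  # combining dot below
}


def toy_marc8_to_unicode(data: bytes) -> str:
    """Transcode ASCII plus the two combining codes above (illustration only)."""
    out = []
    pending = []  # MARC-8 puts combining marks BEFORE the base character
    for byte in data:
        if byte in MARC8_COMBINING:
            pending.append(MARC8_COMBINING[byte])
        elif byte < 0x80:
            out.append(chr(byte))
            out.extend(pending)  # Unicode puts combining marks AFTER the base
            pending.clear()
        else:
            raise ValueError(f"byte 0x{byte:02x} is outside this toy table")
    if pending:
        raise ValueError("combining mark with no base character")
    return unicodedata.normalize("NFC", "".join(out))


name = b"Bis\xe5a\xf2t\xe5i, Mu\xf2hammad."

as_latin1 = name.decode("latin-1")        # well formed, but the wrong letters
as_unicode = toy_marc8_to_unicode(name)   # the intended diacritics

print(as_latin1)
print(as_unicode)
```

The LATIN1 reading yields valid XML but turns the combining marks into unrelated accented letters, while the second reading attaches each mark to the base character that follows it.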

It's attached below, if anyone would be so kind as to test it.  If all
goes well we should be able to actually use MARC::File::XML in
production.  If you do decide to test it, it requires MARC::Charset.

One (perhaps large) caveat: as of now all USMARC records are assumed
to be MARC-8 encoded, and the data within is always run through
to_utf8/to_marc8 during XML export/import.  What that means is that
the records from the problem below (containing UTF8 directly in the
data, without an encoding marker) would probably break during export
to XML.
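
One possible way around that caveat, sketched in Python purely as an illustration (the patch does not do this): check whether the field data already decodes as strict UTF-8 before running it through the MARC-8 converter. The function name has_utf8_multibyte is invented here. It is only a heuristic, since a run of MARC-8 bytes could in principle happen to form a valid UTF-8 sequence.

```python
def has_utf8_multibyte(data: bytes) -> bool:
    """Heuristic: True if data has non-ASCII bytes and is valid strict UTF-8."""
    if all(b < 0x80 for b in data):
        # Pure ASCII reads the same either way, so there is nothing to decide.
        return False
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The MARC-8 bytes from the quoted thread, and the same name as UTF-8.
marc8_name = b"Bis\xe5a\xf2t\xe5i, Mu\xf2hammad."
utf8_name = "Bis\u0101\u1e6d\u012b, Mu\u1e25ammad.".encode("utf-8")

print(has_utf8_multibyte(marc8_name))
print(has_utf8_multibyte(utf8_name))
```

The MARC-8 sample fails the check because E5 and F2 would be UTF-8 lead bytes, yet the bytes after them are plain ASCII letters rather than continuation bytes.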

The attached tarball contains a patched XML.pm and SAX.pm.  Replace
your current MARC/File/XML.pm and MARC/File/SAX.pm with those and you
should be good to go.  I've also included the scripts I used to test
and one of my old MARC8 encoded records.  http://redlightgreen.com
confirms that the illustrator's name is properly transcoded.
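
For anyone who wants to check the replacement-character diagnosis in the quoted message below, here is a small Python sketch (again only an illustration, not part of the patch). It assumes the damage came from some tool leniently decoding the MARC-8 bytes as UTF-8, which is one plausible cause and not a confirmed one, and it redoes the bit arithmetic from footnote [3]. The helper name decode_three_octets is invented here.

```python
marc8_name = b"Bis\xe5a\xf2t\xe5i, Mu\xf2hammad."

# A lenient UTF-8 decoder swaps each undecodable byte for U+FFFD;
# re-encoding then writes U+FFFD out as the three bytes ef bf bd.
damaged = marc8_name.decode("utf-8", errors="replace").encode("utf-8")


def decode_three_octets(b0: int, b1: int, b2: int) -> int:
    """Join the payload bits of a 1110xxxx 10xxxxxx 10xxxxxx sequence."""
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


code_point = decode_three_octets(0xEF, 0xBF, 0xBD)

print(damaged)
print(f"U+{code_point:04X}")
```

Both E5 and F2 collapse to the same three bytes, which matches the observation that two different MARC-8 codes ended up as one identical sequence.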

On Fri, 3 Dec 2004 17:53:32 -0600, Doran, Michael D <[EMAIL PROTECTED]> wrote:
> First off, Ashley's suggestion that the original encoding was likely
> MARC-8 is correct.  The author's Arabic name, transliterated into the
> Latin alphabet, should be "Bis{latin small letter a with macron}{latin
> small letter t with dot below}{latin small letter i with macron},
> Mu{latin small letter h with dot below}ammad."  I am basing this on
> MARC-21 records that can be seen in UCLA's online catalog [1].  So, if
> the above name is encoded in MARC-8 then the underlying code would match
> John's original code points [2]:
>  > >> Looking at the name with a hex editor, it gives, with hex values
>  > >> in curly brackets,
>  > >> "Bis{e5}a{f2}t{e5}i, Mu{f2}hammad."
> 
> Then the question becomes: "What happened?"
> 
>  > >> the name now appears as
>  > >> "Bis{ef bf bd}a{ef bf bd}t{ef bf bd}i, Mu{ef bf bd}hammad."
> 
> The fact that one byte turned into three bytes, suggests UTF-8 encoding.
> And the fact that *both* MARC-8 combining characters (i.e. "e5" and
> "f2") now appear as the *same* combination of characters (i.e. "ef bf
> bd") suggests that it was not an encoding translation from one coded
> character set to the equivalent codepoint in another character set.  If
> we assume UTF-8 and convert UTF-8 "ef bf bd" to its Unicode code point,
> we get U+FFFD [3].  If we look up U+FFFD we see that it is the
> "REPLACEMENT CHARACTER" [4].
> 
> Since MARC::Record (obviously) wouldn't object to the original MARC-8
> character encoding, I'm guessing that sometime *after* processing the
> record with MARC::Record that it was either moved to, or viewed in, a
> client/platform/environment that was not MARC-8 savvy (which is pretty
> much everything) and that the client/platform/environment, not
> recognizing the hex e5 and f2 as valid character encodings, replaced
> them with the generic replacement character for that
> client/platform/environment.
> 
> So I'm thinking that we can rule out MARC::Record and look closer at
> what happened to the data subsequent to MARC::Record processing.  That's
> my guess anyway, and I'm sticking with it until I hear a better story.
> ;-)
> 
> [1] UCLA's Voyager ILMS has been upgraded to a Unicode version, and is
> able to display the characters accurately.  My assumption is that the
> author in the links below is the one in question.
> See for example (looking at the title field, rather than the underlined
> author/name field):
>  http://catalog.library.ucla.edu/cgi-bin/Pwebrecon.cgi?bbid=603048
>  http://catalog.library.ucla.edu/cgi-bin/Pwebrecon.cgi?bbid=603049
>  http://catalog.library.ucla.edu/cgi-bin/Pwebrecon.cgi?bbid=5053287
>  http://catalog.library.ucla.edu/cgi-bin/Pwebrecon.cgi?bbid=4490052
> 
> [2] In MARC-8, combining diacritic characters precede the base
> character, and as Ashley pointed out, E5 is "macron" and F2 is "dot
> below."
> 
> [3] hex "ef bf bd" = binary "11101111 10111111 10111101"
> A three-octet UTF-8 character has the format of 1110xxxx 10xxxxxx
> 10xxxxxx, with the "x" positions being the significant values in
> determining the Unicode code point.  When we concatenate those x
> position values from the above binary code, we get 1111111111111101,
> which, converted to hex, is FFFD.
> 
> [4] See:
> http://rocky.uta.edu/doran/urdu/search.cgi?char_set=unicode&char_type=hex&char_value=fffd
>     (or just go to http://rocky.uta.edu/doran/urdu/search.cgi and plug
> in fffd)
> 
> -- Michael
> 
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 cell
> # [EMAIL PROTECTED]
> # http://rocky.uta.edu/doran/
> 
> > -----Original Message-----
> > From: Ashley Sanders [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, November 24, 2004 2:23 AM
> > Cc: [EMAIL PROTECTED]
> > Subject: Re: Character sets
> >
> > Ed Summers wrote:
> > > On Tue, Nov 23, 2004 at 04:10:05PM -0600, John Hammer wrote:
> > >
> > >>I have a character problem that I hope someone can help me with. In
> > >>a MARC record I am modifying using MARC::Record, one of the names
> > >>contains letters with diacritics. Looking at the name with a hex
> > >>editor, it gives, with hex values in curly brackets,
> > >>"Bis{e5}a{f2}t{e5}i, Mu{f2}hammad." After running through
> > >>MARC::Record, the name now appears as
> > >>"Bis{ef bf bd}a{ef bf bd}t{ef bf bd}i, Mu{ef bf bd}hammad."
> > >
> > >
> > > That's pretty odd. Any chance you could send me the MARC record? At
> > > this time MARC::Record does not play nicely with Unicode (UTF8).
> > >
> > >     http://rt.cpan.org/NoAuth/Bug.html?id=3707
> >
> > It is possible they are MARC-8 characters rather than utf-8. In MARC-8
> > E5 is "macron" and F2 is "dot below." Is MARC::Record trying to treat
> > them as Unicode when in fact they are MARC-8?
> >
> > Ashley.
> >
> > --
> > Ashley Sanders [EMAIL PROTECTED]
> > Copac http://copac.ac.uk -- A MIMAS service funded by JISC
> >
> 


-- 
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org

marc-xml-fixup.tgz
Description: Binary data
