RE: Character sets - kind of solved?

Doran, Michael D Mon, 06 Dec 2004 06:52:29 -0800

> One (perhaps large) caveat: as of now all USMARC records are assumed
> to be MARC-8 encoded, and the data within is always run through
> to_utf8/to_marc8 during XML export/import.


The MARC-21 standard allows for either MARC-8 or UCS/Unicode.  Position
09 in the record leader indicates the character encoding: a "blank" for
MARC-8, and an "a" for UCS/Unicode.  Perhaps your patch could test for
this and then only apply the transformation when required.  Note: I
believe the leader itself is limited to characters in the ASCII range,
so you wouldn't have to know the encoding of the record prior to parsing
the leader.
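
A minimal sketch of that leader test, written in Python rather than the
Perl of the actual patch.  The function name leader_encoding and the
sample leaders are made up for illustration; none of this is
MARC::File::XML code.

```python
LEADER_LENGTH = 24          # the MARC 21 leader is a fixed 24 bytes
CODING_SCHEME_POS = 9       # leader position 09, counting from 0


def leader_encoding(record: bytes) -> str:
    """Return 'marc-8' or 'utf-8' based on leader position 09.

    Hypothetical helper, for illustration only.
    """
    leader = record[:LEADER_LENGTH]
    if len(leader) < LEADER_LENGTH:
        raise ValueError("record is shorter than a MARC leader")
    flag = leader[CODING_SCHEME_POS:CODING_SCHEME_POS + 1]
    if flag == b" ":
        return "marc-8"
    if flag == b"a":
        return "utf-8"
    raise ValueError(f"unexpected value in leader/09: {flag!r}")


# Made-up sample leaders that differ only at position 09.
marc8_leader = b"00714cam " + b" " + b"2200205 a 4500"
unicode_leader = b"00714cam " + b"a" + b"2200205 a 4500"

print(leader_encoding(marc8_leader))
print(leader_encoding(unicode_leader))
```

With a check like this, the to_utf8/to_marc8 step would run only when
the result is 'marc-8'.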

> What that means is that
> the records from the problem below (containing UTF8 directly in the
> data, without an encoding marker) would probably break during export
> to XML.

The original record from John Hammer did not contain UTF-8; it contained
MARC-8.  I believe the fact that the combining MARC-8 characters were
replaced by a generic replacement character indicates only that the app
he was using to view the data (after processing by MARC::Record) did not
accept hex E5 and F2, encoded as single octets, as valid characters.
That app's character set was apparently Unicode (UTF-8), so E5 and F2
were each replaced by U+FFFD.  That's the long way of saying that the
patch should work fine in his case.  :-)
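
To illustrate, here is a Python sketch.  It is not what John's viewer
actually ran (which app that was is unknown); it only assumes a decoder
that reads the data as UTF-8 and substitutes U+FFFD for invalid octets.
Under that assumption, his original bytes turn into the bytes he
reported.

```python
# John's original bytes: the MARC-8 combining macron (E5) and dot below
# (F2) each precede the base letter they modify.
original = b"Bis\xe5a\xf2t\xe5i, Mu\xf2hammad."

# Assumption: the viewing app decoded the data as UTF-8 and substituted
# U+FFFD for every octet it could not interpret.
decoded = original.decode("utf-8", errors="replace")

# Writing the result back out as UTF-8 turns each U+FFFD into EF BF BD.
reencoded = decoded.encode("utf-8")

print(decoded.count("\ufffd"))   # number of replacement characters
print(reencoded.hex(" "))        # hex dump of the re-encoded name
```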

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 

> -----Original Message-----
> From: Mike Rylander [mailto:[EMAIL PROTECTED] 
> Sent: Saturday, December 04, 2004 1:31 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Character sets - kind of solved?
> 
> I've run into some record encoding issues myself, though not the
> problem from below.  In any case, this got me thinking about the
> current state of MARC::File::XML, specifically that it could not
> handle MARC8 encoded records.
> 
> I submitted a patch a while back to hack around this, but that just
> lets us get the MARC records into well formed XML.  Basically, it just
> lets you set the encoding on the XML to something that has embedded
> 8-bit characters, like ISO-8859-1, aka LATIN1.
> 
> But that is far from optimal, since the data is being misinterpreted. 
> So I took a look at using MARC::Charset inside MARC::File::XML, and
> I've got a working patch that correctly transcodes records from
> USMARC(MARC-8) to MARC21slim(UTF8) and back again.
> 
> It's attached below, if anyone would be so kind as to test it.  If all
> goes well we should be able to actually use MARC::File::XML in
> production.  If you do decide to test it, it requires MARC::Charset.
> 
> One (perhaps large) caveat: as of now all USMARC records are assumed
> to be MARC-8 encoded, and the data within is always run through
> to_utf8/to_marc8 during XML export/import.  What that means is that
> the records from the problem below (containing UTF8 directly in the
> data, without an encoding marker) would probably break during export
> to XML.
> 
> The attached tarball contains a patched XML.pm and SAX.pm.  Replace
> your current MARC/File/XML.pm and MARC/File/SAX.pm with those and you
> should be good to go.  I've also included the scripts I used to test
> and one of my old MARC8 encoded records.  http://redlightgreen.com
> confirms that the illustrator's name is properly transcoded.
> 
> On Fri, 3 Dec 2004 17:53:32 -0600, Doran, Michael D 
> <[EMAIL PROTECTED]> wrote:
> > First off, Ashley's suggestion that the original encoding was likely
> > MARC-8 is correct.  The author's Arabic name, transliterated into the
> > Latin alphabet, should be "Bis{latin small letter a with macron}{latin
> > small letter t with dot below}{latin small letter i with macron},
> > Mu{latin small letter h with dot below}ammad."  I am basing this on
> > MARC-21 records that can be seen in UCLA's online catalog [1].  So, if
> > the above name is encoded in MARC-8 then the underlying code would match
> > John's original code points [2]:
> >  > >> Looking at the name with a hex editor, it gives, with hex values
> >  > >> in curly brackets,
> >  > >> "Bis{e5}a{f2}t{e5}i, Mu{f2}hammad."
> > 
> > Then the question becomes: "What happened?"
> > 
> >  > >> the name now appears as
> >  > >> "Bis{ef bf bd}a{ef bf bd}t{ef bf bd}i, Mu{ef bf bd}hammad."
> > 
> > The fact that one byte turned into three bytes suggests UTF-8 encoding.
> > And the fact that *both* MARC-8 combining characters (i.e. "e5" and
> > "f2") now appear as the *same* combination of characters (i.e. "ef bf
> > bd") suggests that it was not an encoding translation from one coded
> > character set to the equivalent codepoint in another character set.  If
> > we assume UTF-8 and convert UTF-8 "ef bf bd" to its Unicode code point,
> > we get U+FFFD [3].  If we look up U+FFFD we see that it is the
> > "REPLACEMENT CHARACTER" [4].
> > 
> > Since MARC::Record (obviously) wouldn't object to the original MARC-8
> > character encoding, I'm guessing that sometime *after* processing the
> > record with MARC::Record it was either moved to, or viewed in, a
> > client/platform/environment that was not MARC-8 savvy (which is pretty
> > much everything) and that the client/platform/environment, not
> > recognizing the hex e5 and f2 as valid character encodings, replaced
> > them with the generic replacement character for that
> > client/platform/environment.
> > 
> > So I'm thinking that we can rule out MARC::Record and look closer at
> > what happened to the data subsequent to MARC::Record processing.  That's
> > my guess anyway, and I'm sticking with it until I hear a better story.
> > ;-)
> > 
> > [1] UCLA's Voyager ILMS has been upgraded to a Unicode version, and is
> > able to display the characters accurately.  My assumption is that the
> > author in the links below is the one in question.
> > See for example (looking at the title field, rather than the underlined
> > author/name field):
> >  http://catalog.library.ucla.edu/cgi-bin/Pwebrecon.cgi?bbid=603048
> >  http://catalog.library.ucla.edu/cgi-bin/Pwebrecon.cgi?bbid=603049
> >  http://catalog.library.ucla.edu/cgi-bin/Pwebrecon.cgi?bbid=5053287
> >  http://catalog.library.ucla.edu/cgi-bin/Pwebrecon.cgi?bbid=4490052
> > 
> > [2] In MARC-8, combining diacritic characters precede the base
> > character, and as Ashley pointed out, E5 is "macron" and F2 is "dot
> > below."
> > 
> > [3] hex "ef bf bd" = binary "11101111 10111111 10111101"
> > A three-octet UTF-8 character has the format of 1110xxxx 10xxxxxx
> > 10xxxxxx, with the "x" positions being the significant values in
> > determining the Unicode code point.  When we concatenate those x
> > position values from the above binary code, we get 1111111111111101,
> > which, converted to hex, is FFFD.
> > 
> > [4] See:
> > http://rocky.uta.edu/doran/urdu/search.cgi?char_set=unicode&char_type=hex&char_value=fffd
> >     (or just go to http://rocky.uta.edu/doran/urdu/search.cgi and plug
> >     in fffd)
> > 
> > -- Michael
> > 
> > # Michael Doran, Systems Librarian
> > # University of Texas at Arlington
> > # 817-272-5326 office
> > # 817-688-1926 cell
> > # [EMAIL PROTECTED]
> > # http://rocky.uta.edu/doran/
> > 
> > > -----Original Message-----
> > > From: Ashley Sanders [mailto:[EMAIL PROTECTED]
> > > Sent: Wednesday, November 24, 2004 2:23 AM
> > > Cc: [EMAIL PROTECTED]
> > > Subject: Re: Character sets
> > >
> > > Ed Summers wrote:
> > > > On Tue, Nov 23, 2004 at 04:10:05PM -0600, John Hammer wrote:
> > > >
> > > >>I have a character problem that I hope someone can help me with. In
> > > >>a MARC record I am modifying using MARC::Record, one of the names
> > > >>contains letters with diacritics. Looking at the name with a hex
> > > >>editor, it gives, with hex values in curly brackets,
> > > >>"Bis{e5}a{f2}t{e5}i, Mu{f2}hammad." After running through
> > > >>MARC::Record, the name now appears as
> > > >>"Bis{ef bf bd}a{ef bf bd}t{ef bf bd}i, Mu{ef bf bd}hammad."
> > > >
> > > >
> > > > That's pretty odd. Any chance you could send me the MARC record? At
> > > > this time MARC::Record does not play nicely with Unicode (UTF8).
> > > >
> > > >     http://rt.cpan.org/NoAuth/Bug.html?id=3707
> > >
> > > It is possible they are MARC-8 characters rather than utf-8. In MARC-8
> > > E5 is "macron" and F2 is "dot below." Is MARC::Record trying to treat
> > > them as Unicode when in fact they are MARC-8?
> > >
> > > Ashley.
> > >
> > > --
> > > Ashley Sanders [EMAIL PROTECTED]
> > > Copac http://copac.ac.uk -- A MIMAS service funded by JISC
> > >
> > 
> 
> 
> -- 
> Mike Rylander
> [EMAIL PROTECTED]
> GPLS -- PINES Development
> Database Developer
> http://open-ils.org
>
