> One (perhaps large) caveat: as of now all USMARC records are assumed
> to be MARC-8 encoded, and the data within is always run through
> to_utf8/to_marc8 during XML export/import.

The MARC-21 standard allows for either MARC-8 or UCS/Unicode. Position
09 in the record leader indicates the character encoding: a blank for
MARC-8, and an "a" for UCS/Unicode. Perhaps your patch could test for
this and apply the transformation only when required. Note: I believe
the leader itself is limited to characters in the ASCII range, so you
wouldn't have to know the encoding of the record prior to parsing the
leader.

> What that means is that
> the records from the problem below (containing UTF8 directly in the
> data, without an encoding marker) would probably break during export
> to XML.

The original record from John Hammer did not contain UTF-8; it
contained MARC-8. I believe the fact that the combining MARC-8
characters were replaced by a generic replacement character only
indicates that the app he was using to view the data (after processing
by MARC::Record) was using a character set in which hex E5 and F2,
encoded as single octets, are not valid characters. That app's
character set was apparently Unicode (UTF-8), and so E5 and F2 were
replaced by U+FFFD.

That's the long way of saying that the patch should work fine in his
case. :-)

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/

> -----Original Message-----
> From: Mike Rylander [mailto:[EMAIL PROTECTED]
> Sent: Saturday, December 04, 2004 1:31 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Character sets - kind of solved?
>
> I've run into some record encoding issues myself, though not the
> problem from below. In any case, this got me thinking about the
> current state of MARC::File::XML, specifically that it could not
> handle MARC8 encoded records.
>
> I submitted a patch a while back to hack around this, but that just
> lets us get the MARC records into well formed XML.
> Basically, it just lets you set the encoding on the XML to something
> that has embedded 8-bit characters, like ISO-8859-1, aka LATIN1.
>
> But that is far from optimal, since the data is being misinterpreted.
> So I took a look at using MARC::Charset inside MARC::File::XML, and
> I've got a working patch that correctly transcodes records from
> USMARC(MARC-8) to MARC21slim(UTF8) and back again.
>
> It's attached below, if anyone would be so kind as to test it. If all
> goes well we should be able to actually use MARC::File::XML in
> production. If you do decide to test it, it requires MARC::Charset.
>
> One (perhaps large) caveat: as of now all USMARC records are assumed
> to be MARC-8 encoded, and the data within is always run through
> to_utf8/to_marc8 during XML export/import. What that means is that
> the records from the problem below (containing UTF8 directly in the
> data, without an encoding marker) would probably break during export
> to XML.
>
> The attached tarball contains a patched XML.pm and SAX.pm. Replace
> your current MARC/File/XML.pm and MARC/File/SAX.pm with those and you
> should be good to go. I've also included the scripts I used to test
> and one of my old MARC8 encoded records. http://redlightgreen.com
> confirms that the illustrator's name is properly transcoded.
>
> On Fri, 3 Dec 2004 17:53:32 -0600, Doran, Michael D
> <[EMAIL PROTECTED]> wrote:
> > First off, Ashley's suggestion that the original encoding was likely
> > MARC-8 is correct. The author's Arabic name, transliterated into the
> > Latin alphabet, should be "Bis{latin small letter a with
> > macron}{latin small letter t with dot below}{latin small letter i
> > with macron}, Mu{latin small letter h with dot below}ammad." I am
> > basing this on MARC-21 records that can be seen in UCLA's online
> > catalog [1].
> > So, if the above name is encoded in MARC-8 then the underlying code
> > would match John's original code points [2]:
> >
> > >> Looking at the name with a hex editor, it gives, with hex values
> > >> in curly brackets,
> > >> "Bis{e5}a{f2}t{e5}i, Mu{f2}hammad."
> >
> > Then the question becomes: "What happened?"
> >
> > >> the name now appears as
> > >> "Bis{ef bf bd}a{ef bf bd}t{ef bf bd}i, Mu{ef bf bd}hammad."
> >
> > The fact that one byte turned into three bytes suggests UTF-8
> > encoding. And the fact that *both* MARC-8 combining characters (i.e.
> > "e5" and "f2") now appear as the *same* combination of characters
> > (i.e. "ef bf bd") suggests that it was not an encoding translation
> > from one coded character set to the equivalent codepoint in another
> > character set. If we assume UTF-8 and convert UTF-8 "ef bf bd" to
> > its Unicode code point, we get U+FFFD [3]. If we look up U+FFFD we
> > see that it is the "REPLACEMENT CHARACTER" [4].
> >
> > Since MARC::Record (obviously) wouldn't object to the original
> > MARC-8 character encoding, I'm guessing that sometime *after*
> > processing the record with MARC::Record it was either moved to, or
> > viewed in, a client/platform/environment that was not MARC-8 savvy
> > (which is pretty much everything) and that the
> > client/platform/environment, not recognizing the hex e5 and f2 as
> > valid character encodings, replaced them with the generic
> > replacement character for that client/platform/environment.
> >
> > So I'm thinking that we can rule out MARC::Record and look closer at
> > what happened to the data subsequent to MARC::Record processing.
> > That's my guess anyway, and I'm sticking with it until I hear a
> > better story. ;-)
> >
> > [1] UCLA's Voyager ILMS has been upgraded to a Unicode version, and
> > is able to display the characters accurately. My assumption is that
> > the author in the links below is the one in question.
> > See for example (looking at the title field, rather than the
> > underlined author/name field):
> > http://catalog.library.ucla.edu/cgi-bin/Pwebrecon.cgi?bbid=603048
> > http://catalog.library.ucla.edu/cgi-bin/Pwebrecon.cgi?bbid=603049
> > http://catalog.library.ucla.edu/cgi-bin/Pwebrecon.cgi?bbid=5053287
> > http://catalog.library.ucla.edu/cgi-bin/Pwebrecon.cgi?bbid=4490052
> >
> > [2] In MARC-8, combining diacritic characters precede the base
> > character, and as Ashley pointed out, E5 is "macron" and F2 is "dot
> > below."
> >
> > [3] hex "ef bf bd" = binary "11101111 10111111 10111101"
> > A three-octet UTF-8 character has the format of 1110xxxx 10xxxxxx
> > 10xxxxxx, with the "x" positions being the significant values in
> > determining the Unicode code point. When we concatenate those x
> > position values from the above binary code, we get 1111111111111101,
> > which, converted to hex, is FFFD.
> >
> > [4] See:
> > http://rocky.uta.edu/doran/urdu/search.cgi?char_set=unicode&char_type=hex&char_value=fffd
> > (or just go to http://rocky.uta.edu/doran/urdu/search.cgi and plug
> > in fffd)
> >
> > -- Michael
> >
> > # Michael Doran, Systems Librarian
> > # University of Texas at Arlington
> > # 817-272-5326 office
> > # 817-688-1926 cell
> > # [EMAIL PROTECTED]
> > # http://rocky.uta.edu/doran/
> >
> > > -----Original Message-----
> > > From: Ashley Sanders [mailto:[EMAIL PROTECTED]
> > > Sent: Wednesday, November 24, 2004 2:23 AM
> > > Cc: [EMAIL PROTECTED]
> > > Subject: Re: Character sets
> > >
> > > Ed Summers wrote:
> > > > On Tue, Nov 23, 2004 at 04:10:05PM -0600, John Hammer wrote:
> > > >
> > > >>I have a character problem that I hope someone can help me with.
> > > >>In a MARC record I am modifying using MARC::Record, one of the
> > > >>names contains letters with diacritics. Looking at the name with
> > > >>a hex editor, it gives, with hex values in curly brackets,
> > > >>"Bis{e5}a{f2}t{e5}i, Mu{f2}hammad."
> > > >>After running through MARC::Record, the name now appears as
> > > >>"Bis{ef bf bd}a{ef bf bd}t{ef bf bd}i, Mu{ef bf bd}hammad."
> > > >
> > > > That's pretty odd. Any chance you could send me the MARC record?
> > > > At this time MARC::Record does not play nicely with Unicode
> > > > (UTF8).
> > > >
> > > > http://rt.cpan.org/NoAuth/Bug.html?id=3707
> > >
> > > It is possible they are MARC-8 characters rather than utf-8. In
> > > MARC-8 E5 is "macron" and F2 is "dot below." Is MARC::Record
> > > trying to treat them as Unicode when in fact they are MARC-8?
> > >
> > > Ashley.
> > >
> > > --
> > > Ashley Sanders [EMAIL PROTECTED]
> > > Copac http://copac.ac.uk -- A MIMAS service funded by JISC
> >
> --
> Mike Rylander
> [EMAIL PROTECTED]
> GPLS -- PINES Development
> Database Developer
> http://open-ils.org
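
P.S. For anyone who wants to poke at the two points in this thread (the
leader position 09 test, and the arithmetic in footnote [3]), here is a
rough sketch. It is plain Python rather than Perl, it does not use
MARC::Record, MARC::Charset, or the patch itself, and every function
name in it (make_leader, leader_encoding, view_as_utf8,
three_octet_code_point) is made up for illustration. It also assumes
that the viewing application decoded the bytes as UTF-8 with
substitution on error, which is only my guess at what happened.

```python
def make_leader(coding_flag: bytes) -> bytes:
    """Build a hypothetical 24-byte leader with the given position-09 flag."""
    # positions 00-04 length, 05-07 status/type/level, 08 control type,
    # 09 character coding scheme, then the remaining fixed positions.
    leader = b"00000" + b"nam" + b" " + coding_flag + b"22" + b"00000" + b"   " + b"4500"
    assert len(leader) == 24
    return leader


def leader_encoding(record: bytes) -> str:
    """Report the encoding named by leader position 09.

    The leader is ASCII-only, so it can be inspected before the
    encoding of the rest of the record is known.
    """
    if len(record) < 24:
        raise ValueError("record is shorter than the 24-byte leader")
    flag = record[9:10]
    if flag == b" ":
        return "MARC-8"
    if flag == b"a":
        return "UCS/Unicode"
    raise ValueError(f"unexpected value in leader position 09: {flag!r}")


def view_as_utf8(raw: bytes) -> bytes:
    """Mimic a UTF-8 viewer that substitutes U+FFFD for invalid bytes."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")


def three_octet_code_point(b0: int, b1: int, b2: int) -> int:
    """Concatenate the x bits of 1110xxxx 10xxxxxx 10xxxxxx."""
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


if __name__ == "__main__":
    print(leader_encoding(make_leader(b" ")))
    print(leader_encoding(make_leader(b"a")))

    marc8_name = b"Bis\xe5a\xf2t\xe5i, Mu\xf2hammad."
    print(view_as_utf8(marc8_name).hex(" "))

    print(f"U+{three_octet_code_point(0xEF, 0xBF, 0xBD):04X}")
```

A patch could call something like leader_encoding() first and only run
the data through to_utf8/to_marc8 when the answer is MARC-8. How that
maps onto the actual MARC::File::XML code is for Mike to say.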