Re: MARC::Charset

Ashley Sanders Wed, 14 Mar 2007 01:57:56 -0800

Your MARC records appear to be encoded in MARC-8 as evidenced by "ergáo" in
which the combining accent character comes before the character to be
modified.  I.e. the byte string that displays as "ergáo" in your email would
display as "ergò" (with a Latin small letter o with grave) in a MARC-8
aware client.

I'd just like to relate my recent experiences of retrieving MARC21 records
through various library Z39.50 servers. Put simply, you cannot trust
character 9 of the MARC leader to correctly indicate the character set used.
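
For reference, here is a minimal Python sketch of reading what the leader
claims. In MARC 21, leader position 09 is a blank for MARC-8 and "a" for
UCS/Unicode. The function name declared_encoding is made up for this
example, and the result is only the record's claim, not a fact about the
bytes that follow.

```python
def declared_encoding(record: bytes) -> str:
    """Return the character coding scheme that leader position 09 claims.

    This reports the claim only; the record body may still be in
    another encoding, so treat the result as a hint.
    """
    if len(record) < 24:
        raise ValueError("a MARC 21 leader is 24 bytes long")
    flag = record[9:10]          # leader position 09, zero-based
    if flag == b"a":
        return "unicode"         # UCS/Unicode declared
    if flag == b" ":
        return "marc-8"          # blank means MARC-8
    return "unknown"             # anything else is outside MARC 21
```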

From libraries that have set the leader to indicate the records are in the
MARC-8 character set, I have retrieved records encoded as Latin-1, UTF-8
and MARC-8.

From libraries that set the leader to indicate Unicode, I get records in
MARC-8 and UTF-8.

You also get encodings in MARC-8 records like \1EF6 to indicate a Unicode
character. I think &#x3039; is now legal in MARC-8 to indicate a Unicode
character that isn't in the MARC-8 repertoire.
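
If the escape takes the form of a hexadecimal numeric character reference
such as &#x1EF6; (an assumption on my part about the exact syntax, with 4
to 6 hex digits), expanding it after the rest of the record has been
converted might look roughly like this. The name expand_ncrs is
hypothetical, and the backslash form is not handled.

```python
import re

# Assumed syntax: "&#x" + 4 to 6 hex digits + ";"
_NCR = re.compile(r"&#x([0-9A-Fa-f]{4,6});")


def expand_ncrs(text: str) -> str:
    """Replace hexadecimal numeric character references with the character."""

    def _sub(match: "re.Match[str]") -> str:
        code_point = int(match.group(1), 16)
        if code_point > 0x10FFFF:
            return match.group(0)   # not a valid code point, leave as-is
        return chr(code_point)

    return _NCR.sub(_sub, text)
```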

So, basically, you either need prior knowledge about the actual character
encoding used, or you have to test. Testing for UTF-8 is fairly
straightforward, and a long string of text (which admittedly you don't tend
to get in MARC records) that tests as UTF-8 is very unlikely to be anything
else. Distinguishing Latin-1 from MARC-8 is a bit more like guesswork. As a
test for MARC-8 I look for the common combining diacritics followed by a
vowel.
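
A rough Python sketch of that kind of test follows. It is not the code I
actually run; the function name guess_encoding and the choice of diacritics
(MARC-8 grave 0xE1, acute 0xE2, circumflex 0xE3, tilde 0xE4, umlaut 0xE8)
are assumptions made for the example, and the Latin-1 answer is only the
fallback when nothing else matches.

```python
import re

# MARC-8 combining diacritics come BEFORE the letter they modify.
# Assumed "common" set: grave, acute, circumflex, tilde, umlaut.
_MARC8_DIACRITIC_VOWEL = re.compile(rb"[\xE1\xE2\xE3\xE4\xE8][AEIOUaeiou]")


def guess_encoding(data: bytes) -> str:
    """Guess among ASCII, UTF-8, MARC-8 and Latin-1. Heuristic only."""
    if data.isascii():
        return "ascii"              # nothing to distinguish
    try:
        data.decode("utf-8")        # strict decode: fails on stray high bytes
        return "utf-8"
    except UnicodeDecodeError:
        pass
    if _MARC8_DIACRITIC_VOWEL.search(data):
        return "marc-8"             # diacritic byte followed by a vowel
    return "latin-1"                # fallback guess, not a positive test
```

With the example from the quoted message, the bytes b"erg\xe1o" are not
valid UTF-8 and contain 0xE1 followed by "o", so they come out as MARC-8.
Short fields give the UTF-8 test less to go on, so it is worth running the
guess over the whole record rather than field by field.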

Regards,

Ashley.
--
Ashley Sanders               [EMAIL PROTECTED]
Copac http://copac.ac.uk A MIMAS Service funded by JISC
