Hi Laurence,
> I'm trying to create MARC records from serials data exported
> from SFX, using MARC::Charset version 0.98 to convert UTF-8
> strings to MARC-8. It seems to be failing on extended latin
> characters like U+00C5 CAPITAL LETTER A WITH RING ABOVE
U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE) is a precomposed
character [1]. While U+00C5 is a perfectly valid Unicode character, I believe
the recommended practice for Unicode-encoded MARC-21 records is still to use
a base character followed by combining characters, and U+00C5 has no direct
equivalent in the MARC-21 repertoire [2,3].

If the strings are first normalized to Unicode Normalization Form D (NFD),
the precomposed characters are split into base and combining characters, and
the strings should then convert okay [4,5].
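
As a rough illustration of that normalization step, here is a minimal sketch.
It is written in Python only because its standard library makes the result
easy to check; it is not MARC::Charset code. In Perl, the core
Unicode::Normalize module exports an NFD function that could be applied to
each string before it is handed to MARC::Charset. The helper name
decompose_for_marc8 is hypothetical and exists only for this example.

```python
import unicodedata


def decompose_for_marc8(text: str) -> str:
    """Return text in Unicode Normalization Form D (NFD).

    Hypothetical helper: it only performs the preprocessing step,
    it does not convert anything to MARC-8.
    """
    return unicodedata.normalize("NFD", text)


precomposed = "\u00C5"  # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = decompose_for_marc8(precomposed)

# List the code points that make up the decomposed string.
print([f"U+{ord(ch):04X}" for ch in decomposed])
# prints ['U+0041', 'U+030A']
```

The output is the base letter followed by the combining ring, which is the
form described in [1]. Plain ASCII text passes through NFD unchanged, so the
step should be safe to apply to every string before conversion.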
> The records convert using Terry Reese's MarcEdit OK.
Perhaps MarcEdit performs the decomposition itself, or maps precomposed
Unicode characters directly to their decomposed MARC-8 equivalents.
-- Michael
[1] The decomposition (i.e. base and combining character) of U+00C5 (LATIN
CAPITAL LETTER A WITH RING ABOVE) is U+0041 (LATIN CAPITAL LETTER A) followed
by U+030A (COMBINING RING ABOVE).
[2] WORKING PRINCIPLES TO BE FOLLOWED IN MAPPING OF CHARACTERS FROM USMARC TO
UNICODE/UCS
* Accented letters ... will continue to be encoded as a base letter
and non-spacing marks. Use of precomposed accented letters is not
sanctioned at this stage.
From "USMARC Character Set Issues and Mapping to Unicode/UCS"
http://www.loc.gov/marc/marbi/1996/96-10.html
[3] MARC 21 Specifications > CHARACTER SETS > Code Tables
http://www.loc.gov/marc/specifications/specchartables.html
[4] Preprocessing Requirements
... preprocessing of the Unicode record before the conversion to
MARC-8 takes place. In all of the above techniques, the following
steps for decomposing diacritics were presumed.
Decompose the precomposed base character/character modifier combinations
using Unicode Normalization Form D (NFD) which produces exact equivalents,
and primarily applies decomposition to precomposed characters with
diacritics.
From "Technique for conversion of Unicode to MARC-8"
http://www.loc.gov/marc/marbi/2006/2006-04.html
[5] W3C > Charlint - A Character Normalization Tool
http://www.w3.org/International/charlint/
# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
> -----Original Message-----
> From: Laurence Lockton [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, September 18, 2007 5:21 AM
> To: perl4lib@perl.org
> Subject: MARC::Charset 'utf8_to_marc8'
>
> Hi,
>
> I'm trying to create MARC records from serials data exported
> from SFX, using MARC::Charset version 0.98 to convert UTF-8
> strings to MARC-8. It seems to be failing on extended latin
> characters like U+00C5 CAPITAL LETTER A WITH RING ABOVE,
> giving "no mapping found at position 176" for example.
> The records convert using Terry Reese's MarcEdit OK. Am I
> doing the wrong thing? Any advice gratefully received.
>
> Many thanks,
> Laurence Lockton
> University of Bath
> UK
>