MARC::Charset 'utf8_to_marc8'

2007-09-18 Thread Laurence Lockton

Hi,

I'm trying to create MARC records from serials data exported from SFX, 
using MARC::Charset version 0.98 to convert UTF-8 strings to MARC-8. It
seems to be failing on extended Latin characters like U+00C5 CAPITAL LETTER
A WITH RING ABOVE, giving "no mapping found at position 176" for example. 
The records convert using Terry Reese's MarcEdit OK. Am I doing the wrong 
thing? Any advice gratefully received.


Many thanks,
Laurence Lockton
University of Bath
UK


Re: wide character in print in MARC::File::XML

2007-09-18 Thread Mike Rylander
On 8/19/07, Rina Ron <[EMAIL PROTECTED]> wrote:
> I got the error "wide character in print at IO/Handle.pm line 399" and
> solved it in a not nice way by using an object as a hash reference.
> Perhaps there is a better way.

I have updated CVS with code that checks the default output encoding
and sets the binmode on the file appropriately.  I also added support
to the out() method for an optional encoding parameter to override the
default encoding.  For instructions on installing MARC-XML from
source, please see http://sourceforge.net/cvs/?group_id=1254 -- the
module name is marc-xml.  I would appreciate any testing you (or
anyone on the list) can provide.

Thanks in advance, folks.

-- 
Mike Rylander
Equinox Software, Inc
[EMAIL PROTECTED]
http://esilibrary.com/


RE: MARC::Charset 'utf8_to_marc8'

2007-09-18 Thread Doran, Michael D
Hi Laurence,

> I'm trying to create MARC records from serials data exported 
> from SFX, using MARC::Charset version 0.98 to convert UTF-8
> strings to MARC-8. It seems to be failing on extended Latin
> characters like U+00C5 CAPITAL LETTER A WITH RING ABOVE

U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE) is a precomposed character [1].
While U+00C5 is a perfectly valid Unicode code point, I believe that the
recommended practice for Unicode-encoded MARC-21 records is still to use base
and combining characters, and U+00C5 doesn't have a direct equivalent in the
MARC-21 repertoire [2,3].

If the strings are first normalized using Unicode Normalization Form D, they 
should convert okay [4,5].  
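
As a minimal sketch of that normalization step (written in Python purely for
illustration; it is not the MARC::Charset API, and the helper names decompose
and code_points are hypothetical), the following shows U+00C5 being split into
its base letter and combining mark under NFD. In Perl, the core
Unicode::Normalize module provides an NFD function for the same step, which
would presumably be applied to each string before it is passed to
utf8_to_marc8.

```python
import unicodedata


def decompose(text: str) -> str:
    """Return text in Unicode Normalization Form D (base letters + combining marks)."""
    return unicodedata.normalize("NFD", text)


def code_points(text: str) -> list[str]:
    """List the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


precomposed = "\u00C5"               # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = decompose(precomposed)  # base letter followed by combining ring

print(code_points(precomposed))  # ['U+00C5']
print(code_points(decomposed))   # ['U+0041', 'U+030A']
```

Plain ASCII text passes through NFD unchanged, so it should be safe to apply
the normalization to every string rather than only to those known to contain
accented letters.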

> The records convert using Terry Reese's MarcEdit OK.

Perhaps MarcEdit incorporates the decomposition or has direct conversion of 
precomposed Unicode to decomposed MARC-8.

-- Michael 

[1] The decomposition (i.e. base and combining character) values for "CAPITAL 
LETTER A WITH RING ABOVE" would be U+0041 (LATIN CAPITAL LETTER A) followed by 
U+030A (COMBINING RING ABOVE).

[2] WORKING PRINCIPLES TO BE FOLLOWED IN MAPPING OF CHARACTERS FROM USMARC TO 
UNICODE/UCS

   * Accented letters ... will continue to be encoded as a base letter
     and non-spacing marks. Use of precomposed accented letters is not
     sanctioned at this stage.

From "USMARC Character Set Issues and Mapping to Unicode/UCS"
http://www.loc.gov/marc/marbi/1996/96-10.html 

[3] MARC 21 Specifications > CHARACTER SETS > Code Tables
http://www.loc.gov/marc/specifications/specchartables.html

[4] Preprocessing Requirements

... preprocessing of the Unicode record before the conversion to
MARC-8 takes place. In all of the above techniques, the following
steps for decomposing diacritics were presumed.

Decompose the precomposed base character/character modifier combinations
using Unicode Normalization Form D (NFD) which produces exact equivalents,
and primarily applies decomposition to precomposed characters with 
diacritics.

From "Technique for conversion of Unicode to MARC-8"
http://www.loc.gov/marc/marbi/2006/2006-04.html

[5] W3C > Charlint - A Character Normalization Tool
http://www.w3.org/International/charlint/

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
 

> -----Original Message-----
> From: Laurence Lockton [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, September 18, 2007 5:21 AM
> To: perl4lib@perl.org
> Subject: MARC::Charset 'utf8_to_marc8'
> 
> Hi,
> 
> I'm trying to create MARC records from serials data exported 
> from SFX, using MARC::Charset version 0.98 to convert UTF-8
> strings to MARC-8. It seems to be failing on extended Latin
> characters like U+00C5 CAPITAL LETTER A WITH RING ABOVE, 
> giving "no mapping found at position 176" for example. 
> The records convert using Terry Reese's MarcEdit OK. Am I 
> doing the wrong thing? Any advice gratefully received.
> 
> Many thanks,
> Laurence Lockton
> University of Bath
> UK
>