RE: MARC::Charset

Doran, Michael D Wed, 14 Mar 2007 06:13:41 -0800

Hi Henri-Damien,

> And any LOWERCASE DIGRAPH AE or UPPERCASE DIGRAPH AE or 
> LOWERCASE DIGRAPH OE is not well encoded. Encoding is 
> **assumed** to be latin1 translated into utf-8 in the 
> catalogue I am working on but appears respectively µ, ¥,¶
> in biblios.


        hex     MARC-8                  ISO-8859-1 (Latin-1)
-       ----    --------------------    --------------------
µ       0xB5    LOWERCASE DIGRAPH AE    MICRO SIGN
¥       0xA5    UPPERCASE DIGRAPH AE    YEN SIGN
¶       0xB6    LOWERCASE DIGRAPH OE    PILCROW SIGN

> Is there a way to fix things up ?

If the underlying numerical encoding in your MARC records for the digraphs in 
question is hex 0xB5, 0xA5, and 0xB6, then the character set is not Latin-1; it 
is MARC-8.  If that is the case, I don't believe that anything needs to be 
fixed; if you are using MARC::Charset to convert the records from MARC-8 to 
UTF-8, it should work.
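
As a rough illustration of the table above, here is a minimal Python sketch (not MARC::Charset itself) showing how the same three octets read under each interpretation.  The names MARC8_DIGRAPHS, as_latin1 and as_marc8_digraphs are hypothetical, and the mapping is only a three-entry excerpt of the MARC-8 table; a real conversion also has to handle combining characters and escape sequences, which is what MARC::Charset is for.

```python
# Hypothetical three-entry excerpt of the MARC-8 table; NOT a full converter.
MARC8_DIGRAPHS = {
    0xB5: "\u00e6",  # LOWERCASE DIGRAPH AE
    0xA5: "\u00c6",  # UPPERCASE DIGRAPH AE
    0xB6: "\u0153",  # LOWERCASE DIGRAPH OE
}


def as_latin1(raw: bytes) -> str:
    """Interpret the octets as ISO-8859-1, which yields the wrong glyphs."""
    return raw.decode("latin-1")


def as_marc8_digraphs(raw: bytes) -> str:
    """Map only the three digraph octets as MARC-8.

    Assumption for illustration: every other octet is passed through
    unchanged as its Latin-1 code point, which is only safe for ASCII.
    """
    return "".join(MARC8_DIGRAPHS.get(b, chr(b)) for b in raw)


raw = bytes([0xB5, 0xA5, 0xB6])
print(as_latin1(raw))          # µ¥¶
print(as_marc8_digraphs(raw))  # æÆœ
```

If the octets in your records really are 0xB5, 0xA5 and 0xB6, the second reading is the one that matches the digraphs you expect.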

However, it may also be that I am misunderstanding the issue.  It would help if 
you could provide the pertinent Perl code you are using for the character set 
translation and a couple of the MARC records with digraphs that are failing.

> ... but appears respectively µ, ¥,¶ in biblios.

Please excuse my ignorance, but what is 'biblios' in the context of this 
discussion?

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
 

> -----Original Message-----
> From: Henri-Damien LAURENT [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, March 14, 2007 4:18 AM
> To: Doran, Michael D; perl4lib
> Subject: Re: MARC::Charset
> 
> Doran, Michael D wrote:
> > Hi Henri,
> >   
> > Although in my email client, the character in question appears as
> > a MICRO SIGN ("µ"), I am assuming that it is actually meant to be a
> > LOWERCASE DIGRAPH AE ("æ") since that is consistent with the Latin
> > vernacular text in your record.  In MARC-8, the LOWERCASE DIGRAPH AE
> > character is a precomposed character represented by 0xB5 in hex [1].
> > You mention that you are using MARC::File::XML which in turn uses
> > MARC::Charset.  I'm wondering if there is some confusion as to the
> > expected encoding of the MARC records being processed/converted?  If
> > MARC::Charset is expecting MARC21 Unicode/UCS encoded records, but is
> > actually getting MARC-8 encoded records, then in that context it
> > likely wouldn't know what to do with the 0xB5 octet and that might be
> > the cause of the error you are seeing.
> >
> > -- Michael
> >
> > [1] Your MARC records appear to be encoded in MARC-8 as evidenced by
> > "ergáo" in which the combining accent character comes before the
> > character to be modified.  I.e. the byte string that displays as
> > "ergáo" in your email would display as "ergò" (with a Latin small
> > letter o with grave) in a MARC-8 aware client.
> >   
> >   
> Thanks for your answer.
> Well, this could be a precious hint.
> Indeed, in that catalogue I want to process, some books are 
> ancient books and were catalogued from OCLC or SUDOC.
> And any LOWERCASE DIGRAPH AE or UPPERCASE DIGRAPH AE or 
> LOWERCASE DIGRAPH OE is not well encoded. Encoding is 
> **assumed** to be latin1 translated into utf-8 in the 
> catalogue I am working on but appears respectively µ, ¥,¶ in biblios.
> 
> Is there a way to fix things up ?
> 
> --
> Henri Damien LAURENT et Paul POULAIN
> Independent consultants
> in free software and library science (http://www.koha-fr.org)
>
