Oops, this got mangled somehow...

> U+0044  LATIN CAPITAL LETTER D
> U+006F  LATIN SMALL LETTER O
> U+006E  LATIN SMALL LETTER N
> U+0074  LATIN SMALL LETTER T
> U+FE20  LIGATURE, FIRST HALF / COMBINING LIGATURE LEFT HALF
> U+0073  LATIN SMALL LETTER S
> U+FE21  LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF 006F  
> U+LATIN SMALL LETTER O
> U+0076  LATIN SMALL LETTER V
> U+0061  LATIN SMALL LETTER A
> U+002C  COMMA
> U+0020  SPACE, BLANK / SPACE
> U+0044  LATIN CAPITAL LETTER D
> U+0061  LATIN SMALL LETTER A
> U+0072  LATIN SMALL LETTER R
> U+02B9  SOFT SIGN, PRIME / MODIFIER LETTER PRIME
> U+0069  LATIN SMALL LETTER I
> U+FE20  LIGATURE, FIRST HALF / COMBINING LIGATURE LEFT HALF
> U+0061  LATIN SMALL LETTER A
> U+FE21  LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF 002E  
> U+PERIOD, DECIMAL POINT / FULL STOP

And should have been this:

U+0044  LATIN CAPITAL LETTER D
U+006F  LATIN SMALL LETTER O
U+006E  LATIN SMALL LETTER N
U+0074  LATIN SMALL LETTER T
U+FE20  LIGATURE, FIRST HALF / COMBINING LIGATURE LEFT HALF
U+0073  LATIN SMALL LETTER S
U+FE21  LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF
U+006F  LATIN SMALL LETTER O
U+0076  LATIN SMALL LETTER V
U+0061  LATIN SMALL LETTER A
U+002C  COMMA
U+0020  SPACE, BLANK / SPACE
U+0044  LATIN CAPITAL LETTER D
U+0061  LATIN SMALL LETTER A
U+0072  LATIN SMALL LETTER R
U+02B9  SOFT SIGN, PRIME / MODIFIER LETTER PRIME
U+0069  LATIN SMALL LETTER I
U+FE20  LIGATURE, FIRST HALF / COMBINING LIGATURE LEFT HALF
U+0061  LATIN SMALL LETTER A
U+FE21  LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF
U+002E  PERIOD, DECIMAL POINT / FULL STOP
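
For anyone who wants to regenerate a listing like the one above without going through a copy-and-paste step, here is a minimal sketch. It is in Python purely for illustration, the describe() helper is a made-up name, and the string literal is my own transcription of the code points listed above. Note that unicodedata only reports the current character names, not the older Unicode 1.0 names shown before the slash.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX  NAME' line per code point in text."""
    lines = []
    for ch in text:
        # Fall back to a placeholder for code points with no name (e.g. controls).
        name = unicodedata.name(ch, "<unnamed>")
        lines.append(f"U+{ord(ch):04X}  {name}")
    return lines


# Assumed transcription of the 21 code points in the corrected listing.
name = "Dont\ufe20s\ufe21ova, Dar\u02b9i\ufe20a\ufe21."

for line in describe(name):
    print(line)
```

Building the lines programmatically keeps each code point on its own line, which avoids the kind of wrapping damage seen in the first attempt.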

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
 

> -----Original Message-----
> From: Doran, Michael D 
> Sent: Friday, May 18, 2007 1:17 PM
> To: perl4lib@perl.org
> Subject: RE: MARC::Charset question
> 
> Hi Michael,
> 
> > An example is the author (personal name) of the book that can be found
> > at http://catalog.loc.gov/ by searching for ISBN
> > 5040039875 (I'm guessing the fact that the website appears to be 
> > displaying a corrupted name may be part of the problem here).
> 
> The Library of Congress catalog is outputting the MARC data 
> to your browser in Unicode UTF-8 and it looks correct to me.  
> It may *appear* corrupted, depending on what font you choose 
> to display the encoding (try Arial Unicode MS if you are in a 
> Windows environment).
> 
> > This name is 'Dontsova, Daria' (approximately),
> 
> Below is the UTF-16 encoding of the name in question, based 
> on a copy-and-paste directly from the browser 
> (http://catalog.loc.gov/cgi-bin/Pwebrecon.cgi?BBID=12550873).
> 
> U+0044  LATIN CAPITAL LETTER D
> U+006F  LATIN SMALL LETTER O
> U+006E  LATIN SMALL LETTER N
> U+0074  LATIN SMALL LETTER T
> U+FE20  LIGATURE, FIRST HALF / COMBINING LIGATURE LEFT HALF
> U+0073  LATIN SMALL LETTER S
> U+FE21  LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF 006F  
> U+LATIN SMALL LETTER O
> U+0076  LATIN SMALL LETTER V
> U+0061  LATIN SMALL LETTER A
> U+002C  COMMA
> U+0020  SPACE, BLANK / SPACE
> U+0044  LATIN CAPITAL LETTER D
> U+0061  LATIN SMALL LETTER A
> U+0072  LATIN SMALL LETTER R
> U+02B9  SOFT SIGN, PRIME / MODIFIER LETTER PRIME
> U+0069  LATIN SMALL LETTER I
> U+FE20  LIGATURE, FIRST HALF / COMBINING LIGATURE LEFT HALF
> U+0061  LATIN SMALL LETTER A
> U+FE21  LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF 002E  
> U+PERIOD, DECIMAL POINT / FULL STOP
> 
> 
> > ... in hex:
> > 446f6eeb74ec736f76612c20446172a7eb69ec612e.
> > When transcoded by marc8_to_utf8() the result is 
> > 446f6e74cda173006f76612c20446172cab969cda161002e
> > - which contains 2 null (00) characters.
> 
> 44 6f 6e [eb] 74    [ec] 73      6f 76 61 2c 20 44 61 72 [a7]    [eb] 69 [ec]    61      2e
> 44 6f 6e      74 [cd a1] 73 [00] 6f 76 61 2c 20 44 61 72 [ca b9]      69 [cd a1] 61 [00] 2e
> 
> Hmmmm.  It looks like the MARC-8 'COMBINING LIGATURE LEFT 
> HALF' ("0xEB") and/or the MARC-8 'COMBINING LIGATURE RIGHT 
> HALF' ("0xEC") got converted to a Unicode 'COMBINING DOUBLE 
> INVERTED BREVE' ("0xCD 0xA1" in UTF-8 [1]).  That doesn't 
> sound like something that MARC::Charset would do.
> 
> -- Michael
> 
> [1] Unicode Character 'COMBINING DOUBLE INVERTED BREVE' (U+0361)
>     http://www.fileformat.info/info/unicode/char/0361/index.htm
> 
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 mobile
> # [EMAIL PROTECTED]
> # http://rocky.uta.edu/doran/
> 
> 
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > Sent: Friday, May 18, 2007 5:49 AM
> > To: perl4lib@perl.org; [EMAIL PROTECTED]
> > Subject: MARC::Charset question
> > 
> > Hi,
> > 
> > I'm using marc8_to_utf8() on Library of Congress data. I'm finding 
> > that I get occasional null characters inserted in the output text, and
> > I'm wondering what this means.
> > 
> > An example is the author (personal name) of the book that can be found
> > at http://catalog.loc.gov/ by searching for ISBN
> > 5040039875 (I'm guessing the fact that the website appears to be 
> > displaying a corrupted name may be part of the problem here).
> > 
> > This name is 'Dontsova, Daria' (approximately), in hex:
> > 446f6eeb74ec736f76612c20446172a7eb69ec612e. When transcoded by
> > marc8_to_utf8() the result is
> > 446f6e74cda173006f76612c20446172cab969cda161002e - which contains 2 
> > null (00) characters.
> > 
> > Is it safe to ignore these null characters (i.e. strip them out of the
> > result, which otherwise seems good)?
> > 
> > Thanks,
> > 
> > Michael
> > 
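
As a rough way to inspect the transcoded bytes quoted above, the following sketch decodes the hex string as UTF-8 and reports where the null characters fall. It is Python for illustration only (the thread concerns the Perl module MARC::Charset), the variable names are invented, and the "expected" string is an assumed transcription of the corrected code point listing at the top of this message.

```python
# Output of marc8_to_utf8() as quoted in the original question.
transcoded = bytes.fromhex("446f6e74cda173006f76612c20446172cab969cda161002e")
text = transcoded.decode("utf-8")

# Assumed transcription of the code points the LoC catalog displays.
expected = "Dont\ufe20s\ufe21ova, Dar\u02b9i\ufe20a\ufe21."

# Indexes (in code points, not bytes) of each NUL in the transcoded text.
nul_positions = [i for i, ch in enumerate(text) if ch == "\x00"]

# Indexes of COMBINING LIGATURE RIGHT HALF in the expected text.
fe21_positions = [i for i, ch in enumerate(expected) if ch == "\ufe21"]

print("code points:", " ".join(f"U+{ord(ch):04X}" for ch in text))
print("NUL positions: ", nul_positions)
print("FE21 positions:", fe21_positions)
```

With this data the two position lists coincide, so each null sits where the catalog's own output has U+FE21. That is only an observation about these particular bytes; it does not by itself show whether stripping the nulls is safe in general.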
