Oops, this got mangled somehow...

> U+0044 LATIN CAPITAL LETTER D
> U+006F LATIN SMALL LETTER O
> U+006E LATIN SMALL LETTER N
> U+0074 LATIN SMALL LETTER T
> U+FE20 LIGATURE, FIRST HALF / COMBINING LIGATURE LEFT HALF
> U+0073 LATIN SMALL LETTER S
> U+FE21 LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF 006F
> U+LATIN SMALL LETTER O
> U+0076 LATIN SMALL LETTER V
> U+0061 LATIN SMALL LETTER A
> U+002C COMMA
> U+0020 SPACE, BLANK / SPACE
> U+0044 LATIN CAPITAL LETTER D
> U+0061 LATIN SMALL LETTER A
> U+0072 LATIN SMALL LETTER R
> U+02B9 SOFT SIGN, PRIME / MODIFIER LETTER PRIME
> U+0069 LATIN SMALL LETTER I
> U+FE20 LIGATURE, FIRST HALF / COMBINING LIGATURE LEFT HALF
> U+0061 LATIN SMALL LETTER A
> U+FE21 LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF 002E
> U+PERIOD, DECIMAL POINT / FULL STOP

And should have been this:

U+0044 LATIN CAPITAL LETTER D
U+006F LATIN SMALL LETTER O
U+006E LATIN SMALL LETTER N
U+0074 LATIN SMALL LETTER T
U+FE20 LIGATURE, FIRST HALF / COMBINING LIGATURE LEFT HALF
U+0073 LATIN SMALL LETTER S
U+FE21 LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF
U+006F LATIN SMALL LETTER O
U+0076 LATIN SMALL LETTER V
U+0061 LATIN SMALL LETTER A
U+002C COMMA
U+0020 SPACE, BLANK / SPACE
U+0044 LATIN CAPITAL LETTER D
U+0061 LATIN SMALL LETTER A
U+0072 LATIN SMALL LETTER R
U+02B9 SOFT SIGN, PRIME / MODIFIER LETTER PRIME
U+0069 LATIN SMALL LETTER I
U+FE20 LIGATURE, FIRST HALF / COMBINING LIGATURE LEFT HALF
U+0061 LATIN SMALL LETTER A
U+FE21 LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF
U+002E PERIOD, DECIMAL POINT / FULL STOP

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/

> -----Original Message-----
> From: Doran, Michael D
> Sent: Friday, May 18, 2007 1:17 PM
> To: perl4lib@perl.org
> Subject: RE: MARC::Charset question
>
> Hi Michael,
>
> > An example is the author (personal name) of the book that can be found
> > at http://catalog.loc.gov/ by searching for ISBN
> > 5040039875 (I'm guessing the fact that the website appears to be
> > displaying a corrupted name may be part of the problem here).
>
> The Library of Congress catalog is outputting the MARC data
> to your browser in Unicode UTF-8 and it looks correct to me.
> It may *appear* corrupted, depending on what font you choose
> to display the encoding (try Arial Unicode MS if you are in a
> Windows environment).
>
> > This name is 'Dontsova, Daria' (approximately),
>
> Below is the UTF-16 encoding of the name in question, based
> on a copy-and-paste directly from the browser
> (http://catalog.loc.gov/cgi-bin/Pwebrecon.cgi?BBID=12550873).
>
> U+0044 LATIN CAPITAL LETTER D
> U+006F LATIN SMALL LETTER O
> U+006E LATIN SMALL LETTER N
> U+0074 LATIN SMALL LETTER T
> U+FE20 LIGATURE, FIRST HALF / COMBINING LIGATURE LEFT HALF
> U+0073 LATIN SMALL LETTER S
> U+FE21 LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF 006F
> U+LATIN SMALL LETTER O
> U+0076 LATIN SMALL LETTER V
> U+0061 LATIN SMALL LETTER A
> U+002C COMMA
> U+0020 SPACE, BLANK / SPACE
> U+0044 LATIN CAPITAL LETTER D
> U+0061 LATIN SMALL LETTER A
> U+0072 LATIN SMALL LETTER R
> U+02B9 SOFT SIGN, PRIME / MODIFIER LETTER PRIME
> U+0069 LATIN SMALL LETTER I
> U+FE20 LIGATURE, FIRST HALF / COMBINING LIGATURE LEFT HALF
> U+0061 LATIN SMALL LETTER A
> U+FE21 LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF 002E
> U+PERIOD, DECIMAL POINT / FULL STOP
>
> > ... in hex:
> > 446f6eeb74ec736f76612c20446172a7eb69ec612e.
> > When transcoded by marc8_to_utf8() the result is
> > 446f6e74cda173006f76612c20446172cab969cda161002e
> > - which contains 2 null (00) characters.
>
> 44 6f 6e [eb] 74 [ec] 73 6f 76 61 2c 20 44 61 72 [a7] [eb] 69 [ec] 61 2e
> 44 6f 6e 74 [cd a1] 73 [00] 6f 76 61 2c 20 44 61 72 [ca b9] 69 [cd a1] 61 [00] 2e
>
> Hmmmm. It looks like the MARC-8 'COMBINING LIGATURE LEFT
> HALF' ("0xEB") and/or the MARC-8 'COMBINING LIGATURE RIGHT
> HALF' ("0xEC") got converted to a Unicode 'COMBINING DOUBLE
> INVERTED BREVE' ("0xCD 0xA1" in UTF-8 [1]). That doesn't
> sound like something that MARC::Charset would do.
>
> -- Michael
>
> [1] Unicode Character 'COMBINING DOUBLE INVERTED BREVE' (U+0361)
> http://www.fileformat.info/info/unicode/char/0361/index.htm
>
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 mobile
> # [EMAIL PROTECTED]
> # http://rocky.uta.edu/doran/
>
>
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > Sent: Friday, May 18, 2007 5:49 AM
> > To: perl4lib@perl.org; [EMAIL PROTECTED]
> > Subject: MARC::Charset question
> >
> > Hi,
> >
> > I'm using marc8_to_utf8() on Library of Congress data. I'm finding
> > that I get occasional null characters inserted in the output text, and
> > I'm wondering what this means.
> >
> > An example is the author (personal name) of the book that can be found
> > at http://catalog.loc.gov/ by searching for ISBN
> > 5040039875 (I'm guessing the fact that the website appears to be
> > displaying a corrupted name may be part of the problem here).
> >
> > This name is 'Dontsova, Daria' (approximately), in hex:
> > 446f6eeb74ec736f76612c20446172a7eb69ec612e. When transcoded by
> > marc8_to_utf8() the result is
> > 446f6e74cda173006f76612c20446172cab969cda161002e - which contains 2
> > null (00) characters.
> >
> > Is it safe to ignore these null characters (i.e. strip them out of the
> > result, which otherwise seems good)?
> >
> > Thanks,
> >
> > Michael
> >
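
For anyone who wants to inspect the transcoded bytes quoted in this thread, here is a minimal sketch in Python (not Perl, and not using MARC::Charset). It decodes the UTF-8 hex string reported above, lists each code point with its Unicode name, and records where the two null characters fall. The names UTF8_HEX, describe, null_positions and stripped are made up for this illustration. The sketch only shows what the output contains; it does not establish whether stripping the nulls is the right fix.

```python
import unicodedata

# UTF-8 output of marc8_to_utf8() as reported in the thread (hex string).
UTF8_HEX = "446f6e74cda173006f76612c20446172cab969cda161002e"


def describe(text):
    """Return (code point label, Unicode name) pairs, one per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<no name>"))
        for ch in text
    ]


# Decode the reported bytes as UTF-8.
decoded = bytes.fromhex(UTF8_HEX).decode("utf-8")

# Character offsets (not byte offsets) of the null characters.
null_positions = [i for i, ch in enumerate(decoded) if ch == "\x00"]

# The same string with the nulls removed, for comparison only.
stripped = decoded.replace("\x00", "")

for label, name in describe(decoded):
    print(label, name)
print("NUL at character offsets:", null_positions)
```

The listing makes the difference from the browser copy visible: the transcoded string carries U+0361 where the catalog display has the U+FE20/U+FE21 pair, and each null sits directly after the letter that follows a U+0361.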