Thanks everyone for the help thus far. Ed and I have been chatting on #code4lib, and it seems there are two problems. One is with the 0x9C character, which I now have a workaround for. I added the following to Charset.pm line 151:

    if ($marc8 =~ /\x{9C}/) {
        $utf8 .= ' ';
        $index += 1;
        next CHAR_LOOP;
    }

It's not ideal, but it gets rid of that problem well enough for me.

The next problem happens with the following record (number 54 in the
original batch I posted):

http://liblime.com/public/prob2.mrc

When I run the roundtrip conversion script I get the following error:

Cannot decode string with wide characters at /usr/local/lib/perl/5.8.4/Encode.pm line 188.

This time, the script just dies completely and nothing is written to
disk. The record passes marcdump's tests.

Ed, I'm still waiting for SF to update so I can nab that test script.
In the meantime, any ideas how to track this one down?

Cheers,

-- 
Joshua Ferraro               VENDOR SERVICES FOR OPEN-SOURCE SOFTWARE
President, Technology       migration, training, maintenance, support
LibLime                                Featuring Koha Open-Source ILS
[EMAIL PROTECTED] |Full Demos at http://liblime.com/koha |1(888)KohaILS

On Thu, May 18, 2006 at 11:16:52AM -0500, Edward Summers wrote:
> So I got curious (thanks to your convo in #code4lib). I isolated the
> problem to one record:
>
> http://www.inkdroid.org/tmp/one.dat
>
> Your roundtrip conversion complains:
>
> --
>
> no mapping found at position 8 in Price : <9c> 7.99; Inv.# B
> 476913; Date 06/03/98; Supplier : Dawson UK; Recd 20/03/98;
> Contents : 1. The problem : 1. Don't bargain over positions; 2.
> The method : 2. Separate the people from the problem; 3.
> Focus on interests, not positions; 4. Invent options for mutual
> gain; 5. Insist on using objective criteria; 3. Yes, but :
> 6. What if they are more powerful? 7. What if they won't
> play? 8. What if they use dirty tricks? 4. In conclusion; 5.
> Ten questions people ask about getting to yes; g0=ASCII_DEFAULT
> g1=EXTENDED_LATIN at /usr/local/lib/perl5/site_perl/5.8.7/MARC/
> Charset.pm line 126.
>
> --
>
> So I took a look at that position in the marc record and found a 0x9C
> character at that position, as the error message indicates. I can't
> find a 0x9C in either of the mapping tables that this record purports
> to use:
>
> BasicLatin (ASCII): http://lcweb2.loc.gov/cocoon/codetables/42.html
> Extended Latin (ANSEL): http://lcweb2.loc.gov/cocoon/codetables/45.html
>
> Looks like you might want to preprocess those records before
> translating. Since this character routinely occurs in the 586 field
> you could use MARC::Record to remove the offending character before
> writing as XML.
>
> Hope that helps somewhat. This character conversion stuff is a major
> pain.
>
> //Ed
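
For reference, the preprocessing step suggested in the quoted message could look roughly like the sketch below. It assumes the records are MARC-8 and that replacing each stray 0x9C with a space (the same substitution as the Charset.pm workaround) is acceptable. The names scrub_9c and scrub_subfields are hypothetical helpers, not part of MARC::Record or MARC::Charset, and the sketch works on plain [code, data] pairs so that it runs with core Perl only.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every 0x9C byte in a subfield value
# with a space, mirroring the Charset.pm workaround.
sub scrub_9c {
    my ($data) = @_;
    (my $clean = $data) =~ tr/\x9C/ /;
    return $clean;
}

# Hypothetical helper: apply scrub_9c to a list of [code, data] pairs
# and return new pairs, leaving the input pairs untouched.
sub scrub_subfields {
    my @pairs = @_;
    return map { [ $_->[0], scrub_9c($_->[1]) ] } @pairs;
}

# Example input modelled on the 586 data from the error message.
my @subfields = ( [ a => "Price : \x9C 7.99; Inv.# B 476913" ] );
my ($cleaned) = scrub_subfields(@subfields);
print $cleaned->[1], "\n";
```

In a real run the pairs would come from each data field of the record and the cleaned values would be written back before converting; that wiring is left out here because it depends on the installed MARC::Record version.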