There seems to be a small bug (by design) at work here. Both the leader and some characters come out wrong in this case:

+ Reading a raw MARC record (UTF-8) from file
+ Turning it into a MARC::Record object
+ Writing it out to file without modification

Yes, the bug manifests itself even without modification!
Let's start with code that simply copies one record from a file utf8.mrc containing one or more MARC records. This basic operation, which does not involve MARC::Record, works fine.

    #!perl -w
    use strict;
    #
    open(IN, "utf8.mrc") || die "1";
    open(OUT, ">out_good.mrc") || die "2";
    binmode IN;
    binmode OUT;
    #
    # Read in raw MARC
    $/ = "\x1D";
    my $marc = <IN>;
    print OUT $marc;
    __END__

Now we add MARC::Record to the process, along with some debug info. Example code producing a *faulty* record:

    #!perl -w
    use strict;
    use MARC::Record;
    use Devel::Peek;
    #
    open(IN, "utf8.mrc") || die "1";
    open(OUT, ">out_bad.mrc") || die "2";
    binmode IN;
    binmode OUT;
    #
    # Read in raw MARC
    $/ = "\x1D";
    my $marc = <IN>;
    Dump($marc);  # the utf8-flag is not on
    my $obj = MARC::Record->new_from_usmarc( $marc );
    # Convert back to raw MARC
    my $marc2 = $obj->as_usmarc();
    Dump($marc2); # the utf8-flag IS on
    print OUT $marc2;
    __END__

In this case the leader and the actual length will not agree, because your UTF-8 characters have been turned into Latin-1. The problem is that $marc2 has the utf8 flag set internally by Perl, and the conversion on output is made in spite of binmode. We can get around the problem by adding, for instance, either

    use bytes;

or

    Encode::_utf8_off($marc2);

before printing to file.

But shouldn't MARC::Record take care of this for us? A file of MARC records may contain records in different encodings. The text parts of a MARC record can be treated as being in a certain encoding, but the "blob" itself, I suppose, should be exposed to the caller as pure binary. Are there any drawbacks to letting MARC::Record strip off any utf8 flag that may be set before returning the record from as_usmarc()? If not, I suggest this change be made in a future release of MARC::Record.

I should also add that this character mess only sets in when doing IO. If you are updating your databases through one API or another, you are probably OK!

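The effect described above can be reproduced without MARC::Record. What follows is only a minimal sketch using core Perl: the sample string, the in-memory filehandle and the helper name written_bytes are made up for illustration, and Encode::encode is used here as a third possible workaround rather than either of the two named above.

```perl
#!perl -w
use strict;
use Encode qw(decode encode);

# Raw UTF-8 bytes as they would arrive from a binmode filehandle:
# "Leadership" followed by REGISTERED SIGN (U+00AE), encoded as C2 AE.
my $bytes = "Leadership\xC2\xAE";

# Decoding gives a character string with the utf8 flag on. This stands in
# for the flagged string that the thread reports as_usmarc() returning.
my $chars = decode('UTF-8', $bytes);

# Hypothetical helper: print $data to an in-memory binmode handle and
# return exactly the bytes that were written.
sub written_bytes {
    my ($data) = @_;
    my $buf = '';
    open(my $out, '>', \$buf) || die "cannot open in-memory handle: $!";
    binmode $out;
    print $out $data;
    close $out;
    return $buf;
}

# Printing the flagged string: Perl downgrades it to Latin-1 on output.
my $bad = written_bytes($chars);

# Encoding explicitly first: the handle receives plain UTF-8 bytes.
my $good = written_bytes(encode('UTF-8', $chars));

printf "original: %d bytes\n", length $bytes;
printf "flagged:  %d bytes\n", length $bad;
printf "encoded:  %d bytes\n", length $good;
```

The flagged string comes out one byte shorter than the original, which is the kind of disagreement between leader and actual length discussed above, while the explicitly encoded string round-trips unchanged.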
Leif

======================================
Leif Andersson, Systems Librarian
Stockholm University Library
SE-106 91 Stockholm
SWEDEN
Phone : +46 8 162769
Mobile: +46 70 6904281

-----Original Message-----
From: Doran, Michael D [mailto:[EMAIL PROTECTED]
Sent: 21 February 2008 18:49
To: perl4lib@perl.org
Subject: RE: Help for utf-8 output

Hi Jackie,

I'm working on a very similar problem... converting theses/dissertations records (in XML) to MARC records. I'm still in the testing stage, but have had similar problems with records with diacritics in the 100 or 245 fields (however, diacritics in a 520a field don't seem to cause any problems). Since our records are not "diacritic rich" it's hard to determine the exact extent of the problem.

I am using these versions:

  Perl v5.8.8
  MARC::Charset 0.98
  MARC::Lint 1.43
  MARC::Record 2.0
  XML::LibXML 1.66

Here's an example "bad" record (which I have minimized to just the 245 field):

  marcdump test.mrc
  test.mrc
  LDR 00127cam a2200037 4500
  245 13 _aAn Empirical Test Of The Situational Leadership® Model In Japan /
         _cRiho Yoshioka.

   Recs  Errs Filename
  ----- ----- --------
      1     1 test.mrc

When I run test.mrc through MARC::Lint, I get these messages:

  Invalid record length in record 1: Leader says 00127 bytes but it's actually 125
  Invalid length in directory for tag 245 in record 1
  field does not end in end of field character in tag 245 in record 1

When examined in vi, the character in question, a Registered Sign, appears to be correctly UTF-8 encoded (C2AE), and the bib Leader (position 09=a) indicates that the record is Unicode encoded. I've attached the MARC record.

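The record length complaint comes down to comparing the length declared in leader positions 00-04 with the number of bytes actually present. The sketch below illustrates that comparison on a made-up one-field record. The helper names build_record and leader_length_mismatch are hypothetical, the check is not MARC::Lint's own code, and the simulated damage (a two-byte C2 AE collapsing to a single AE byte) is one plausible cause rather than a diagnosis of the attached record.

```perl
#!perl -w
use strict;

my $SUBFIELD = "\x1F";   # subfield delimiter
my $FIELD_END = "\x1E";  # end of field
my $RECORD_END = "\x1D"; # end of record

# Hypothetical helper: build a raw one-field record (tag 245) around the
# given field data, with leader length and base address filled in.
sub build_record {
    my ($data) = @_;
    my $field = $data . $FIELD_END;
    my $dir   = sprintf("245%04d%05d", length $field, 0) . $FIELD_END;
    my $base  = 24 + length $dir;
    my $total = $base + length($field) + length($RECORD_END);
    my $leader = sprintf("%05dcam a22%05d   4500", $total, $base);
    return $leader . $dir . $field . $RECORD_END;
}

# Hypothetical helper: return undef when the leader agrees with the real
# byte count, otherwise a pair [declared, actual].
sub leader_length_mismatch {
    my ($raw) = @_;
    my $declared = substr($raw, 0, 5) + 0;
    my $actual   = length $raw;
    return $declared == $actual ? undef : [$declared, $actual];
}

# A consistent record whose title contains the UTF-8 bytes C2 AE.
my $ok = build_record("13" . $SUBFIELD . "aLeadership\xC2\xAE /");

# Simulate the damage: the two-byte sequence becomes one Latin-1 byte,
# while the leader and directory keep their old numbers.
(my $damaged = $ok) =~ s/\xC2\xAE/\xAE/;

for my $rec ($ok, $damaged) {
    my $diff = leader_length_mismatch($rec);
    if ($diff) {
        printf "Leader says %05d bytes but it's actually %d\n", @$diff;
    } else {
        print "Leader and actual length agree\n";
    }
}
```

Because every field after the damaged one also starts at a different offset than the directory claims, one shortened character would be enough to trigger the follow-on "does not end in end of field character" messages as well.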
I noticed that when I run your record (ck245.dat) through MARC::Lint, I get the same invalid record length message:

  Invalid record length in record 3: Leader says 00567 bytes but it's actually 569
  field does not end in end of field character in tag 100 in record 3
  field does not end in end of field character in tag 245 in record 3
  Invalid indicators ".10" forced to blanks in record 3 for tag 245
  field does not end in end of field character in tag 260 in record 3
  Invalid indicators ". " forced to blanks in record 3 for tag 260
  field does not end in end of field character in tag 300 in record 3
  Invalid indicators ". " forced to blanks in record 3 for tag 300
  field does not end in end of field character in tag 502 in record 3
  Invalid indicators ". " forced to blanks in record 3 for tag 502
  field does not end in end of field character in tag 504 in record 3
  Invalid indicators ". " forced to blanks in record 3 for tag 504
  field does not end in end of field character in tag 690 in record 3
  Invalid indicators ". 4" forced to blanks in record 3 for tag 690

Anybody have any ideas?

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/

> -----Original Message-----
> From: Shieh, Jackie [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, February 19, 2008 10:50 AM
> To: perl4lib@perl.org
> Subject: Help for utf-8 output
>
> I was wondering if anyone has similar experience and has come
> up with good solutions to help solving the challenge below?!
>
> What I have is an Excel spreadsheet for dissertations which I
> have saved as a tab delimited file (examining the file in
> TextPad, the diacritics appears to be fine), then read in and
> output the file as a utf-8 MARC file. I <print> title field
> confirming author field that contains diacritics with the
> title showing proper indicator values.
>
> But when I looked the MARC itself, the fields that follow the
> field containing diacritics are all off its original
> position. See attached zip file. Examples below: first two
> have diacritics in a 100 field, last one diacritic is in 245
> subfield b)
>
> 001    diss 34001
> 100 1  _aP<E9>rez, Nancy L.
> 245    _aSynchronic and Diachronic Matlatzinkan Phonology.
>
> 001    diss 34042
> 100 1  _aValent<ED>n-M<E1>rquez, Wilfredo
> 245    _aDoing being boricua :
>
> 001    diss 33892
> 100 1  _aDavis, Jennifer M.
> 245 14 _aThe Functional Complexities of Inherited Cardiac
>        Troponin I Mutations :
>        _bIdentification of Ca<B2>+ Independent
>        Contractile Dysfunction.
>
> I would be greatly appreciate any suggestion to solve this.
> Thank you most kindly.
>
> Regards,
>
> --Jackie
>
> |Jackie Shieh
> |Data Loads & Development
> |Harlan Hatcher Graduate Library
> |University of Michigan
> |920 North University
> |Ann Arbor, MI 48109-1205
> |Phone: 734.763.6070 FAX: 734.615.9788
> |E-mail: JShieh [AT] umich [DOT] edu
>