Re: Help for utf-8 output

Leif Andersson Sat, 01 Mar 2008 12:51:47 -0800

It seems there is a little bug (by design) kicking in.

The leader ends up wrong and some characters are corrupted in this case:
   + Reading a raw MARC record (UTF-8) from file
   + Turning it into a MARC::Record object
   + Writing it out to file without modification.
     Yes. Even without modification the bug manifests itself!


Let's start with code that simply copies one record from a file, utf8.mrc, 
containing one or more MARC records. This basic operation, which does not 
involve MARC::Record, works fine.

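#!perl -w
use strict;

# Open both files in raw (binary) mode
open(my $in,  '<', 'utf8.mrc')     or die "Cannot read utf8.mrc: $!";
open(my $out, '>', 'out_good.mrc') or die "Cannot write out_good.mrc: $!";
binmode $in;
binmode $out;

# Read in one raw MARC record (records end with 0x1D)
$/ = "\x1D";
my $marc = <$in>;
print {$out} $marc;
__END__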

Now we add MARC::Record to the process, along with some debug info.
Example code producing a *faulty* record:

#!perl -w
use strict;
use MARC::Record;
use Devel::Peek;

# Open both files in raw (binary) mode
open(my $in,  '<', 'utf8.mrc')    or die "Cannot read utf8.mrc: $!";
open(my $out, '>', 'out_bad.mrc') or die "Cannot write out_bad.mrc: $!";
binmode $in;
binmode $out;

# Read in one raw MARC record (records end with 0x1D)
$/ = "\x1D";
my $marc = <$in>;
Dump($marc);  # the utf8-flag is not on
my $obj  = MARC::Record->new_from_usmarc( $marc );
# Convert back to raw MARC
my $marc2 = $obj->as_usmarc();
Dump($marc2); # the utf8-flag IS on
print {$out} $marc2;
__END__


In this case the record length stated in the leader and the actual length 
will not agree, because your UTF-8 characters have been turned into Latin-1 
on output.
The problem is that $marc2 has the utf8 flag set internally by Perl, and the 
conversion on output is made in spite of binmode.
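
A minimal sketch of this effect, assuming only core Perl and the Encode 
module. The sample string and the raw_output helper are made up for 
illustration and are not part of MARC::Record:

```perl
use strict;
use warnings;
use Encode ();

# UTF-8 octets as they come from a file read with binmode (utf8 flag off).
my $octets = "Leadership\xC2\xAE";
# The same text as a character string (utf8 flag on), roughly what you
# hold once the data has been decoded.
my $chars = Encode::decode('UTF-8', $octets);

# Hypothetical helper: print a string to an in-memory handle in raw
# (binmode) mode and return the bytes that were actually written.
sub raw_output {
    my ($string) = @_;
    my $buffer = '';
    open(my $fh, '>', \$buffer) or die "Cannot open in-memory handle: $!";
    binmode $fh;
    print {$fh} $string;
    close $fh;
    return $buffer;
}

my $from_octets = raw_output($octets);
my $from_chars  = raw_output($chars);

printf "flag off: %d bytes written\n", length $from_octets;
printf "flag on:  %d bytes written\n", length $from_chars;
```

With the flag off the two-byte sequence C2 AE should pass through untouched; 
with the flag on, the Registered Sign should come out as the single Latin-1 
byte AE, so the output is one byte shorter than a length counted in UTF-8 
octets.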

We can get around the problem by, for instance, either
use bytes;
  or
Encode::_utf8_off($marc2);
before printing to file.
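
A rough sketch of the second workaround, using a short made-up string in 
place of a real as_usmarc() result. The explicit Encode::encode call shown 
as option B is an assumption added for comparison; it is not part of the 
workaround above:

```perl
use strict;
use warnings;
use Encode ();

# Stand-in for the flagged string returned by as_usmarc()
# (assumption: a short made-up text, not a real MARC record).
my $marc2 = Encode::decode('UTF-8', "P\xC3\xA9rez, Nancy L.");

# Option A: switch the flag off on a copy, as in the workaround above.
my $flag_off = $marc2;
Encode::_utf8_off($flag_off);

# Option B (assumption, for comparison): encode to UTF-8 octets explicitly.
my $encoded = Encode::encode('UTF-8', $marc2);

printf "characters:             %d\n", length $marc2;
printf "octets after _utf8_off: %d\n", length $flag_off;
printf "octets after encode:    %d\n", length $encoded;
```

Both options should yield the same octets, one more than the character 
count here because of the two-byte "é". Encode's documentation describes 
_utf8_off as an internal function, which is one reason to consider the 
explicit encode instead.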

But shouldn't MARC::Record take care of this for us?
A file of MARC records may contain records in different encodings.
The text parts of a MARC record can be treated as being in particular 
encodings, but the "blob" itself, I suppose, should be exposed to the caller 
as pure binary.

Are there any drawbacks in letting MARC::Record strip off the utf8 flag, if 
set, before returning the record from as_usmarc()?
If not, I suggest this change be made in a future release of MARC::Record.

I should also add that this character mess only sets in when doing file I/O.
If you are updating your databases through one API or another you are probably 
OK!


Leif
======================================
Leif Andersson, Systems Librarian
Stockholm University Library
SE-106 91 Stockholm
SWEDEN
Phone : +46 8 162769
Mobile: +46 70 6904281

-----Original Message-----
From: Doran, Michael D [mailto:[EMAIL PROTECTED] 
Sent: 21 February 2008 18:49
To: perl4lib@perl.org
Subject: RE: Help for utf-8 output

Hi Jackie,

I'm working on a very similar problem... converting theses/dissertations 
records (in XML) to MARC records.  I'm still in the testing stage, but have had 
similar problems with records with diacritics in the 100 or 245 fields (however 
diacritics in a 520a field don't seem to cause any problems).  Since our 
records are not "diacritic rich" it's hard to determine the exact extent of the 
problem.

I am using these versions:
  Perl v5.8.8
  MARC::Charset 0.98
  MARC::Lint 1.43
  MARC::Record 2.0
  XML::LibXML 1.66

Here's an example "bad" record (which I have minimized to just the 245 field):

marcdump test.mrc
test.mrc
LDR 00127cam a2200037   4500
245 13 _aAn Empirical Test Of The Situational Leadership® Model In Japan /
       _cRiho Yoshioka.

 Recs  Errs Filename
----- ----- --------
    1     1 test.mrc

When I run test.mrc through MARC::Lint, I get this message:

 Invalid record length in record 1: Leader says 00127 bytes but it's actually 
125
 Invalid length in directory for tag 245 in record 1
 field does not end in end of field character in tag 245 in record 1
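
The record-length message above can be reproduced in spirit with a few 
lines of Perl. This is only a sketch: leader_length_check is a hypothetical 
helper, the data is a toy string rather than a real MARC record, and it 
relies only on leader positions 00-04 holding the record length:

```perl
use strict;
use warnings;

# Hypothetical helper: compare the record length declared in leader
# positions 00-04 with the number of bytes actually present.
# Assumes $raw is a byte string (utf8 flag off).
sub leader_length_check {
    my ($raw) = @_;
    my $declared = substr($raw, 0, 5) + 0;
    my $actual   = length $raw;
    return ($declared, $actual);
}

# Toy data (not a real MARC record): a 5-digit length, filler, and the
# 0x1D record terminator.
my $body  = ('x' x 24) . "\x1D";
my $good  = sprintf('%05d', 5 + length $body) . $body;
my $short = $good;
substr($short, 10, 2, '');    # drop two bytes, leave the leader untouched

for my $record ($good, $short) {
    my ($declared, $actual) = leader_length_check($record);
    printf "Leader says %05d bytes, actual %d%s\n",
        $declared, $actual, $declared == $actual ? '' : ' (mismatch)';
}
```

The second toy record mimics what happens when bytes are lost on output 
after the leader has already been computed.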

When examined in vi the character in question, a Registered Sign, appears to be 
correctly UTF-8 encoded C2AE, and the bib Leader (position 09=a) indicates that 
it is Unicode encoded.  I've attached the MARC record.
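
Instead of eyeballing the bytes in vi, the same check can be sketched with 
the Encode module. The helper name is_valid_utf8 and the sample strings are 
made up for illustration:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($octets) = @_;
    my $copy = $octets;   # decode() may modify its argument when CHECK is set
    return eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

my $utf8_bytes   = "Leadership\xC2\xAE Model";  # Registered Sign as C2 AE
my $latin1_bytes = "Leadership\xAE Model";      # Registered Sign as a lone AE

printf "C2 AE: %s\n", is_valid_utf8($utf8_bytes)   ? 'valid UTF-8' : 'not UTF-8';
printf "AE:    %s\n", is_valid_utf8($latin1_bytes) ? 'valid UTF-8' : 'not UTF-8';
```

A lone AE byte is what the Registered Sign looks like in Latin-1, so a 
record that fails this check after a round trip has probably been 
downgraded on output.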

I noticed that when I run your record (ck245.dat) through MARC::Lint, I get the 
same invalid record length message:

 Invalid record length in record 3: Leader says 00567 bytes but it's actually 
569
 field does not end in end of field character in tag 100 in record 3
 field does not end in end of field character in tag 245 in record 3
 Invalid indicators ".10" forced to blanks in record 3 for tag 245

 field does not end in end of field character in tag 260 in record 3
 Invalid indicators ".  " forced to blanks in record 3 for tag 260

 field does not end in end of field character in tag 300 in record 3
 Invalid indicators ".  " forced to blanks in record 3 for tag 300

 field does not end in end of field character in tag 502 in record 3
 Invalid indicators ".  " forced to blanks in record 3 for tag 502

 field does not end in end of field character in tag 504 in record 3
 Invalid indicators ".  " forced to blanks in record 3 for tag 504

 field does not end in end of field character in tag 690 in record 3
 Invalid indicators ". 4" forced to blanks in record 3 for tag 690

Anybody have any ideas?

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
 

> -----Original Message-----
> From: Shieh, Jackie [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, February 19, 2008 10:50 AM
> To: perl4lib@perl.org
> Subject: Help for utf-8 output
> 
> I was wondering if anyone has similar experience and has come 
> up with good solutions to help solving the challenge below?!
> 
> What I have is an Excel spreadsheet for dissertations which I 
> have saved as a tab delimited file (examining the file in 
> TextPad, the diacritics appear to be fine), then read in and 
> output the file as a utf-8 MARC file. I  <print> title field 
> confirming author field that contains diacritics with the 
> title showing proper indicator values. 
> 
> But when I looked at the MARC itself, the fields that follow the 
> field containing diacritics are all shifted from their original 
> position. See attached zip file.  Examples below (the first two 
> have diacritics in a 100 field; in the last one the diacritic is 
> in 245 subfield b):
> 
> 001     diss 34001
> 100 1  _aP<E9>rez, Nancy L.
> 245     _aSynchronic and Diachronic Matlatzinkan Phonology.
> 
> 001     diss 34042
> 100 1  _aValent<ED>n-M<E1>rquez, Wilfredo
> 245     _aDoing being boricua :
> 
> 001     diss 33892
> 100 1   _aDavis, Jennifer M.
> 245 14 _aThe Functional Complexities of Inherited Cardiac 
> Troponin I Mutations :
>             _bIdentification of Ca<B2>+ Independent 
> Contractile Dysfunction.
> 
> I would greatly appreciate any suggestions for solving this. 
> Thank you most kindly. 
> 
> Regards, 
>  
> --Jackie 
>  
> |Jackie Shieh
> |Data Loads & Development
> |Harlan Hatcher Graduate Library
> |University of Michigan
> |920 North University
> |Ann Arbor, MI 48109-1205
> |Phone: 734.763.6070 FAX: 734.615.9788
> |E-mail: JShieh [AT] umich [DOT] edu
>
