> I was under the impression that the MARC record length in the
> Leader was the record length in bytes rather than the number
> of characters.
According to this source, the Leader record length is in bytes:
MARC Leader > record length = "Five numeric characters equal
to the total number of bytes in the logical record" [1]
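
As a rough sketch of what that definition implies, the check below compares the five digits in Leader positions 00-04 with the actual byte length of a raw record. The helper name leader_length_ok and the hand-built toy record are my own assumptions for illustration; neither comes from MARC::Record.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MARC::Record): does the length declared
# in Leader positions 00-04 equal the actual number of bytes in the record?
sub leader_length_ok {
    my ($raw) = @_;                       # raw record, read with binmode
    my $declared = substr($raw, 0, 5);
    return 0 unless $declared =~ /^[0-9]{5}$/;
    return $declared == length($raw) ? 1 : 0;
}

# Toy record: 19 remaining leader bytes, one directory entry, one 245 field.
my $field = "13\x1FaLeadership\xC2\xAE Model\x1E";   # C2 AE = UTF-8 for (R)
my $rest  = "cam a2200037   4500"                    # Leader positions 05-23
          . sprintf("245%04d%05d", length($field), 0) . "\x1E"
          . $field . "\x1D";
my $good  = sprintf("%05d", 5 + length($rest)) . $rest;

# The same record after the two-byte sequence collapsed to one Latin-1 byte.
(my $bad = $good) =~ s/\xC2\xAE/\xAE/;

print leader_length_ok($good) ? "ok" : "mismatch", "\n";   # prints "ok"
print leader_length_ok($bad)  ? "ok" : "mismatch", "\n";   # prints "mismatch"
```

The second record is one byte shorter than its Leader claims, which is the kind of disagreement the length warnings in this thread report.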
I also checked my charset mail folder and found this in a message from way back
in 2003:
"...there is some difficulty computing the record length properly,
since MARC::Record uses character length, rather than byte length,
which are the same thing when you are dealing with 8 bit characters."
-- Ed Summers [2]
I looked through the MARC::Record CHANGES file [3]. Although there are some
enhancements/fixes regarding the use of UTF-8, I don't see anything that
explicitly states that more current versions of MARC::Record now compute the
record length in bytes. It seems like that would be a good thing.
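
To make the character-versus-byte distinction concrete, here is a minimal sketch using only the core Encode module. The sample string is made up, and this is not a claim about how MARC::Record computes the length internally.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "Leadership\xC2\xAE";         # 12 bytes, as read with binmode
my $chars = decode('UTF-8', $bytes);      # 11 characters, utf8 flag on

my $char_len = length($chars);                     # counts characters
my $byte_len = length(encode('UTF-8', $chars));    # counts encoded bytes

printf "characters: %d, bytes: %d\n", $char_len, $byte_len;
# prints "characters: 11, bytes: 12"
```

For pure ASCII data the two numbers are equal, which would explain why the difference only shows up in records with diacritics.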
-- Michael
[1] MARC 21 Record Builder
http://www.loc.gov/marc/marc2onix.html
[2] "MARC-Charset-0.5 questions" July 2003 thread on perl4lib
[3] CHANGES : Revision history for Perl extension MARC::Record.
http://search.cpan.org/src/MIKERY/MARC-Record-2.0.0/Changes
# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
> -----Original Message-----
> From: Doran, Michael D
> Sent: Monday, March 03, 2008 10:36 AM
> To: 'Leif Andersson'; [email protected]
> Subject: RE: Help for utf-8 output
>
> Hi Leif,
>
> I really appreciate you taking a look at this and responding.
> Although I consider myself somewhat knowledgeable about
> character sets, I still find these kinds of problems to be confusing.
>
> > In this case the leader and actual length will not agree, as your utf8
> > characters have turned into latin1.
>
> I was under the impression that the MARC record length in the
> Leader was the record length in bytes rather than the number
> of characters. Is that your understanding?
>
> Also, I am still troubleshooting my particular set of records
> (I was out of town last week) since this problem only appears
> to manifest itself for records with non-ASCII characters in
> the 100 and 245 fields. Records with a note field having
> non-ASCII characters don't cause a problem.
>
> -- Michael
>
>
> > -----Original Message-----
> > From: Leif Andersson [mailto:[EMAIL PROTECTED]
> > Sent: Saturday, March 01, 2008 2:51 PM
> > To: Doran, Michael D; [email protected]; [EMAIL PROTECTED]
> > Subject: Re: Help for utf-8 output
> >
> > It seems there is a little bug (by design) kicking in.
> >
> > The leader comes out wrong and some characters come out wrong in this case:
> > + Reading a raw marc record (utf8) from file
> > + Turning it into a MARC::Record object
> > + Without modification writing it out to file.
> > Yes. Even without modification the bug manifests itself!
> >
> > Let's start with code simply copying one record from a file utf8.mrc
> > containing one or more marc records. This basic operation not
> > involving MARC::Record is OK.
> >
> > #!perl -w
> > use strict;
> > #
> > open(IN, "utf8.mrc") || die "1";
> > open(OUT, ">out_good.mrc") || die "2";
> > binmode IN;
> > binmode OUT;
> > #
> > # Read in raw MARC
> > $/ = "\x1D";
> > my $marc = <IN>;
> > print OUT $marc;
> > __END__
> >
> > Now, we're adding MARC::Record to the process, along with some debug
> > info.
> > Example code producing *faulty* record:
> >
> > #!perl -w
> > use strict;
> > use MARC::Record;
> > use Devel::Peek;
> > #
> > open(IN, "utf8.mrc") || die "1";
> > open(OUT, ">out_bad.mrc") || die "2";
> > binmode IN;
> > binmode OUT;
> > #
> > # Read in raw MARC
> > $/ = "\x1D";
> > my $marc = <IN>;
> > Dump($marc); # the utf8-flag is not on
> > my $obj = MARC::Record->new_from_usmarc( $marc );
> > # Convert back to raw MARC
> > my $marc2 = $obj->as_usmarc();
> > Dump($marc2); # the utf8-flag IS on
> > print OUT $marc2;
> > __END__
> >
> >
> > In this case the leader and actual length will not agree, as your utf8
> > characters have turned into latin1.
> > The problem is that $marc2 has the utf8 flag set internally by Perl.
> > And the conversion on output is made in spite of binmode.
> >
> > We can get around the problem by either (for instance) use bytes;
> > or
> > Encode::_utf8_off($marc2);
> > before printing to file.
> >
> > But shouldn't MARC::Record take care of this for us?
> > A file of MARC records may contain records in different encodings.
> > The text parts of a MARC record can be treated as made up by certain
> > encodings, but the "blob" itself, I suppose, should be exposed to the
> > caller as pure binary.
> >
> > Are there any drawbacks in letting MARC::Record strip off any
> > utf8 flag that may be set before returning the record as_usmarc() ?
> > If not I suggest this change be made to a future release of
> > MARC::Record.
> >
> > I shall also add that this character mess only sets in when doing IO.
> > If you are updating your databases through one API or another you are
> > probably OK!
> >
> >
> > Leif
> > ======================================
> > Leif Andersson, Systems Librarian
> > Stockholm University Library
> > SE-106 91 Stockholm
> > SWEDEN
> > Phone : +46 8 162769
> > Mobile: +46 70 6904281
> >
> > -----Original Message-----
> > From: Doran, Michael D [mailto:[EMAIL PROTECTED]
> > Sent: February 21, 2008 18:49
> > To: [email protected]
> > Subject: RE: Help for utf-8 output
> >
> > Hi Jackie,
> >
> > I'm working on a very similar problem... converting
> > theses/dissertations records (in XML) to MARC records. I'm still in
> > the testing stage, but have had similar problems with records with
> > diacritics in the 100 or 245 fields (however diacritics in a 520a
> > field don't seem to cause any problems). Since our records are not
> > "diacritic rich" it's hard to determine the exact extent of the
> > problem.
> >
> > I am using these versions:
> > Perl v5.8.8
> > MARC::Charset 0.98
> > MARC::Lint 1.43
> > MARC::Record 2.0
> > XML::LibXML 1.66
> >
> > Here's an example "bad" record (which I have minimized to just the 245
> > field):
> >
> > marcdump test.mrc
> > test.mrc
> > LDR 00127cam a2200037 4500
> > 245 13 _aAn Empirical Test Of The Situational Leadership® Model In Japan /
> > _cRiho Yoshioka.
> >
> > Recs Errs Filename
> > ----- ----- --------
> > 1 1 test.mrc
> >
> > When I run test.mrc through MARC::Lint, I get this message:
> >
> > Invalid record length in record 1: Leader says 00127 bytes but it's actually 125
> > Invalid length in directory for tag 245 in record 1
> > field does not end in end of field character in tag 245 in record 1
> >
> > When examined in vi the character in question, a Registered Sign,
> > appears to be correctly UTF-8 encoded C2AE, and the bib Leader
> > (position 09=a) indicates that it is Unicode encoded.
> > I've attached the MARC record.
> >
> > I noticed that when I run your record (ck245.dat) through MARC::Lint,
> > I get the same invalid record length message:
> >
> > Invalid record length in record 3: Leader says 00567 bytes but it's actually 569
> > field does not end in end of field character in tag 100 in record 3
> > field does not end in end of field character in tag 245 in record 3
> > Invalid indicators ".10" forced to blanks in record 3 for tag 245
> >
> > field does not end in end of field character in tag 260 in record 3
> > Invalid indicators ". " forced to blanks in record 3 for tag 260
> >
> > field does not end in end of field character in tag 300 in record 3
> > Invalid indicators ". " forced to blanks in record 3 for tag 300
> >
> > field does not end in end of field character in tag 502 in record 3
> > Invalid indicators ". " forced to blanks in record 3 for tag 502
> >
> > field does not end in end of field character in tag 504 in record 3
> > Invalid indicators ". " forced to blanks in record 3 for tag 504
> >
> > field does not end in end of field character in tag 690 in record 3
> > Invalid indicators ". 4" forced to blanks in record 3 for tag 690
> >
> > Anybody have any ideas?
> >
> > -- Michael
> >
> >
> > > -----Original Message-----
> > > From: Shieh, Jackie [mailto:[EMAIL PROTECTED]
> > > Sent: Tuesday, February 19, 2008 10:50 AM
> > > To: [email protected]
> > > Subject: Help for utf-8 output
> > >
> > > I was wondering if anyone has had a similar experience and has come up
> > > with a good solution to the challenge below?
> > >
> > > What I have is an Excel spreadsheet for dissertations which I have
> > > saved as a tab-delimited file (examining the file in TextPad, the
> > > diacritics appear to be fine), then read in and output as a
> > > utf-8 MARC file. I <print> the title field to confirm that, when the
> > > author field contains diacritics, the title shows the proper
> > > indicator values.
> > >
> > > But when I looked at the MARC itself, the fields that follow the field
> > > containing diacritics are all shifted from their original positions.
> > > See attached zip file. Examples below (the first two have diacritics
> > > in a 100 field; in the last one the diacritic is in 245 subfield b):
> > >
> > > 001 diss 34001
> > > 100 1 _aP<E9>rez, Nancy L.
> > > 245 _aSynchronic and Diachronic Matlatzinkan Phonology.
> > >
> > > 001 diss 34042
> > > 100 1 _aValent<ED>n-M<E1>rquez, Wilfredo
> > > 245 _aDoing being boricua :
> > >
> > > 001 diss 33892
> > > 100 1 _aDavis, Jennifer M.
> > > 245 14 _aThe Functional Complexities of Inherited Cardiac Troponin I Mutations :
> > > _bIdentification of Ca<B2>+ Independent Contractile Dysfunction.
> > >
> > > I would greatly appreciate any suggestions for solving this.
> > > Thank you most kindly.
> > >
> > > Regards,
> > >
> > > --Jackie
> > >
> > > |Jackie Shieh
> > > |Data Loads & Development
> > > |Harlan Hatcher Graduate Library
> > > |University of Michigan
> > > |920 North University
> > > |Ann Arbor, MI 48109-1205
> > > |Phone: 734.763.6070 FAX: 734.615.9788
> > > |E-mail: JShieh [AT] umich [DOT] edu
> > >
> >