MARC::Record and UTF-8 & related threads

Anne L. Highsmith Fri, 04 Mar 2005 07:20:27 -0800

(Apologies in advance if this ends up going out twice; I am resending from a 
different email account)

In December and January several messages were sent to this list discussing 
aspects of the above topics.  I was concerned about the discussion, because I 
do a lot of MARC record processing for records that are already in the database 
and am scheduled to have the database converted to Unicode in mid-May. If I 
understood the discussion correctly, particularly the point that Ed is making 
below, I was faced with the prospect of all of those programs not working any 
longer because MARC::Record cannot handle Unicode--the existing versions of the 
module would not handle directory lengths properly.  And whatever may have been 
done since January, I'm pretty sure that the version of MARC::Record that I'm 
using will not handle Unicode, because it is somewhat elderly, i.e. something 
like 1.29.

Here's my main question -- is that the principal concern/question/problem, i.e. 
that directory lengths will not be computed correctly using the existing  
MARC::Record module with a Unicode record?  Or is it only in certain situations 
that the directory length would not be computed correctly?

In order to educate myself somewhat on this point, I compared raw MARC record 
versions of the same record from my production database, which has not been 
converted to Unicode, and my test database, which has.  The record in question 
has three diacritics in the 245 field.  The length segment of the directory 
entry for that field in the Unicode version of the record is 115, while in the 
non-Unicode version of the record, it is 112.  I did an experiment -- I started 
with the Unicode version of the record and modified it by deleting one 
character from the 245 field.  I then reloaded it into the Unicode database.  
The database doesn't have any trouble with it -- it loaded OK, displays 
correctly in the cataloging client, and updates correctly online.  Is this the 
behavior you would expect?
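
As a rough illustration of where a three-byte difference like 115 vs. 112 can come from, here is a minimal sketch. It assumes each diacritic is stored as a separate combining character, which takes one byte in MARC-8 but two bytes in UTF-8. The title string is made up for the example and is not the actual record.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical title with three combining diacritics (NOT the real record).
# U+0301 is the combining acute accent, U+0300 the combining grave accent.
my $title = "Me\x{0301}moires d'un e\x{0301}le\x{0300}ve";

# Number of characters. Under the assumption above this is also the
# MARC-8 byte count, since every character here is one byte in MARC-8.
my $chars = length($title);

# Number of bytes once the same string is encoded as UTF-8.
my $utf8_bytes = length( encode( 'UTF-8', $title ) );

printf "characters: %d, UTF-8 bytes: %d, difference: %d\n",
    $chars, $utf8_bytes, $utf8_bytes - $chars;
```

Each combining mark costs one extra byte in UTF-8, so three diacritics account for a difference of three.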

So in this case, it appears to have worked okay. Under what conditions would 
you expect it NOT to work okay? In situations where you are actually trying to 
futz (highly technical term) with the diacritics themselves?

If anyone is inspired to make the necessary updates to the MARC::Record module 
to handle Unicode records, I'd certainly be happy to test. I'd also be 
eternally grateful, since my alternative might be re-writing 8 or 10 job 
streams in the next 10 weeks so that I can: 1) export the records from my 
database in MARC8; 2) edit them; 3) reload them using a MARC8-to-Unicode 
conversion utility provided by the LMS vendor.

>>> Ed Summers <[EMAIL PROTECTED]> 01/07/05 08:54AM >>>
On Fri, Jan 07, 2005 at 08:53:40AM +0100, Ron Davies wrote:
> I will have a similar project in a few months' time, converting a whole 
> bunch of processing from MARC-8 to UTF-8. I would be very happy to assist 
> in testing or development of a UTF-8 capability for MARC::Record. Is the 
> problem listed in rt.cpan.org (http://rt.cpan.org/NoAuth/Bug.html?id=3707) 
> the only known issue?

Correct. A few months ago I hacked at MARC::Record to try to get it to
use utf8 for platforms that support perl >= 5.8.

I backed out these changes because my initial implementation proved to
be faulty. Essentially I treated all data as utf8 if perl was >= 5.8
... but this didn't work out since some valid MARC-8 data is invalid
UTF-8. I was bummed. 

The problem (as Ron correctly points out) is that the Perl function length() 
is being used to construct the field lengths and byte offsets in the record 
directory. This works fine when a character is a byte, but breaks badly on 
utf8 data, where length() counts characters and a character can be more than 
one byte.

Fortunately there is the bytes pragma, which was introduced in 5.6 and
has a bytes::length() function that computes the correct length. I
believe that bytes::length() itself was added on later, somewhere in 5.8.
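
To make the difference concrete, here is a minimal sketch, not the MARC::Record code itself. The field content is a made-up 245-style field, and the directory entry layout (3-character tag, 4-digit length, 5-digit offset) follows the MARC 21 record structure.

```perl
use strict;
use warnings;
use bytes ();                # load bytes::length() without enabling byte semantics
use Encode qw(encode);

# Hypothetical field: two indicators, subfield delimiter (0x1F), subfield
# code 'a', the text "cafe" plus a combining acute (U+0301), and the field
# terminator (0x1E).
my $field = "10\x{1f}acafe\x{0301}\x{1e}";

my $char_len = length($field);           # counts characters
my $byte_len = bytes::length($field);    # counts bytes of the UTF-8 form

# Alternative that avoids the bytes pragma: encode explicitly, then measure.
my $encoded_len = length( encode( 'UTF-8', $field ) );

# Directory entry built from the byte length; the offset 0 is an assumption
# (this would be the first field in the record).
my $entry = sprintf '%s%04d%05d', '245', $byte_len, 0;

print "length(): $char_len, bytes::length(): $byte_len, encoded: $encoded_len\n";
print "directory entry: $entry\n";
```

Building the directory entry from $char_len instead would leave it one byte short for this field, which is the kind of error described above.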

I wanted MARC::Record to do the right thing based on position 9 in the
leader. But I don't know if this is feasible. Perhaps simply having a
flag when you create the MARC::Record, MARC::Batch or MARC::File::USMARC
objects will be enough.

    my $batch = MARC::Batch->new( 'USMARC', 'file.dat', utf8 => 1 );

or

    my $record = MARC::Record->new( utf8 => 1 );
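
For the leader-based approach mentioned above, a minimal sketch of the check might look like the following. In MARC 21, leader position 9 is 'a' for UCS/Unicode and blank for MARC-8. The helper name leader_says_unicode and the two sample leaders are hypothetical and not part of MARC::Record.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if leader position 9 (zero-based) is 'a',
# the MARC 21 code for UCS/Unicode, and 0 otherwise (blank means MARC-8).
sub leader_says_unicode {
    my ($leader) = @_;
    return 0 unless defined $leader && length($leader) >= 10;
    return substr( $leader, 9, 1 ) eq 'a' ? 1 : 0;
}

# Made-up 24-character leaders that differ only in position 9.
my $marc8_leader   = '00714cam  2200205 a 4500';
my $unicode_leader = '00714cam a2200205 a 4500';

for my $leader ( $marc8_leader, $unicode_leader ) {
    printf "%s => %s\n", $leader,
        leader_says_unicode($leader) ? 'Unicode' : 'MARC-8';
}
```

This only reports what the leader claims. A record whose leader is wrong would still need an explicit flag like the one proposed above.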

Comments, thoughts, hacks welcome :-) This shouldn't be too tough, it
just needs some concentrated attention.

//Ed
