(Apologies in advance if this ends up going out twice; I am resending from a different email account)
In December and January, several messages were sent to this list discussing aspects of the above topics. I was concerned about the discussion because I do a lot of MARC record processing for records that are already in the database, and the database is scheduled to be converted to Unicode in mid-May.

If I understood the discussion correctly, particularly the point that Ed is making below, I was faced with the prospect of all of those programs no longer working because MARC::Record cannot handle Unicode: the existing versions of the module would not compute directory lengths properly. And whatever may have been done since January, I'm pretty sure that the version of MARC::Record that I'm using will not handle Unicode, because it is somewhat elderly, i.e. something like 1.29.

Here's my main question: is that the principal concern/question/problem, i.e. that directory lengths will not be computed correctly using the existing MARC::Record module with a Unicode record? Or is it only in certain situations that the directory length would not be computed correctly?

In order to educate myself somewhat on this point, I compared raw MARC versions of the same record from my production database, which has not been converted to Unicode, and my test database, which has. The record in question has three diacritics in the 245 field. The length segment of the directory entry for that field is 115 in the Unicode version of the record and 112 in the non-Unicode version.

I did an experiment: I started with the Unicode version of the record and modified it by deleting one character from the 245 field. I then reloaded it into the Unicode database. The database doesn't have any trouble with it: it loaded OK, displays correctly in the cataloging client, and updates correctly online.

Is this the behavior you would expect? In this case, it appears to have worked okay. Under what conditions would you expect it NOT to work okay?
In situations where you are actually trying to futz (highly technical term) with the diacritics themselves?

If anyone is inspired to make the necessary updates to the MARC::Record module to handle Unicode records, I'd certainly be happy to test. I'd also be eternally grateful, since my alternative might be rewriting 8 or 10 job streams in the next 10 weeks so that I can:

1) export the records from my database in MARC8;
2) edit them;
3) reload them using a MARC8-Unicode conversion utility provided by the LMS vendor.

>>> Ed Summers <[EMAIL PROTECTED]> 01/07/05 08:54AM >>>
On Fri, Jan 07, 2005 at 08:53:40AM +0100, Ron Davies wrote:
> I will have a similar project in a few months' time, converting a whole
> bunch of processing from MARC-8 to UTF-8. I would be very happy to assist
> in testing or development of a UTF-8 capability for MARC::Record. Is the
> problem listed in rt.cpan.org (http://rt.cpan.org/NoAuth/Bug.html?id=3707)
> the only known issue?

Correct. A few months ago I hacked at MARC::Record to try to get it to use utf8 for platforms that support perl >= 5.8. I backed out these changes because my initial implementation proved to be faulty. Essentially I treated all data as utf8 if perl was >= 5.8 ... but this didn't work out since some valid MARC-8 data is invalid UTF-8. I was bummed.

The problem (as Ron correctly points out) is that the Perl function length() is being used to construct the byte offsets in the record directory. This works fine when a character is a byte, but breaks badly on utf8 data, since a character can be more than one byte. Fortunately there is the bytes pragma, introduced in 5.6, which has a bytes::length() function that computes the correct length. I believe that bytes::length() was introduced somewhere in 5.8; it was added later.

I wanted MARC::Record to do the right thing based on position 9 in the leader. But I don't know if this is feasible.
Perhaps simply having a flag when you create the MARC::Record, MARC::Batch or MARC::File::USMARC objects will be enough.

    my $batch = MARC::Batch->new( 'USMARC', 'file.dat', utf8 => 1 );

or

    my $record = MARC::Record->new( utf8 => 1 );

Comments, thoughts, hacks welcome :-) This shouldn't be too tough; it just needs some concentrated attention.

//Ed
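
The length() issue Ed describes can be seen in a few lines of plain Perl. This is only a minimal sketch and not MARC::Record code: the field text is made up (three combining diacritics, like the 245 field discussed above, but not the actual record), and it assumes the data is held as a decoded Perl character string rather than as raw bytes read from a file.

```perl
use strict;
use warnings;
use bytes ();                 # load bytes::length() without enabling byte semantics
use Encode qw(encode);

# Hypothetical field content: "Cafe theatre" plus three combining diacritics
# (U+0301 twice, U+0302 once). Not taken from any real record.
my $chars = "Cafe\x{0301} the\x{0301}a\x{0302}tre";

# On a character string, length() counts characters, not bytes.
my $char_len = length($chars);

# Encoding to UTF-8 first gives the number of bytes that would be written out.
my $byte_len = length( encode( 'UTF-8', $chars ) );

# bytes::length() reports the byte length of Perl's internal UTF-8 form.
my $pragma_len = bytes::length($chars);

print "length():              $char_len\n";    # 15
print "length(encode(...)):   $byte_len\n";    # 18
print "bytes::length():       $pragma_len\n";  # 18
```

Each combining diacritic here is one character but two bytes in UTF-8, so a directory entry built with length() on this string would come out three bytes short. If the record data is read as raw bytes and never decoded, length() already counts bytes, which might be why the delete-one-character experiment loaded cleanly; that is a guess and has not been checked against the module source.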