Re: MARC-perl: different versions yield different results

Leif Andersson Tue, 12 Oct 2010 09:17:36 -0700

Yes, I know...

and as I mentioned in my answer to Ed you can just add


sub MARC::File::Encode::marc_to_utf8 {
    return Encode::decode( 'UTF-8', $_[0], 0 );
}

to that package MARC_Record_hack

Or make the changes directly in MARC::File::Encode.pm
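
As a rough illustration of what the third argument (CHECK) to Encode::decode changes, here is a minimal sketch using only the core Encode module. The sample bytes are made up for the demonstration, and it assumes that the stock marc_to_utf8 passes a strict CHECK value, which is not shown in this thread.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical sample: \xE8 is a lone MARC-8 byte, not valid UTF-8.
my $marc8_bytes = "Hs\xE8uan-Ch'eng";

# CHECK = 1 (Encode::FB_CROAK): decode dies on malformed input.
# A copy is passed because a true CHECK value may modify the source.
my $copy   = $marc8_bytes;
my $strict = eval { Encode::decode( 'UTF-8', $copy, 1 ) };
print defined $strict ? "strict: decoded\n" : "strict: died\n";

# CHECK = 0 (Encode::FB_DEFAULT): malformed bytes become U+FFFD
# and the program keeps running.
my $lenient = Encode::decode( 'UTF-8', $marc8_bytes, 0 );
print $lenient =~ /\x{FFFD}/
    ? "lenient: bad byte replaced with U+FFFD\n"
    : "lenient: no replacement needed\n";
```

Note that the lenient form keeps the program alive but silently replaces the bad byte, so a mislabelled MARC-8 record will still come out damaged.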

I do not feel perfectly comfortable making local changes to the .pm files
myself, so I have to admit I have also added a comment and a warning that
prints to STDERR.

Just below
package MARC::File::USMARC;
warn __PACKAGE__ . " has been modified. Added 'use bytes' in function encode at line 315\n";
# This makes Perl treat the record as pure binary. It avoids "Wide character
# in print" warnings and corrupted character encodings when writing
# non-UTF-8 records to file.

And similarly in MARC::File::Encode.

That is because I tend to forget so quickly and easily ;-)
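
To show why the 'use bytes' change matters, here is a small sketch that uses only core Perl, not MARC::Record itself. The variable names and the single-character "field" are made up for illustration. It compares the byte count of a decoded string with what actually gets written to a handle that has no encoding layer.

```perl
use strict;
use warnings;
use Encode ();
use bytes ();    # load bytes::length without enabling byte semantics

# UTF-8 bytes for "á" (\xC3\xA1), decoded into a Perl character string.
my $field = Encode::decode( 'UTF-8', "\xC3\xA1" );
printf "characters: %d, bytes: %d\n", length($field), bytes::length($field);

# Printing the character string to a handle with no encoding layer
# downgrades U+00E1 to the single byte \xE1.
open my $fh, '>', \my $plain or die "open: $!";
binmode $fh;
print {$fh} join( '', $field, "\x1E" );
close $fh;

# Joining under 'use bytes' keeps the internal UTF-8 bytes as they are.
my $raw;
{
    use bytes;
    $raw = join( '', $field, "\x1E" );
}
open $fh, '>', \my $binary or die "open: $!";
binmode $fh;
print {$fh} $raw;
close $fh;

printf "without use bytes: %d bytes written\n", length $plain;
printf "with use bytes:    %d bytes written\n", length $binary;
```

If the lengths in the leader and directory are counted in bytes but the output is downgraded on print, the two no longer agree, which would explain both the wrong leader length and the stray \xE1.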

/Leif

________________________________________
From: Al [ra...@berkeley.edu]
Sent: 12 October 2010 17:45
To: Leif Andersson; perl4lib@perl.org
Subject: Re: MARC-perl: different versions yield different results

Thanks, that does indeed do the trick.

 >MARC::Record 2.0.0, the so-called Unicode version, introduced the
 >problem you describe.

Good to know. I hadn't gleaned that fact from all the messages I'd read.

I have a second, related question: MARC::Record 2.0.0 and Encode 2.40 are
now more sensitive to leader byte 9. That is, if the leader is set
incorrectly for the record's encoding, the program dies with a Unicode
error. I deal with tens of thousands of records from a variety of sources
and we simply must live with these bad records. I know how to prevent the
program from dying and deal with these records by redefining
Encode::decode() but that's a blanket solution that ignores all Encode
errors. Is there a way to get the program to ignore just the leader 9
mismatch errors (again taking into account the batch will contain a mixture
of MARC 8 and UTF-8 encodings)?

Sample of 5 records with incorrect leader 9:
http://www.mediafire.com/file/4wf5mpa9zba5195/badrecs_sample.zip
The records are kind of large, sorry. The error occurs in the first record
in the 505, sheet 17. Hsèuan-Ch'eng. The 3rd character of the name is a
MARC 8 umlaut, \xE8.

Sample program:

use MARC::Batch;
use bytes;

my $batch = new MARC::Batch('USMARC', $ARGV[0]);
$batch->strict_off ();
$batch->warnings_off ();

my $record = $batch->next;
while ($record) {
    print $record->as_usmarc;
    $record = $batch->next;
}

The later version of MARC::Record will die on the first record. The earlier
version will process them all.

Al


At 10/12/2010, Leif Andersson wrote:
 >This has nothing to do with Perl versions.
 >
 >MARC::Record 1.38 and earlier does not display this problem.
 >MARC::Record 2.0.0, the so called unicode version, introduced the problem
 >you describe.
 >That is, when writing records it causes an incorrect leader length and
 >corrupted UTF-8.
 >
 >There are different ways to deal with this.
 >Myself I have changed one of the modules.
 >
 >MARC::File::USMARC
 >It has a function called encode() around line 315
 >I have added a "use bytes;" just before the final return. Like this:
 >
 >use bytes;
 >return join("",$marc->leader, @$directory, END_OF_FIELD, @$fields,
 >END_OF_RECORD);
 >
 >Changing installed code directly like this is a total "no-no" to many
 >programmers. If you feel uncomfortable with it, there are other ways to
 >do the same thing.
 >You could write a package:
 >
 >package MARC_Record_hack;
 >use MARC::File::USMARC;
 >no warnings 'redefine';
 >sub MARC::File::USMARC::encode {
 >    my $marc = shift;
 >    $marc = shift if (ref($marc)||$marc) =~ /^MARC::File/;
 >    my ($fields,$directory,$reclen,$baseaddress) =
 >MARC::File::USMARC::_build_tag_directory($marc);
 >    $marc->set_leader_lengths( $reclen, $baseaddress );
 >    # Glomp it all together
 >    use bytes;
 >    return join("",$marc->leader, @$directory, "\x1E", @$fields, "\x1D");
 >}
 >use warnings;
 >1;
 >__END__
 >
 >With the inclusion of this package your original code should work fine, I'd
 >guess.
 >
 >
 >use MARC::Batch;
 >use MARC_Record_hack;
 >my $batch = new MARC::Batch('USMARC', $ARGV[0]);
 >$batch->strict_off ();
 >$batch->warnings_off ();
 >#binmode( STDOUT, ':raw' );
 >#binmode STDOUT;
 >my $record = $batch->next;
 >print $record->as_usmarc;
 >
 >
 >As a habit I use
 >binmode FH;
 >when I write records to file.
 >It is not needed, but it keeps me from the temptation of making any other
 >assumptions about character encodings.
 >
 >/Leif Andersson
 >Stockholm University Library
 >
 >________________________________________
 >From: Al [ra...@berkeley.edu]
 >Sent: 12 October 2010 00:03
 >To: perl4lib@perl.org
 >Subject: MARC-perl: different versions yield different results
 >
 >Example marc record is here:
 >http://www.mediafire.com/file/u5cxkrfwh9ew09z/example.zip
 >
 >When I process the record above in perl 5.8, MARC::Record version 1.38, and
 >Encode.pm version 2.12, the record comes out fine.
 >
 >When I use perl 5.10, MARC::Record version 2.0.0, and Encode.pm 2.40 the
 >record comes out corrupted and MARC::Record will no longer read the result.
 >
 >The problem is with a Unicode character (big surprise). The earlier version
 >leaves the \xC3\xA1 character intact, the later version changes it to \xE1
 >which is invalid. I've read as many of the perl4lib messages on the subject
 >of UTF-8 as I could but my eyes are spinning. I'm hoping by including a
 >complete but simple perl program and making a MARC record available that
 >somebody can explain to me in detail what is going on. My inclination is to
 >simply revert to the earlier version of perl but perhaps if I really
 >understood the issue that may not be necessary.
 >
 >Here is the test program I use:
 >
 >use MARC::Batch;
 >my $batch = new MARC::Batch('USMARC', $ARGV[0]);
 >$batch->strict_off ();
 >$batch->warnings_off ();
 >#binmode( STDOUT, ':utf8' );
 >my $record = $batch->next;
 >print $record->as_usmarc;
 >
 >Run the program on the record, then run it again on the output and the
 >second time perl quits with an error:
 >
 >utf8 "\xE1" does not map to Unicode at Encode.pm line 174.
 >
 >That should not happen.
 >
 >Why the different behavior with the different versions? I can't see
 >anything wrong with the original record - it's valid UTF8 as far as I can
 >tell. Leader byte 9 is correctly set to 'a'. Uncommenting the binmode line
 >seems to work - the character is output unchanged as is supposed to happen.
 >The problem is my record batches are a mixture of UTF8 and MARC8 and
 >explicitly setting binmode screws things up. I need a solution that
 >transparently handles a mix of record encodings.
 >
 >I rather suspect the problem is with Encode.pm and not MARC perl but I
 >can't be sure. It also may be due to the way perl handles IO between
 >version 5.8 and 5.10. BTW the problem happens on Windows and Unix.
 >
 >Thanks for any advice you can give me,
 >
 >Al
