Yes, I know... and as I mentioned in my answer to Ed you can just add

sub MARC::File::Encode::marc_to_utf8 { return Encode::decode( 'UTF-8', $_[0], 0 ); }

to that package MARC_Record_hack.
Or make the changes directly in MARC::File::Encode.pm.

I do not feel perfectly comfortable myself with changing the .pm files locally,
so I have to admit I have also added a comment and a warning that prints out to STDERR,
just below the package statement:

package MARC::File::USMARC;
warn __PACKAGE__ . " has been modified. Added 'use bytes' in function encode at line 315\n";
# This is to make Perl treat the record as pure binary, and avoid "Wide character in print" warnings and
# corrupted character encodings when writing non-utf8 records to file.

And in a similar way in MARC::File::Encode.
That is because I tend to forget so quickly and easily ;-)

/Leif

________________________________________
From: Al [ra...@berkeley.edu]
Sent: 12 October 2010 17:45
To: Leif Andersson; perl4lib@perl.org
Subject: Re: MARC-perl: different versions yield different results

Thanks, that does indeed do the trick.

>MARC::Record 2.0.0, the so called unicode version, introduced the problem you describe.

Good to know. I hadn't gleaned that fact from all the messages I'd read.

I have a second, related question: MARC::Record 2.0.0 and Encode 2.40 are
now more sensitive to leader byte 9. That is, if the leader is set
incorrectly for the record's encoding, the program dies with a Unicode
error. I deal with tens of thousands of records from a variety of sources
and we simply must live with these bad records. I know how to prevent the
program from dying and deal with these records by redefining
Encode::decode(), but that's a blanket solution that ignores all Encode
errors. Is there a way to get the program to ignore just the leader 9
mismatch errors (again taking into account that the batch will contain a
mixture of MARC 8 and UTF-8 encodings)?

Sample of 5 records with incorrect leader 9:
http://www.mediafire.com/file/4wf5mpa9zba5195/badrecs_sample.zip

The records are kind of large, sorry.
The error occurs in the first record in the 505, sheet 17. Hsèuan-Ch'eng.
The 3rd character of the name is a MARC 8 umlaut, \xE8.

Sample program:

use MARC::Batch;
use bytes;
my $batch = new MARC::Batch('USMARC', $ARGV[0]);
$batch->strict_off ();
$batch->warnings_off ();
my $record = $batch->next;
while ($record) {
    print $record->as_usmarc;
    $record = $batch->next;
}

The later version of MARC::Record will die on the first record. The earlier
version will process them all.

Al

At 10/12/2010, Leif Andersson wrote:
>This has nothing to do with Perl versions.
>
>MARC::Record 1.38 and earlier does not display this problem.
>MARC::Record 2.0.0, the so called unicode version, introduced the problem
>you describe.
>That is when writing records: causing incorrect leader length and corrupted
>utf-8
>
>There are different ways to deal with this.
>Myself I have changed one of the modules.
>
>MARC::File::USMARC
>It has a function called encode() around line 315
>I have added a "use bytes;" just before the final return. Like this:
>
>use bytes;
>return join("",$marc->leader, @$directory, END_OF_FIELD, @$fields,
>END_OF_RECORD);
>
>To change directly in code like this is totally "no-no" to many programmers.
>If you feel uncomfortable with this, there are other methods doing the same
>stuff.
>You could write a package:
>
>package MARC_Record_hack;
>use MARC::File::USMARC;
>no warnings 'redefine';
>sub MARC::File::USMARC::encode() {
>     my $marc = shift;
>     $marc = shift if (ref($marc)||$marc) =~ /^MARC::File/;
>     my ($fields,$directory,$reclen,$baseaddress) =
>MARC::File::USMARC::_build_tag_directory($marc);
>     $marc->set_leader_lengths( $reclen, $baseaddress );
>     # Glomp it all together
>     use bytes;
>     return join("",$marc->leader, @$directory, "\x1E", @$fields, "\x1D");
>}
>use warnings;
>1;
>__END__
>
>With the inclusion of this package your original code should work fine, I'd
>guess.
>
>
>use MARC::Batch;
>use MARC_Record_hack;
>my $batch = new MARC::Batch('USMARC', $ARGV[0]);
>$batch->strict_off ();
>$batch->warnings_off ();
>#binmode( STDOUT, ':raw' );
>#binmode STDOUT;
>my $record = $batch->next;
>print $record->as_usmarc;
>
>
>As a habit I use
>binmode FH;
>when I write records to file.
>It is not needed, but it keeps me from the temptation of making any other
>assumptions about character encodings.
>
>/Leif Andersson
>Stockholm University Library
>
>________________________________________
>From: Al [ra...@berkeley.edu]
>Sent: 12 October 2010 00:03
>To: perl4lib@perl.org
>Subject: MARC-perl: different versions yield different results
>
>Example marc record is here:
>http://www.mediafire.com/file/u5cxkrfwh9ew09z/example.zip
>
>When I process the record above in perl 5.8, MARC::Record version 1.38, and
>Encode.pm version 2.12, the record comes out fine.
>
>When I use perl 5.10, MARC::Record version 2.0.0, and Encode.pm 2.40 the
>record comes out corrupted and MARC::Record will no longer read the result.
>
>The problem is with a Unicode character (big surprise). The earlier version
>leaves the \xC3\xA1 character intact, the later version changes it to \xE1
>which is invalid. I've read as many of the perl4lib messages on the subject
>of UTF-8 as I could but my eyes are spinning. I'm hoping by including a
>complete but simple perl program and making a MARC record available that
>somebody can explain to me in detail what is going on. My inclination is to
>simply revert to the earlier version of perl but perhaps if I really
>understood the issue that may not be necessary.
>
>Here is the test program I use:
>
>use MARC::Batch;
>my $batch = new MARC::Batch('USMARC', $ARGV[0]);
>$batch->strict_off ();
>$batch->warnings_off ();
>#binmode( STDOUT, ':utf8' );
>my $record = $batch->next;
>print $record->as_usmarc;
>
>Run the program on the record, then run it again on the output and the
>second time perl quits with an error:
>
>utf8 "\xE1" does not map to Unicode at Encode.pm line 174.
>
>That should not happen.
>
>Why the different behavior with the different versions? I can't see
>anything wrong with the original record - it's valid UTF8 as far as I can
>tell. Leader byte 9 is correctly set to 'a'. Uncommenting the binmode line
>seems to work - the character is output unchanged as is supposed to happen.
>The problem is my record batches are a mixture of UTF8 and MARC8 and
>explicitly setting binmode screws things up. I need a solution that
>transparently handles a mix of record encodings.
>
>I rather suspect the problem is with Encode.pm and not MARC perl but I
>can't be sure. It also may be due to the way perl handles IO between
>version 5.8 and 5.10. BTW the problem happens on Windows and Unix.
>
>Thanks for any advice you can give me,
>
>Al
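
The character-versus-byte distinction that this thread turns on can be sketched with core Perl and Encode alone, without MARC::Record. This is only an illustration under stated assumptions: the variable names are made up for the example and do not come from the MARC modules, and the two-byte string stands in for the character in the example record. It shows that the decoded character counts as one character but two bytes, which would be consistent with the incorrect leader length described above, and that downgrading the decoded string (roughly what print does on a handle with no encoding layer) leaves the lone byte \xE1 named in the error message.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The two UTF-8 bytes C3 A1, standing in for the character in the raw record.
my $raw = "\xC3\xA1";

# Decoding yields a single character, U+00E1.
my $chars = decode('UTF-8', $raw);

# length() counts characters by default and bytes under "use bytes".
my $char_len = length($chars);
my $byte_len = do { use bytes; length($chars) };

# Downgrading stores the character as the single Latin-1 byte \xE1.
my $latin1 = $chars;
utf8::downgrade($latin1);
my $latin1_len = do { use bytes; length($latin1) };

# Encoding explicitly before output restores the original two bytes.
my $out = encode('UTF-8', $chars);
my $roundtrip = $out eq $raw ? 'ok' : 'changed';

print "chars=$char_len bytes=$byte_len latin1_bytes=$latin1_len roundtrip=$roundtrip\n";
```

Counting under "use bytes", as in the modified encode() above, measures the two-byte form, while an explicit encode() before printing is another way to keep the bytes intact.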