MARC::File::XML and parsing.
Hi,

I have some problems with the MARC::File::XML parser. Take the two XML records below. I agree that there are odd characters in some subfields, but since those characters are valid UTF-8, I am wondering why MARC::File::XML drops them when parsing.

Is there a reason why MARC::File::XML accepts only a very strict subset of UTF-8 as valid (for instance no line breaks, no ...)? Couldn't it say: "OK, this is an XML record encoded in UTF-8, I take it for granted, no matter whether there are 'odd' characters"? IMHO this could be really big trouble for records in kanji or Indic scripts.

http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd" xmlns="http://www.loc.gov/MARC21/slim">
00150nx a2200073 4500 Nicolas Jérôme Traducteur 19980124afrey50 ba0 3568 NP

http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd" xmlns="http://www.loc.gov/MARC21/slim">
00151nx a2200073 4500 Guynemer Georges (1894-1917) 19980129afrey50 ba0 4642 NP

--
Henri Damien LAURENT and Paul POULAIN
Independent consultants in free software and library science (http://www.koha-fr.org)
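One thing that may be relevant to the question above: XML 1.0 itself forbids most control characters, even when they are perfectly valid UTF-8. Line breaks (U+000A, U+000D) and tabs are allowed, but something like U+001F is not. Whether that is what MARC::File::XML is reacting to in these records is an assumption, not something confirmed here. The sketch below uses only core Perl; `find_illegal_xml_chars` is a hypothetical helper name and the sample value is made up. It expects an already decoded character string.

```perl
use strict;
use warnings;

# Characters allowed by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# The negated class matches everything outside that set.
my $not_xml_char =
    qr/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/;

# Hypothetical helper: list the offset and code point of every character
# in a decoded string that XML 1.0 does not allow.
sub find_illegal_xml_chars {
    my ($text) = @_;
    my @hits;
    while ($text =~ /($not_xml_char)/g) {
        push @hits, sprintf('offset %d: U+%04X', pos($text) - 1, ord $1);
    }
    return @hits;
}

# Made-up subfield value containing a stray control character (U+001F).
my $subfield = "Nicolas\x{1F}J\x{E9}r\x{F4}me";
print "$_\n" for find_illegal_xml_chars($subfield);
```

Running a check like this over each subfield before building the XML would at least show which characters are at risk, independently of how the parser handles them.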
MARC::File::XML odd conversion to usmarc
XML record:

http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd" xmlns="http://www.loc.gov/MARC21/slim">
00154nx a2200073 4500 Cohen Bernard 1956- Traducteur 19980227afrey50 ba0 14286 NP

USMARC output:

00162lbána2200073uel4500246012900046001000600075152000700081. 1.a. .Cohen.bBernard.f1956-.4Traducteur. .a19980227afrey50 ba0.14286. .bNP..

Here is part of my code:

    my $record;
    eval {
        $record = MARC::Record->new_from_xml($marcxml,'UTF-8',"UNIMARCAUTH");
        $record->encoding('UTF-8');
    };
    if($@){
        print " There was some pb getting authority : ".$authid."\n";
    }
    my $leader=$record->leader;
    substr($leader,0,5)=' ';
    substr($leader,10,7)='22 ';
    $record->leader(substr($leader,0,24));
    print $record->as_usmarc();

This was not a single record, though; it was part of a batch, and it seems that trying to decode the MARC record on its own leads to problems.

Anyway, why does the leader change from

00154nx a2200073 4500

to

00162lbána2200073uel4500 ?

This leads to problems with MARC decoding. Is there something that can be done about it? Could MARC::File::XML be a little more verbose about errors?

--
Henri Damien LAURENT and Paul POULAIN
Independent consultants in free software and library science (http://www.koha-fr.org)
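A side note on the leader manipulation in the code above. The leader is a fixed-width 24-character field, and assigning through `substr` with a string shorter than the span it overwrites shrinks the leader and shifts every later position. The sketch below shows a width-preserving version in core Perl. It assumes the intent was to blank the record length (positions 0-4) and the base address (positions 12-16) while setting positions 10-11 to `22`; whether runs of spaces were lost from the code as posted is also an assumption. The sample leader is illustrative, with blanks filled in to reach 24 characters, and `blank_computed_fields` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: blank the computed parts of a 24-character leader
# without changing its width.
sub blank_computed_fields {
    my ($leader) = @_;
    die "leader must be 24 characters, got " . length($leader) . "\n"
        unless length($leader) == 24;

    # Each replacement is exactly as wide as the span it overwrites.
    substr($leader, 0, 5)  = ' ' x 5;          # record length
    substr($leader, 10, 7) = '22' . ' ' x 5;   # '22' plus blank base address

    return $leader;
}

# Illustrative leader (assumed spacing, not taken from the real record).
my $leader  = '00154nx  a2200073   4500';
my $blanked = blank_computed_fields($leader);
printf "[%s] length=%d\n", $blanked, length $blanked;
```

With same-width replacements, positions 5-9 and 17-23 keep their original values, which makes it easier to tell whether a later change in the leader comes from this code or from the conversion itself.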