MARC::File::XML and parsing.

2007-09-26 Thread Henri-Damien LAURENT
Hi,
I have some problems with the MARC::File::XML parser.

Take these two XML records.
I agree that there are odd characters in some subfields, but since
those characters are valid UTF-8, I wonder why MARC::File::XML drops
them when parsing.
Is there a reason why MARC::File::XML accepts only a very strict
subset of UTF-8 as valid (for instance no line breaks, no ...)?

Could it not say "OK, this is an XML record encoded in UTF-8; I take
that for granted, no matter whether it contains 'odd' characters"?
This could be really BIG trouble for kanji or Indic-language text, IMHO.
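
One thing worth ruling out is whether the "odd" characters are legal in XML at all. XML 1.0 accepts tab, line feed, carriage return and nearly everything from U+0020 upwards, but forbids the other C0 control characters, so a string can be valid UTF-8 and still be rejected by an XML parser. The following is only a minimal sketch, not a description of what MARC::File::XML does internally; the helper name `illegal_xml_chars` is made up for illustration, and the input is assumed to be an already-decoded character string.

```perl
use strict;
use warnings;

# Hypothetical helper: list the code points in $text that fall outside
# the XML 1.0 "Char" production (tab, LF, CR, U+0020-U+D7FF,
# U+E000-U+FFFD, U+10000-U+10FFFF).
sub illegal_xml_chars {
    my ($text) = @_;
    my @bad;
    while ($text =~ /([^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}])/g) {
        push @bad, sprintf('U+%04X', ord $1);
    }
    return @bad;
}

# A unit separator (U+001F) is illegal; accented letters and the
# trailing line feed are fine.
my @bad = illegal_xml_chars("Nicolas\x{1F}J\x{E9}r\x{F4}me\n");
print "illegal: @bad\n";   # prints "illegal: U+001F"
```

If this reports nothing for a subfield whose content still disappears, the cause lies somewhere other than XML character legality, for instance in an encoding conversion step.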



http://www.w3.org/2001/XMLSchema-instance";
 xsi:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/
standards/marcxml/schema/MARC21slim.xsd"
 xmlns="http://www.loc.gov/MARC21/slim";>

 00150nx  a2200073   4500 
 
   Nicolas
   Jérôme
   Traducteur
 
 
   19980124afrey50  ba0
 
 3568
 
   NP
 


http://www.w3.org/2001/XMLSchema-instance";
 xsi:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/
standards/marcxml/schema/MARC21slim.xsd"
 xmlns="http://www.loc.gov/MARC21/slim";>

 00151nx  a2200073   4500 
 
   Guynemer
   Georges
   (1894-1917)
 
 
   19980129afrey50  ba0
 
 4642
 
   NP
 


-- 
Henri Damien LAURENT and Paul POULAIN
Independent consultants
in free software and library science (http://www.koha-fr.org)




MARC::File::XML odd conversion to usmarc

2007-09-26 Thread Henri-Damien LAURENT
XML record:

http://www.w3.org/2001/XMLSchema-instance";
 xsi:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/
standards/marcxml/schema/MARC21slim.xsd"
 xmlns="http://www.loc.gov/MARC21/slim";>

 00154nx  a2200073   4500 
 
   Cohen
   Bernard
   1956-
   Traducteur
 
 
   19980227afrey50  ba0
 
 14286
 
   NP
 


00162lbána2200073uel4500246012900046001000600075152000700081.
1.a.  .Cohen.bBernard.f1956-.4Traducteur.  .a19980227afrey50
ba0.14286.  .bNP..

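Here is part of my code:

    use MARC::Record;
    use MARC::File::XML;

    my $record;
    eval {
        $record = MARC::Record->new_from_xml($marcxml, 'UTF-8', 'UNIMARCAUTH');
        $record->encoding('UTF-8');
    };
    if ($@) {
        # Report the parser's own message as well as the authority id.
        print "  There was some problem getting authority: $authid ($@)\n";
    }
    else {
        my $leader = $record->leader;
        # Replacement strings are padded to the width of the spans they
        # overwrite (5 and 7 characters), so the leader keeps its 24
        # characters.
        substr($leader, 0, 5, ' ' x 5);
        substr($leader, 10, 7, '22' . ' ' x 5);
        $record->leader(substr($leader, 0, 24));
        print $record->as_usmarc();
    }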

But it was not a single record; it was part of a batch.
And it seems that trying to decode the MARC record on its own leads to
problems.

Anyway, why does the leader change from 00154nx  a2200073   4500
to 00162lbána2200073uel4500?
This leads to problems with MARC decoding.
Is there something that can be done about it?
Could MARC::File::XML be a little more verbose about errors?
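
The leader question can be narrowed down without MARC::Record at all: an in-place edit of this kind keeps the leader at 24 characters only if each replacement string is exactly as wide as the span it overwrites. The sketch below uses the leader quoted above and compares padded replacements with one-space ones. The helper name `reset_leader` is hypothetical, and whether short replacements are the actual cause here is an assumption, since the archive may simply have collapsed runs of spaces in the posted code.

```perl
use strict;
use warnings;

# Hypothetical helper: blank positions 0-4 and rewrite positions 10-16
# as "22" plus five blanks, keeping the leader 24 characters long.
sub reset_leader {
    my ($leader) = @_;
    die "leader must be 24 characters long\n" unless length($leader) == 24;
    substr($leader, 0, 5, ' ' x 5);
    substr($leader, 10, 7, '22' . ' ' x 5);
    return $leader;
}

my $leader = '00154nx  a2200073   4500';
my $fixed  = reset_leader($leader);

# The same two edits with replacements narrower than the spans they
# overwrite: every later position shifts to the left.
my $short = $leader;
substr($short, 0, 5)  = ' ';
substr($short, 10, 7) = '22 ';

printf "padded: %d characters\n", length $fixed;
printf "short:  %d characters\n", length $short;
```

With the narrower replacements the string ends up shorter than 24 characters, so any code that later reads fixed leader positions would pick up the wrong bytes.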

-- 
Henri Damien LAURENT and Paul POULAIN
Independent consultants
in free software and library science (http://www.koha-fr.org)