Hi,

2010/11/14 Frédéric Demians <[email protected]>:
> MARCXML parsing is slow because MARC::File::XML uses a SAX parser to do the
> job and do some 'magic' encoding-decoding to-from MARC8--Galen could correct
> me if I'm wrong.

Properly used (i.e., with BinaryEncoding => utf8 when parsing known UTF-8
MARCXML records), MARC::File::XML doesn't automatically transcode from
MARC-8 to UTF-8, so that's a nonissue. Admittedly, there are still some
circumstances where MARC::File::XML does inappropriately try to inject a
MARC-8 to UTF-8 conversion. Patches to improve MARC::File::XML are welcome.

> But since records stored into Koha are cleanly UTF-8
> encoded, are well formed XML and respect a minimalist schema,

That is the ideal. In practice, Koha currently does not enforce either of
your two assumptions in that statement; patches to tighten that up would be
a good idea.

> we could parse
> them much more quickly directly in Perl.

I question whether a pure Perl implementation would be faster.
XML::LibXML::SAX, XML::SAX::Expat, and XML::SAX::ExpatXS have the advantage
that much of the parsing is handled by C code.

> I've done some experimentation. It
> works easily. This code could be ported in five minutes:
>
> http://tinyurl.com/3x3d6b9

Are you suggesting that we adopt yet another hand-crafted, pure Perl XML
parser, one that does not support namespaces (a lot of MARCXML data in the
wild does reference the marc namespace) and does not check for well-formed
XML, *and* adopt a new MARC module that appears to be all of a few days old
and lacks test cases for use in Koha? What you propose is interesting, and
I'm sure you'll pursue it, but it would need more time to bake.

On a more general note, XML parsing is a (mostly) solved problem in Perl. I
don't think the way forward is to interpose hand-crafted pure-Perl parsing
of the MARCXML. As an alternative approach that I think would bear fruit and
be less error-prone, we could try other standard XML parsers such as
XML::Twig.

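To illustrate what "properly used" might look like, here is a minimal
sketch. It assumes the record is already known to be UTF-8; the sample
record and the variable names are made up for illustration and are not
taken from Koha.

```perl
use strict;
use warnings;
use MARC::Record;
# Declare up front that records handled by this module are UTF-8.
use MARC::File::XML ( BinaryEncoding => 'utf8', RecordFormat => 'MARC21' );

# Hypothetical sample record, for illustration only. Note that it
# references the marc namespace, as much MARCXML in the wild does.
my $marcxml = <<'XML';
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam a2200000 a 4500</leader>
  <controlfield tag="001">12345</controlfield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Example title</subfield>
  </datafield>
</record>
XML

# Pass the encoding and record format explicitly when parsing.
my $record = MARC::Record->new_from_xml( $marcxml, 'UTF-8', 'MARC21' );

print $record->subfield( '245', 'a' ), "\n";
```

Whether this avoids every unwanted conversion depends on the installed
version of MARC::File::XML, per the caveat above.
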
Regards,

Galen
--
Galen Charlton
[email protected]
_______________________________________________
Koha-devel mailing list
[email protected]
http://lists.koha-community.org/cgi-bin/mailman/listinfo/koha-devel
website : http://www.koha-community.org/
git : http://git.koha-community.org/
bugs : http://bugs.koha-community.org/
