[Much of this discussion of record parsers has little to do with Solr/Lucene specifically, the subject under which the preceding discussion of record parsers appeared.]
Reply inline:

Previous Subject: Re: [Koha-devel] Search Engine Changes : let's get some solr

1.  GENERAL PURPOSE XML PARSER.

On Mon, November 15, 2010 13:58, Ian Walls wrote:

> Just to throw in on something I read earlier in this thread, I'd say
> that for a general practice with Koha going forward, we should pick a
> single XML parser that can handle arbitrary schemas, and use that.

Having a general purpose XML parser would be very useful as one step
towards greater generalisation and abstraction in Koha.  However,
picking a single XML parser for all use cases might be an optimisation
mistake which we would come to regret in future.

2.  METADATA SCHEMA AGNOSTIC RECORDS.

> I would very much like to make Koha not just MARC-agnostic, but
> metadata schema agnostic, and coding ourselves into a corner now (even
> for a noticeable performance boost) would make life difficult later.
> As I think the rest of the thread attests, there are other ways to
> improve our XML parsing.
>
> If this had already been resolved earlier in the conversation, I
> apologize for redundancy; I haven't had my morning coffee yet.

The issue of a general purpose XML parser had been considered
tangentially, but without the appropriate context of metadata schema
agnostic records.  I think that considering record parsers which are not
MARC or MARCXML specific is important for long term development.

2.1.  INTERNAL RECORD FORMATS.

For some future development, Koha should not be dependent upon a
metadata exchange record syntax for anything other than lossless data
input and data output.  An internal record syntax should be optimised
for particular library management system functions.  The general state
of Koha may not be ready for the work which would be required to ensure
that changing the base record format would be lossless.  However, we
should be enabling that future possibility by implementing abstraction
when opportunities arise.
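As a purely illustrative sketch of what schema agnostic record handling
looks like: a parser walks an arbitrary XML record and flattens it to
path/value pairs with no MARC-specific knowledge at all.  Koha itself is
Perl, where a widely tested module such as XML::LibXML would play this
role; the Python standard library is used here only for compactness, and
the record and function names are invented for the example.

```python
# Illustrative only: a schema-agnostic walk over an arbitrary XML record,
# flattening it to (path, value) pairs without any MARC-specific logic.
import xml.etree.ElementTree as ET

def flatten(elem, path=""):
    """Yield (path, text) pairs for every element with text content."""
    tag = elem.tag.split("}")[-1]          # drop any namespace prefix
    here = f"{path}/{tag}"
    if elem.text and elem.text.strip():
        yield (here, elem.text.strip())
    for child in elem:
        yield from flatten(child, here)

# A record in an arbitrary, non-MARC schema (hypothetical namespace).
sample = """<record xmlns="http://example.org/any-schema">
  <title>Exemple de titre</title>
  <creator><name>Dupont, Marie</name></creator>
</record>"""

pairs = list(flatten(ET.fromstring(sample)))
for p, v in pairs:
    print(p, "=", v)
```

The same walk works unchanged for MARCXML, MODS, or any other schema,
which is the point: indexing or display logic can be layered on top of a
generic traversal rather than baked into a format-specific parser.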
Frédéric Demians recognises the distinction between internal record use
within Koha and external record use for interfacing with the world.
Previous discussion in the "MARC record size limit" thread had also
considered non-XML record syntaxes such as YAML.

On Mon, November 15, 2010 05:56, Frédéric Demians wrote:

[...]

> It's a design choice.  MARCXML is the Koha internal serialization
> format for MARC records.  There is no obligation to conform to the
> MARC21slim schema.  We could even choose another serialization format,
> as has already been discussed.  biblioitems.marcxml isn't open to the
> wide.

[...]

> And we could benefit of it if pure Perl parsing is a real performance
> gain.

That is for a good reason.  However, the prospect of using a Koha
specific record syntax parser for record creation or modification scares
me.  I would much prefer somewhat lower efficiency with validity
constraints from a Perl module widely tested outside of Koha.

2.1.1.  REASON FOR INTERNAL RECORD FORMATS.

An example is a record format optimised for indexing, which would store
information such as the language of material in a clear and appropriate
place for indexing.  Records optimised for indexing would be different
from the primary form of the record optimised for editing, and from an
alternate form optimised for display.

MARC often uses one or more of several different places, with varying
forms of presentation, for the same information.  Examples include: the
language of material, which may be multiple and may refer to the
language from which the material was translated; the muddle of recording
content type, material type, carrier type and their various
relationships; the muddle of date forms and similar numeric and
sequential designators; the muddle of ordered classification and similar
hierarchical designators; transcribed and natural language record
content with no controlled vocabulary; etc.  [In the interest of time, I
omit providing detailed examples.]

Consider the case of language of material.
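As a hedged sketch of that case: the snippet below pulls language of
material out of a MARCXML record into a flat, index-ready structure.
The field meanings follow MARC 21 (008/35-37 is the language code; 041
$a holds languages of the text and $h the original language of a
translation), but the function name, record sample, and output shape are
invented for illustration and are not Koha code.

```python
# Sketch: normalising "language of material" from MARCXML into a flat
# structure optimised for indexing.  MARC 21 semantics: 008/35-37 is the
# language code; 041 $a = language(s) of the text, $h = original language.
import xml.etree.ElementTree as ET

NS = {"m": "http://www.loc.gov/MARC21/slim"}

def index_languages(marcxml):
    """Return an index-ready dict of language codes found in the record."""
    rec = ET.fromstring(marcxml)
    langs, original = set(), set()
    for cf in rec.findall('m:controlfield[@tag="008"]', NS):
        text = cf.text or ""
        code = text[35:38].strip() if len(text) >= 38 else ""
        if code and code != "|||":
            langs.add(code)
    for df in rec.findall('m:datafield[@tag="041"]', NS):
        for sf in df.findall('m:subfield', NS):
            if sf.get("code") == "a" and sf.text:
                langs.add(sf.text.strip())
            elif sf.get("code") == "h" and sf.text:
                original.add(sf.text.strip())
    return {"lang": sorted(langs), "translated_from": sorted(original)}

sample = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="008">101115s2010    xxu           000 0 eng d</controlfield>
  <datafield tag="041" ind1="1" ind2=" ">
    <subfield code="a">eng</subfield>
    <subfield code="h">fre</subfield>
  </datafield>
</record>"""

print(index_languages(sample))
```

Even this small sketch must reconcile two different places where MARC
records the same fact, which is exactly why a record form normalised for
indexing is easier to search against than MARC itself.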
Enhancing records to use fixed fields or fixed subfields for better
indexing is insufficient to record the complexity of language use cases,
and XPath indexing of MARC records cannot cope well enough with all the
possibilities.  The information can be parsed out of MARC records
reliably into a record specially optimised for indexing; storing the
information in MARC itself in an easily indexable manner is the problem.

3.  ENABLING FUTURE DEVELOPMENT.

Generalising and abstracting record parsing would enable future
development, such as records normalised for a particular purpose,
without being dependent upon MARC.  Developments which enable future
work do not require a commitment to a particular development idea, but
they help free development from practical constraints by leaving less
work to be done for some future development.

Thomas Dukleth
Agogme
109 E 9th Street, 3D
New York, NY  10003
USA
http://www.agogme.com
+1 212-674-3783

_______________________________________________
Koha-devel mailing list
[email protected]
http://lists.koha-community.org/cgi-bin/mailman/listinfo/koha-devel
website : http://www.koha-community.org/
git : http://git.koha-community.org/
bugs : http://bugs.koha-community.org/
