Hi all,

as discussed in the last IRC dev meeting, I'm working on a port of the dspace-discovery idea to the JSPUI. As a side effect I'm trying to better understand and (if possible) improve Discovery itself...

I have mostly finished replacing the DSpace Lucene engine with the one provided by Solr/Discovery; faceting on search results works well, also for metadata with an authority key. Now I'm thinking about using Solr to replace the DSpace browse system. I'm facing several issues, which I'll try to summarize by presenting different strategies.

Using Solr for browsing has, IMHO, the following pros:
- we would use an external, well-established library to manage our browse system
- we would use a unified approach for search and browse
- performance? Probably, but I have no real data comparing indexing and query time between our current browse system and Solr

1) Solr facets are not good for pagination:
- As far as I know there is no out-of-the-box way to answer the question: "how many facets do I have for this field in this query?"
- you can navigate the facet results using offset and limit (show facets from position X to position Y), but you can't ask to start with the facet "My Value" or from a facet that starts with the letter "X"

This means that with Solr we are not able to reproduce the features of our current browse system: no total count of authors, keywords, etc. and no jump to a position in the index... So if we use this approach we should either remove some existing functionality or look into the Solr facet component to see whether we can improve it and contribute back to the Solr community.

2) Using the Solr TermsComponent: during my exploration I found this new component in Solr 1.4
http://wiki.apache.org/solr/TermsComponent
It allows great pagination on field terms (total count, offset, limit, jump-to are all supported)... but it doesn't work in combination with a query. This means that we are not able to use it to provide browsing of metadata values within a community or collection.

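To make the difference between points 1 and 2 concrete, here is a minimal sketch of the two kinds of request. It only builds the URLs; the Solr location, the field names (author_filter, location_comm) and the helper function names are assumptions for illustration, not taken from the Discovery code. The facet request can be scoped with a filter query but pages only by position; facet.prefix exists, but it restricts the values to a prefix rather than jumping to a position in the full list. The terms request can start from a given term via terms.lower but carries no query or filter.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # hypothetical Solr location


def facet_page_url(field, offset, limit, scope_filter=None):
    """Facet-based browse: page by position, optionally scoped by a filter query."""
    params = [
        ("q", "*:*"),
        ("rows", 0),                 # we only want the facet values, not the items
        ("facet", "true"),
        ("facet.field", field),
        ("facet.sort", "index"),     # alphabetical order instead of order by count
        ("facet.mincount", 1),
        ("facet.offset", offset),    # position X
        ("facet.limit", limit),      # page size
    ]
    if scope_filter:
        # e.g. restrict the browse to one community (field name is an assumption)
        params.append(("fq", scope_filter))
    return SOLR_BASE + "/select?" + urlencode(params)


def terms_jump_url(field, start_term, limit):
    """TermsComponent-based browse: start from a given term, but no query scoping."""
    params = [
        ("terms", "true"),
        ("terms.fl", field),
        ("terms.sort", "index"),
        ("terms.lower", start_term),   # jump to this value
        ("terms.lower.incl", "true"),
        ("terms.limit", limit),
    ]
    return SOLR_BASE + "/terms?" + urlencode(params)


if __name__ == "__main__":
    # Third page of 20 authors within the community with id 64
    print(facet_page_url("author_filter", 40, 20, "location_comm:64"))
    # 20 authors starting from "Smith", repository-wide only
    print(terms_jump_url("author_filter", "Smith", 20))
```

The /terms path assumes a request handler configured with the TermsComponent, as in the Solr example configuration.
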
We could work around this limit by making several copies of the "browse metadata" in Solr fields specific to a community or collection, i.e. we would have Solr fields like author_m64 (author in the community with id 64) and so on. I'm not sure whether there are issues with putting so many fields in a document. For each metadata browse we would get one additional field for every community and collection, so in a repository with a high number of communities/collections, for example 200 communities and 2k collections, we would get documents with potentially 2.2K fields for each browse.

3) The last option I see is to add a new core to Solr (i.e. browse). The Solr "browse document" could have the following fields:
- browse-type (author, keywords, publishers, etc.)
- browse-unique-value (the value to look up)
- value (the value to display)
- authority_key (the authority key, if any)
- sort (the sort value)
- item_id (repeatable, the ids of all items that use this term)

Using a "browse-centric" Solr core instead of an "item-centric" one would simplify and resolve all our pagination issues. On the other hand, new issues arise related to filling this new index and keeping it up to date... After a first rough evaluation I think we would need as many Solr inserts/updates as the current DB browse inserts/updates.

Pros of this strategy vs. the previous ones:
- integration of additional information: indexing of an "authority source" could be easily integrated. If you have a directory of institutional authors and you want to put all of them in the browse index, you can easily do this even if there is no item for an "institutional author". The same applies to subject classifications, etc.

Cons:
- there are no faceting opportunities: we can't filter the authors in the repository by a specific topic (based on item keywords)

My preference is for solution 2, but I will be happy to hear other ideas.

Andrea

Dott.
Andrea Bollini
Project Manager, IT Architect & Systems Integrator
Sezione Servizi per le Biblioteche e l'Editoria Elettronica
CILEA, http://www.cilea.it
tel. +39 06-59292853
cel. +39 348-8277525

_______________________________________________
Dspace-devel mailing list
https://lists.sourceforge.net/lists/listinfo/dspace-devel
