[Dspace-devel] Different strategies to implement metadata browse with Discovery

Andrea Bollini Sun, 22 Aug 2010 05:51:05 -0700

  Hi all,
as discussed in the last IRC dev meeting, I'm working on porting the 
dspace-discovery idea to the JSPUI.
As a side effect, I'm trying to better understand and (if possible) 
improve Discovery itself...


I have mostly completed the replacement of the DSpace Lucene engine with 
the one provided by SOLR/Discovery; faceting on search results works 
well, also for metadata with an authority key.
Now I'm thinking about using SOLR to replace the DSpace browse system. 
I'm facing several issues, which I will try to summarize by showing the 
different strategies.
Using SOLR for browsing has IMHO the following pros:
- we would use a well-established external library to manage our browse 
system
- we would use a unified approach for search and browse
- performance? Probably, but I have no real data comparing indexing and 
query time between our current browse system and SOLR

The strategies, with the issues I see in each, are:


1) Using SOLR facets. They are not good for pagination:
  - As far as I know there is no out-of-the-box way to answer this 
question: "how many facet values do I have for this field in this query?"
  - you can navigate the facet results using offset and limit (show 
facets from position X to position Y) but you can't ask to start from the 
facet "My Value" or from the first facet that starts with the letter "X"

This means that with SOLR facets we are not able to reproduce the 
features of our current browse system: no total count of authors, 
keywords, etc. and no jump to a position in the index...
So if we use this approach we should either remove some existing 
functionality or look into the SOLR facet component to see if we are 
able to improve it and contribute back to the SOLR community.
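
To make the offset/limit navigation concrete, here is a minimal sketch of 
the request parameters for one page of facet values. The facet.* names 
are standard SOLR faceting parameters; the "location:m64" scope query and 
the /select path are assumptions for illustration only, not taken from 
the Discovery code.

```python
from urllib.parse import urlencode


def facet_page_params(field, scope_query, offset, limit):
    """Build SOLR request parameters for one page of facet values."""
    return {
        "q": scope_query,        # facets are computed against this query
        "rows": 0,               # only facet counts are needed, no documents
        "facet": "true",
        "facet.field": field,
        "facet.sort": "index",   # alphabetical order, like a browse list
        "facet.mincount": 1,     # skip values that match no item
        "facet.offset": offset,  # position X in the facet list
        "facet.limit": limit,    # page size
    }


# Hypothetical scope: items inside the community with id 64.
params = facet_page_params("author", "location:m64", offset=40, limit=20)
print("/select?" + urlencode(params))
```

Note that the page is addressed by position only: nothing in these 
parameters says "start at My Value", and the response carries no total 
number of distinct values, which are the two gaps described above.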


2) Using the SOLR TermsComponent: during my exploration I found this new 
component in SOLR 1.4
http://wiki.apache.org/solr/TermsComponent
It allows good pagination over field terms (total count, offset, limit 
and "jump to" are all supported)... but it can't be combined with a 
query.

This means that we are not able to use it to browse metadata values 
within a community or collection.
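
As a rough sketch of what a "jump to" request against the TermsComponent 
could look like: the terms.* names below are parameters documented on the 
wiki page above, while the /terms handler path is an assumption (it is 
the name used in the SOLR example configuration) and "author" is a 
hypothetical field name.

```python
from urllib.parse import urlencode


def terms_page_params(field, start_from="", limit=20):
    """Build TermsComponent parameters for one page of a field's terms."""
    params = {
        "terms": "true",
        "terms.fl": field,       # field whose indexed terms are listed
        "terms.sort": "index",   # alphabetical order instead of by count
        "terms.limit": limit,    # page size
    }
    if start_from:
        # "Jump to": list terms starting at this value (inclusive).
        params["terms.lower"] = start_from
        params["terms.lower.incl"] = "true"
    return params


print("/terms?" + urlencode(terms_page_params("author", start_from="Smith")))
```

There is no query parameter in this request that restricts the terms to 
the items of one community or collection, which is the limitation 
described above.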
We could work around this limit by making several copies of the "browse 
metadata" in SOLR fields specific to a community or collection, i.e. we 
would have SOLR fields like author_m64 (author in the community with 
id 64) and so on.
I'm not sure whether having so many fields per document causes problems. 
For each metadata browse we would get one additional field for every 
community and collection, so in a repository with a high number of 
communities/collections, for example 200 communities and 2K collections, 
we would get documents with potentially 2.2K fields for each browse.
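
A small sketch of the naming scheme and of the field count it leads to. 
The "_m<id>" suffix for communities follows the author_m64 example; the 
"_l<id>" suffix for collections is an assumption made here for 
illustration.

```python
def scoped_browse_fields(browse, community_ids, collection_ids):
    """Return the scope-specific field names for one browse index."""
    fields = [f"{browse}_m{cid}" for cid in community_ids]
    # "_l" for collections is a hypothetical suffix, not from the post.
    fields += [f"{browse}_l{cid}" for cid in collection_ids]
    return fields


# One item in collection 12, whose parent communities are 64 and 3.
print(scoped_browse_fields("author", [64, 3], [12]))
# ['author_m64', 'author_m3', 'author_l12']

# Upper bound for a repository with 200 communities and 2000 collections.
all_fields = scoped_browse_fields("author", range(200), range(2000))
print(len(all_fields))  # 2200
```

A single item only fills the fields of its own collections and their 
parent communities; the 2.2K figure is the number of distinct field 
names that can exist in the index for one browse.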

3) The last option that I see is to add a new core to SOLR (i.e. 
browse). The SOLR "browse document" could have the following fields:
- browse-type (author, keywords, publishers, etc.)
- browse-unique-value (the value to look up)
- value (the value to display)
- authority_key (the authority key, if any)
- sort (the sort value)
- item_id (repeatable, the ids of all items that use this term)
Using a "browse centric" SOLR core instead of an "item centric" core 
would simplify things and resolve all our pagination issues. On the 
other hand, new issues arise related to filling this new index and 
keeping it up to date...
After a first rough evaluation I think that we need as many "SOLR 
insert/update" operations as the current db browse insert/update...
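
To illustrate the "browse centric" idea, here is an in-memory sketch 
(plain Python, no SOLR involved) that groups item metadata into browse 
documents with the fields listed above and then pages through them. 
Using the authority key, or else the lowercased value, as 
browse-unique-value, and the lowercased value as sort key, are 
assumptions made only for this example.

```python
from bisect import bisect_left


def build_browse_docs(browse_type, entries):
    """entries: iterable of (item_id, display value, authority key or None)."""
    docs = {}
    for item_id, value, authority in entries:
        normalized = value.strip().lower()
        unique = authority or normalized      # assumption: see lead-in
        doc = docs.setdefault(unique, {
            "browse-type": browse_type,
            "browse-unique-value": unique,
            "value": value,
            "authority_key": authority,
            "sort": normalized,               # assumption: simple sort key
            "item_id": [],
        })
        if item_id not in doc["item_id"]:
            doc["item_id"].append(item_id)    # one update per item/value pair
    return sorted(docs.values(), key=lambda d: d["sort"])


def page(docs, start_value="", rows=2):
    """Return (total count, one page starting at or after start_value)."""
    keys = [d["sort"] for d in docs]
    pos = bisect_left(keys, start_value.lower())
    return len(docs), docs[pos:pos + rows]


entries = [
    (1, "Smith, John", "auth-7"),
    (2, "Smith, John", "auth-7"),
    (2, "Bianchi, Anna", None),
    (3, "Rossi, Mario", None),
]
docs = build_browse_docs("author", entries)
total, hits = page(docs, start_value="r")
print(total, [d["value"] for d in hits])
# 3 ['Rossi, Mario', 'Smith, John']
```

Because every browse entry is its own document, the total count and the 
"jump to" position become ordinary operations on a sorted list of 
documents, at the cost of touching one browse document for each 
item/value pair that changes.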

Pros of this strategy vs. the previous ones:
- integration of additional information: indexing of an "authority 
source" could be easily integrated. If you have a directory of 
institutional authors and you want to put all the "institutional 
authors" in the browse index, you can easily accomplish this even if 
there is no item for an "institutional author". The same applies to 
subject classifications, etc.
Cons:
- there are no faceting opportunities: we can't filter the authors in 
the repository by a specific topic (based on item keywords)

My preference is for solution 2, but I will be happy to hear other ideas.

Andrea

Dott. Andrea Bollini
Project Manager, IT Architect & Systems Integrator
Sezione Servizi per le Biblioteche e l'Editoria Elettronica
CILEA, http://www.cilea.it
tel. +39 06-59292853
cel. +39 348-8277525
