Great discussion here, although I believe it belongs on the Solr user list 
because we're not talking about development of Solr itself.  I'm very tempted 
to cross-post, but I believe that's discouraged, so I won't.

> Still wondering where the Solr Community will bring this in the future?

I strongly believe that Solr should focus on what it does best (being a search 
engine) and not on pipelines / data acquisition, which are really a separate 
concern that is useful without Solr -- other apps could use such pipelines.  
This is a chief concern I have with the DIH.

By the way, I've used Endeca (a long-time commercial faceted search vendor), 
which has its own pipeline called "Forge".  I used it on a project in which the 
pipelines were extremely extensive, getting data from a dozen-plus sources of 
varying flavors and manipulating the data in various ways.  It addresses a key 
need, but the implementation is poor IMO.  The interesting parts of it 
pertained to how it supports joins from sub-pipelines (i.e. chains of steps).  
I've not yet been in the same situation with Solr.  I've gotten by with some 
basic stuff thrown together (shell scripts w/ XSLT) or simple DIH uses.

I've been maintaining a list of software that could be used for a data pipeline 
for getting data into Solr.  Here it is:
* Calabash (XProc)
* OpenPipe
* ManifoldCF
* ESBs (various options; includes Spring-Integration Framework)

I don't have UIMA on this list since I think it's focused more on extracting 
data from unstructured text than on being a solid pipeline first & foremost.

Roland, if your assessment of OpenPipeline going nowhere is true, then that's 
disappointing news.

It's not clear to me that a data pipeline needs to be different from what ESBs 
do.  Some pieces are missing, but 80% of what's needed is there.  When I next 
have a project getting data from many places, I'll be able to think through 
this more.

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/



