Great discussion here, although I do believe it belongs on the Solr user list, since we're not talking about development of Solr itself. I'm very tempted to cross-post, but I believe that's discouraged, so I won't.

> Still wondering where the Solr Community will bring this in the future?

I strongly believe that Solr should focus on what it does best (being a search engine) and not on pipelines / data acquisition, which is really a separate concern that is useful without Solr; other apps could use such pipelines. This is a chief concern I have with the DIH.

By the way, I've used Endeca (a long-time commercial faceted search vendor), which has its own pipeline called "Forge". I used it on a project in which the pipelines were extremely extensive, getting data from a dozen-plus sources of varying flavors and manipulating the data in various ways. It addresses a key need, but the implementation is poor IMO. The interesting parts of it pertained to how it supports joins from sub-pipelines (i.e. chains of steps).

I've not yet been in the same situation with Solr. I've gotten by with some basic stuff thrown together (shell scripts w/ XSLT) or simple DIH uses.

I've been maintaining a list of software that could be used for a data pipeline for getting data into Solr. Here it is:
* Calabash (XProc)
* OpenPipe
* ManifoldCF
* ESBs (various options; includes Spring-Integration Framework)

I don't have UIMA on this list since I think it's more focused on extracting data from unstructured text than on being a solid pipeline first & foremost.

Roland, if your assessment of OpenPipeline going nowhere is true, then that's disappointing news.

It's not clear to me that a data pipeline needs to be different than what ESBs do. Some pieces are missing but 80% of what's needed is there. When I next have a project getting data from many places I'll be able to think through this more.

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/