[jira] Commented: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA

Tommaso Teofili (JIRA) Wed, 27 Oct 2010 06:52:48 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925371#action_12925371
 ]


Tommaso Teofili commented on SOLR-2129:
---------------------------------------

bq. Try to reuse the same syntax as the mapping in the ExtractingRequestHandler.

ok, I added the <lib> tag and will commit a new patch when I'm finished with 
these changes

bq. I've been struggling with these kinds of questions a lot lately. That is, 
the marriage of two projects. Where should the code go? Setting up another ASF 
project is a pain in the amount of hoops to jump through. Apache Labs doesn't 
cut it for a number of reasons. Hosting on Github or Google Code is OK, but 
loses the ASF community aspect. Sigh.

I agree with your point; I don't think it's easy to come up with a good, 
general answer for such situations.

One general solution that comes to mind is a single, broad-purpose ASF project 
hosting integrations between many different ASF projects. That could give two 
projects that want to "marry" a common base, but it might be too general and 
hard to maintain from a community point of view (e.g. should all the Lucene 
committers also commit on the "integrations" project just because someone 
integrated it with UIMA?). Another option would be to require the two marrying 
projects to comply with a standard (e.g. CMIS), so that a specialized 
"connector" would no longer be needed, but I don't think that's always 
possible since it could require a huge effort.

In this particular case, in my opinion, the code should go into whichever 
project owns the "pipeline" being changed/enhanced. In this Solr-UIMA 
integration we're adding a step to the Solr indexing process via an 
UpdateRequestProcessor, so I think that should be part of the Solr codebase. 
With the SolrCASConsumer, on the other hand, we'd be adding a (final) Consumer 
to the UIMA pipeline, so that should be part of the UIMA codebase.
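
To illustrate the "extra step in the indexing process" idea, here is a minimal 
sketch in plain Java. It deliberately does not use the real Solr or UIMA 
classes: Processor, Analyzer, EnrichingProcessor and IndexingProcessor are 
hypothetical stand-ins, and the sentence-counting analyzer is only a toy 
placeholder for a real analysis engine. It just shows a processor that adds 
extracted fields to a document before handing it to the next step.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: none of these types are Solr or UIMA classes.
public class EnrichmentChainSketch {

    /** Hypothetical stand-in for one step of an indexing chain. */
    static abstract class Processor {
        private final Processor next;

        Processor(Processor next) {
            this.next = next;
        }

        /** Default behaviour: hand the document to the next step, if any. */
        void processAdd(Map<String, Object> doc) {
            if (next != null) {
                next.processAdd(doc);
            }
        }
    }

    /** Hypothetical stand-in for an analysis engine that turns text into metadata. */
    interface Analyzer {
        Map<String, Object> analyze(String text);
    }

    /** Enriches the document with extracted fields, then passes it on. */
    static class EnrichingProcessor extends Processor {
        private final Analyzer analyzer;
        private final String sourceField;

        EnrichingProcessor(Analyzer analyzer, String sourceField, Processor next) {
            super(next);
            this.analyzer = analyzer;
            this.sourceField = sourceField;
        }

        @Override
        void processAdd(Map<String, Object> doc) {
            Object text = doc.get(sourceField);
            if (text != null) {
                doc.putAll(analyzer.analyze(text.toString()));
            }
            super.processAdd(doc);
        }
    }

    /** Last step of the chain: records documents instead of writing an index. */
    static class IndexingProcessor extends Processor {
        final List<Map<String, Object>> indexed = new ArrayList<>();

        IndexingProcessor() {
            super(null);
        }

        @Override
        void processAdd(Map<String, Object> doc) {
            indexed.add(doc);
        }
    }

    /** Runs one document through the two-step chain and returns what was "indexed". */
    static Map<String, Object> demo() {
        // Toy analyzer (an assumption for illustration): counts sentences
        // by splitting the text on '.', '!' or '?'.
        Analyzer sentenceCounter = text -> {
            Map<String, Object> fields = new LinkedHashMap<>();
            fields.put("sentence_count", text.split("[.!?]\\s*").length);
            return fields;
        };

        IndexingProcessor indexer = new IndexingProcessor();
        Processor chain = new EnrichingProcessor(sentenceCounter, "text", indexer);

        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("text", "Solr meets UIMA. It works.");
        chain.processAdd(doc);
        return indexer.indexed.get(0);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of the sketch is only the ownership argument: the enriching step 
lives inside the indexing chain, so it changes the indexing side's pipeline 
rather than the analysis side's.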


> Provide a Solr module for dynamic metadata extraction/indexing with Apache 
> UIMA
> -------------------------------------------------------------------------------
>
>                 Key: SOLR-2129
>                 URL: https://issues.apache.org/jira/browse/SOLR-2129
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Tommaso Teofili
>            Assignee: Robert Muir
>         Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, 
> SOLR-2129.patch
>
>
> Provide components to enable Apache UIMA automatic metadata extraction to be 
> exploited when indexing documents.
> The purpose of this is to get unstructured information "inside" a document 
> and create structured metadata (as fields) to enrich each document.
> Basically this can be done with a custom UpdateRequestProcessor which 
> triggers UIMA while indexing documents.
> The basic UIMA implementation of UpdateRequestProcessor extracts sentences 
> (with a tokenizer and a hidden Markov model tagger), named entities, 
> language, suggested category, keywords and concepts (exploiting external 
> services from OpenCalais and AlchemyAPI). Such an implementation can be 
> easily extended by adding or selecting different UIMA analysis engines, 
> either from UIMA repositories on the web or by creating new ones from scratch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


