[
https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925371#action_12925371
]
Tommaso Teofili commented on SOLR-2129:
---------------------------------------
bq. Try to reuse the same syntax as the mapping in the ExtractingRequestHandler.
OK, I added the <lib> tag and will commit a new patch when I'm finished with
these changes.
bq. I've been struggling with these kinds of questions a lot lately. That is,
the marriage of two projects. Where should the code go? Setting up another ASF
project is a pain in the amount of hoops to jump through. Apache Labs doesn't
cut it for a number of reasons. Hosting on Github or Google Code is OK, but
loses the ASF community aspect. Sigh.
I agree with your point; I don't think it's easy to come up with a good,
general, final answer for such situations.
One general solution that comes to mind is establishing a single wide-purpose
ASF project that hosts integrations between many different ASF projects. That
could lay the groundwork for two projects that want to "marry", but it could
be too general and hard to maintain from a community point of view (e.g.,
should all the Lucene committers also commit on the "integrations" project
just because someone integrated it with UIMA?). Another option would be to
require the two marrying projects to conform to a standard (e.g., CMIS), so
that developing a specialized "connector" would no longer be needed, but I
don't think that's always possible since it could require a huge effort.
In this particular case, in my opinion, the code should go into whichever
project owns the "pipeline" being changed/enhanced. In this Solr-UIMA
integration we're adding a step to the Solr indexing process via an
UpdateRequestProcessor, so I think it should be part of the Solr codebase;
the SolrCASConsumer, on the other hand, adds a (final) Consumer to the UIMA
pipeline, so it should be part of the UIMA codebase.
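
To make the "adding a step to the indexing process" idea concrete, here is a
minimal, self-contained sketch of a processor chain in which an enrichment
step adds metadata fields before the document reaches the final indexing
step. It deliberately does not use the real Solr or UIMA APIs: the names
EnrichmentChainSketch, Processor, EnrichingProcessor, IndexingProcessor and
fakeAnalyze are all hypothetical, and the "analysis" (collecting capitalized
tokens) is only a placeholder for whatever a real analysis engine would
return.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-ins; NOT the real Solr or UIMA classes.
class EnrichmentChainSketch {

    /** One step of the indexing pipeline; by default it just forwards the document. */
    static abstract class Processor {
        private final Processor next;

        Processor(Processor next) {
            this.next = next;
        }

        void processAdd(Map<String, Object> doc) {
            if (next != null) {
                next.processAdd(doc);
            }
        }
    }

    /** Placeholder "analysis engine": treats capitalized tokens as entities. */
    static List<String> fakeAnalyze(String text) {
        List<String> found = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty() && Character.isUpperCase(token.charAt(0))) {
                found.add(token);
            }
        }
        return found;
    }

    /** The added step: enriches the document with extracted metadata, then forwards it. */
    static class EnrichingProcessor extends Processor {
        EnrichingProcessor(Processor next) {
            super(next);
        }

        @Override
        void processAdd(Map<String, Object> doc) {
            Object text = doc.get("text");
            if (text instanceof String) {
                doc.put("entities", fakeAnalyze((String) text));
            }
            super.processAdd(doc);
        }
    }

    /** Final step: stores the (already enriched) document in an in-memory "index". */
    static class IndexingProcessor extends Processor {
        final List<Map<String, Object>> index = new ArrayList<>();

        IndexingProcessor() {
            super(null);
        }

        @Override
        void processAdd(Map<String, Object> doc) {
            index.add(doc);
        }
    }

    public static void main(String[] args) {
        IndexingProcessor sink = new IndexingProcessor();
        Processor chain = new EnrichingProcessor(sink);

        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("text", "Solr meets UIMA");
        chain.processAdd(doc);

        System.out.println(sink.index.get(0));
    }
}
```

The point of the sketch is only the placement: the enrichment runs inside the
indexing pipeline itself, which is why that piece would live with the project
that owns that pipeline.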
> Provide a Solr module for dynamic metadata extraction/indexing with Apache
> UIMA
> -------------------------------------------------------------------------------
>
> Key: SOLR-2129
> URL: https://issues.apache.org/jira/browse/SOLR-2129
> Project: Solr
> Issue Type: New Feature
> Reporter: Tommaso Teofili
> Assignee: Robert Muir
> Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch,
> SOLR-2129.patch
>
>
> Provide components to enable Apache UIMA automatic metadata extraction to be
> exploited when indexing documents.
> The purpose of this is to get unstructured information "inside" a document
> and create structured metadata (as fields) to enrich each document.
> Basically this can be done with a custom UpdateRequestProcessor which
> triggers UIMA while indexing documents.
> The basic UIMA implementation of UpdateRequestProcessor extracts sentences
> (with a tokenizer and a hidden Markov model tagger), named entities,
> language, suggested category, keywords and concepts (exploiting external
> services from OpenCalais and AlchemyAPI). Such an implementation can be
> easily extended by adding or selecting different UIMA analysis engines,
> either from UIMA repositories on the web or by creating new ones from scratch.