Have a look at the TeeTokenFilter and the SinkTokenizer. You could extend them to look each token up in your lists and, on a match, add the token to the sink, which then lets you index a separate field containing your named entities. The TeeTokenFilter and SinkTokenizer are located in the contrib/analysis package of the latest Lucene release. Another option is to implement a TokenFilter that adds a payload to a term whenever it comes across a named entity.
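
To make the list-lookup idea concrete, here is a minimal sketch in plain Java. It deliberately does not use the Lucene API: the class EntityTee, its Split result type, and the add/split methods are all hypothetical names made up for illustration, and the tokenization is a naive split on non-alphanumeric characters. It only shows the shape of the approach: every token goes to the main content stream, and tokens found in a list are also copied to a per-type "sink" that you could index as a separate field.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of Lucene.
public class EntityTee {

    /** Result of one pass: ordinary content tokens plus entity tokens grouped by type. */
    public static final class Split {
        public final List<String> content = new ArrayList<>();
        public final Map<String, List<String>> entities = new TreeMap<>();
    }

    // Assumed gazetteer: lowercased term -> entity type (e.g. "nile" -> "river").
    private final Map<String, String> gazetteer = new HashMap<>();

    /** Registers a list of names under one entity type. */
    public EntityTee add(String type, String... names) {
        for (String name : names) {
            gazetteer.put(name.toLowerCase(Locale.ROOT), type);
        }
        return this;
    }

    /** Sends every token to the content stream and copies list matches to the sink for their type. */
    public Split split(String text) {
        Split out = new Split();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // leading delimiter produces an empty first element
            }
            String term = raw.toLowerCase(Locale.ROOT);
            out.content.add(term);
            String type = gazetteer.get(term);
            if (type != null) {
                out.entities.computeIfAbsent(type, k -> new ArrayList<>()).add(term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        EntityTee tee = new EntityTee()
                .add("river", "Nile")
                .add("country", "Ethiopia");
        Split s = tee.split("the source of Nile is Ethiopia");
        System.out.println("content:  " + s.content);
        System.out.println("entities: " + s.entities);
    }
}
```

In a real analyzer the same lookup would sit inside a TokenFilter, and each entity type would most likely become its own field (or a payload on the term). Note that a single-token lookup like this cannot match multi-word names, which is one reason list-based approaches are brittle.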

You might also look into preprocessing with a tool such as OpenNLP or LingPipe, which can go beyond simple list lookup for named entities. List-based approaches are useful, but they also tend to be brittle.

<shameless_somewhat_self_serving_but_hopefully_useful_plug>
Using OpenNLP is described in chapter 5 of my book (http://manning.com/ingersoll/), and I believe Tom (my coauthor) even has code in there for plugging OpenNLP into the Lucene analysis process.
</shameless_somewhat_self_serving_but_hopefully_useful_plug>

On Mar 3, 2009, at 1:13 AM, Seid Mohammed wrote:

I want to index document contents in two ways: one as simple
content, and the other as named entities.
The scenario is like this.
If I have this document, "the source of Nile is Ethiopia",
then I want to index "source" as normal content, "Nile" as a river
name, and "Ethiopia" as a country name, so that later, if I ask the question
"where is the source of Nile", it should retrieve Ethiopia as the
answer.

Note: I will have lists of river names, country names, ... so that
during indexing I can compare every word of a document with my lists.

thanks a lot

Seid M
--
"RABI ZIDNI ILMA"

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search

