Re: use Lucene to index sentences

Marc Hadfield Mon, 06 Feb 2006 13:39:50 -0800

Hi AJ -

Depending on your need, you could create a lucene document for each sentence (in which case searching and returning sentences is trivial), or create a lucene document for each of your documents, with embedded sentence start/stop markers (as a special symbol). or, instead of a special symbol, you can increase the token position (the position increment) after each end-of-sentence so that there is a large gap between sentences -- this will give higher scores to intra-sentence matches.
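To make the position-gap idea concrete, here is a minimal sketch in plain Python (not Lucene's API). The names `SENTENCE_GAP`, `split_sentences`, `index_positions` and `within` are hypothetical, the sentence splitter is deliberately naive, and the gap of 100 is an assumed value; anything larger than the slop you query with should behave the same way.

```python
import re
from collections import defaultdict

SENTENCE_GAP = 100  # assumed value; must exceed the slop used at query time


def split_sentences(text):
    # Naive splitter, for illustration only: break after ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def index_positions(text, gap=SENTENCE_GAP):
    """Map each term to its token positions, leaving a gap after every sentence."""
    positions = defaultdict(list)
    pos = 0
    for sentence in split_sentences(text):
        for token in re.findall(r"\w+", sentence.lower()):
            positions[token].append(pos)
            pos += 1
        pos += gap  # push the next sentence far away from this one
    return positions


def within(positions, a, b, slop):
    """True if terms a and b occur within `slop` positions of each other."""
    return any(
        abs(pa - pb) <= slop
        for pa in positions.get(a, [])
        for pb in positions.get(b, [])
    )


idx = index_positions("Lucene indexes text. Sentences are short.")
print(within(idx, "lucene", "text", 5))     # same sentence
print(within(idx, "text", "sentences", 5))  # adjacent words, different sentences
```

Because of the gap, "text" and "sentences" are adjacent in the original file but 101 positions apart in the index, so a proximity match with a small slop only succeeds inside a single sentence.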

if you insert special sentence marker symbols, then you could use a span search to guarantee that a phrase happens inside a sentence. when a match occurs, you can use the document's TermPositionVector object to re-create the original sentence, or alternatively, use the embedded sentence number in lucene (perhaps symbols like "__sentence_start" and "__sentence_num_20") to grab the original sentence from a file containing the full text with sentence markers (perhaps xml tags: "<sentence num=20>").
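The marker approach can be sketched the same way, again in plain Python rather than with Lucene's span queries. The marker strings follow the examples in this message; the function names `tokenize_with_markers` and `sentences_matching` are hypothetical, and the sentence list stands in for the external file of marked-up text.

```python
import re

START = "__sentence_start"
NUM_PREFIX = "__sentence_num_"


def tokenize_with_markers(sentences):
    """Build one token stream per document, with marker tokens before each sentence."""
    tokens = []
    for num, sentence in enumerate(sentences):
        tokens.append(START)
        tokens.append(f"{NUM_PREFIX}{num}")
        tokens.extend(re.findall(r"\w+", sentence.lower()))
    return tokens


def sentences_matching(tokens, phrase):
    """Return the sentence numbers in which the phrase occurs as adjacent tokens.

    The markers sit between sentences in the stream, so a phrase can never
    match across a sentence boundary.
    """
    words = re.findall(r"\w+", phrase.lower())
    hits = []
    current = None
    for i, tok in enumerate(tokens):
        if tok.startswith(NUM_PREFIX):
            current = int(tok[len(NUM_PREFIX):])  # remember which sentence we are in
        elif words and tokens[i:i + len(words)] == words:
            hits.append(current)
    return hits


sentences = ["Lucene can index sentences.", "Span queries respect markers."]
stream = tokenize_with_markers(sentences)
for num in sentences_matching(stream, "respect markers"):
    print(num, sentences[num])  # look the original sentence up by its number
```

The sentence number recovered from the marker is what lets you fetch the original text from outside the index, which keeps the index itself smaller.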

I use techniques such as the above for a very large lucene index of documents with embedded sentence markers. There are various trade-offs in terms of index size (how much info to keep in the index), expected query performance, and so on.


---marc hadfield



AJ Chen wrote:

I'll appreciate any advice on whether Lucene is appropriate for index/search
sentences.  I have millions of documents broken down into millions of
sentences. Each sentence does not exist as a document.  All these sentences
are in a small number of big files. How can I use Lucene to index/search the
sentences? Search will return which sentence matches the query.  If Lucene
does not do it, any better approach besides using mysql database?

Thanks,
AJ


