Hi AJ -
Depending on your needs, you could create a Lucene document for each
sentence (in which case searching and returning sentences is trivial),
or create a Lucene document for each of your documents, with embedded
sentence start/stop markers (as a special symbol). Or, instead of a
special symbol, you can increase the token position (the position
increment) after each end-of-sentence so that there is a large gap
between sentences. This will give higher scores to intra-sentence
matches, because proximity-based queries take the distance between term
positions into account.
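
To make the position-gap idea concrete, here is a minimal sketch in plain
Java. It does not use the Lucene API; the class and method names
(SentenceGapPositions, assignPositions, near) and the gap of 100 are
made up for illustration, and the tokenizer is a naive split on
non-word characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an index that records token positions.
public class SentenceGapPositions {

    // Assign a position to every token; skip ahead by `gap` after each sentence.
    public static Map<String, List<Integer>> assignPositions(List<String> sentences, int gap) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int pos = 0;
        for (String sentence : sentences) {
            for (String token : sentence.toLowerCase().split("\\W+")) {
                if (token.isEmpty()) {
                    continue;
                }
                positions.computeIfAbsent(token, k -> new ArrayList<>()).add(pos);
                pos++;
            }
            pos += gap; // the large gap between sentences
        }
        return positions;
    }

    // True if some occurrence of a and some occurrence of b are at most
    // maxDistance positions apart (a crude proximity test).
    public static boolean near(Map<String, List<Integer>> positions,
                               String a, String b, int maxDistance) {
        List<Integer> empty = Collections.emptyList();
        for (int pa : positions.getOrDefault(a, empty)) {
            for (int pb : positions.getOrDefault(b, empty)) {
                if (Math.abs(pa - pb) <= maxDistance) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "The quick brown fox jumps.", "A lazy dog sleeps.");
        Map<String, List<Integer>> index = assignPositions(sentences, 100);
        // Both terms are in the first sentence, two positions apart.
        System.out.println(near(index, "quick", "fox", 5));
        // The terms are adjacent in the text but separated by the sentence gap.
        System.out.println(near(index, "jumps", "lazy", 5));
    }
}
```

With a gap of 0 the second check would succeed, which is exactly the
cross-sentence match the gap is meant to push away.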
If you insert special sentence marker symbols, then you could use a span
search to guarantee that a phrase occurs inside a single sentence. When a
match occurs, you can use the document's TermPositionVector object to
re-create the original sentence or, alternatively, use the sentence
number embedded in the index (perhaps symbols like "__sentence_start" and
"__sentence_num_20") to grab the original sentence from a file
containing the full text with sentence markers (perhaps XML tags:
"<sentence num=20>").
I use techniques such as the above for a very large Lucene index of
documents with embedded sentence markers. There are various trade-offs
in terms of index size (how much information to keep in the index),
expected query performance, and so on.
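
The sentence-marker variant described above can be sketched the same
way. Again this is plain Java rather than Lucene's span query API, and
the names (SentenceMarkerSearch, tokenizeWithMarkers, findSentence) are
hypothetical. It assumes one "__sentence_num_N" marker token in front of
each sentence, so a phrase can never match across a marker, and the
marker of the matching sentence gives the number to look up in the
original text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: one token stream per document with numbered sentence markers.
public class SentenceMarkerSearch {
    static final String MARKER_PREFIX = "__sentence_num_";

    // Put a numbered marker token in front of the tokens of each sentence.
    public static List<String> tokenizeWithMarkers(List<String> sentences) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) {
            tokens.add(MARKER_PREFIX + i);
            for (String t : sentences.get(i).toLowerCase().split("\\W+")) {
                if (!t.isEmpty()) {
                    tokens.add(t);
                }
            }
        }
        return tokens;
    }

    // Return the number of the first sentence containing the phrase as
    // consecutive tokens, or -1. A marker between two tokens breaks adjacency,
    // so a match is always inside one sentence.
    public static int findSentence(List<String> tokens, List<String> phrase) {
        if (phrase.isEmpty()) {
            return -1;
        }
        int current = -1;
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.startsWith(MARKER_PREFIX)) {
                current = Integer.parseInt(t.substring(MARKER_PREFIX.length()));
                continue;
            }
            int end = i + phrase.size();
            if (end <= tokens.size() && tokens.subList(i, end).equals(phrase)) {
                return current;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "Lucene indexes documents.", "Documents contain sentences.");
        List<String> tokens = tokenizeWithMarkers(sentences);
        int hit = findSentence(tokens, Arrays.asList("documents", "contain"));
        // Use the sentence number to fetch the original text.
        System.out.println(hit >= 0 ? sentences.get(hit) : "no match");
        // "documents documents" only occurs across the sentence boundary.
        System.out.println(findSentence(tokens, Arrays.asList("documents", "documents")));
    }
}
```

Here the sentence list stands in for the file of marked-up full text; in
a real setup the lookup by sentence number would go against that file.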
---marc hadfield
AJ Chen wrote:
I'd appreciate any advice on whether Lucene is appropriate for indexing
and searching sentences. I have millions of documents broken down into
millions of sentences. Each sentence does not exist as a separate
document. All these sentences are in a small number of big files. How can
I use Lucene to index and search the sentences? A search should return
which sentences match the query. If Lucene cannot do this, is there a
better approach than using a MySQL database?
Thanks,
AJ