Re: index phrases

Roxana Angheluta Wed, 22 Jun 2005 00:43:33 -0700

Dear all,
I am using Lucene for indexing documents.
I would like to include phrases (of a certain maximum length, given as a parameter) in the index. I know this is non-standard, e.g. for searching, where a PhraseQuery can be built which makes use of the term positions. However, I am not interested in searching, but rather in using the indexing terms for some statistics.
What would be an efficient way to do this? Is it possible to build phrases in a filter after tokenization?
Roxana, could you give us a concrete example of what you're wanting to do?
A TokenFilter could certainly be used to aggregate multiple terms into a single term that represents a phrase. This would happen during the analysis process, which occurs along with tokenization.

Hi Erik, thanks for the answer.
I would like to index the following document:

This is a sample document.

something like:
"this"
"is"
"a"
"sample"
"document"
"this is"
"is a"
"a sample"
"this a"
"is sample"
"a document"
"sample document"
"this is a"
"is a sample"
"a sample document"

In this example the maximum length of an n-gram is 3 and the length of the moving window across the text is also 3.
In fact I would like a full analyzer to do the job, i.e. define a strategy to filter out/clean spurious n-grams: e.g. remove n-grams made up only or partially of stopwords, or eliminate just the stopwords from the n-gram.
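
For readers following the thread, here is a minimal sketch in plain Java of the n-gram scheme described above. It is not Sebastian's code and it does not use the Lucene API. It only shows the enumeration that a TokenFilter would have to perform over an already tokenized, lowercased document. The class name WindowedNGrams, the StopwordPolicy enum and the method names are hypothetical, and the three policies are one possible reading of the cleaning strategies mentioned.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: enumerates windowed n-grams over a token list.
final class WindowedNGrams {

    // Assumed interpretations of the stopword strategies described in the message.
    enum StopwordPolicy { DROP_IF_ALL_STOPWORDS, DROP_IF_ANY_STOPWORD, STRIP_STOPWORDS }

    // Returns every ordered subsequence of at most maxLength tokens whose
    // first and last token lie within `window` consecutive positions.
    static List<String> generate(List<String> tokens, int maxLength, int window) {
        if (maxLength < 1 || window < 1) {
            throw new IllegalArgumentException("maxLength and window must be >= 1");
        }
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            int end = Math.min(tokens.size(), start + window); // exclusive bound of the window
            List<String> current = new ArrayList<>();
            current.add(tokens.get(start));
            extend(tokens, start + 1, end, maxLength, current, out);
        }
        return out;
    }

    // Emits the current n-gram, then tries to append each later token of the window.
    private static void extend(List<String> tokens, int next, int end, int maxLength,
                               List<String> current, List<String> out) {
        out.add(String.join(" ", current));
        if (current.size() == maxLength) {
            return;
        }
        for (int i = next; i < end; i++) {
            current.add(tokens.get(i));
            extend(tokens, i + 1, end, maxLength, current, out);
            current.remove(current.size() - 1);
        }
    }

    // Applies one stopword policy to a list of space-separated n-grams.
    static List<String> filter(List<String> ngrams, Set<String> stopwords, StopwordPolicy policy) {
        List<String> kept = new ArrayList<>();
        for (String ngram : ngrams) {
            List<String> content = new ArrayList<>();
            int total = 0;
            for (String word : ngram.split(" ")) {
                total++;
                if (!stopwords.contains(word)) {
                    content.add(word);
                }
            }
            switch (policy) {
                case DROP_IF_ALL_STOPWORDS:
                    if (!content.isEmpty()) kept.add(ngram);
                    break;
                case DROP_IF_ANY_STOPWORD:
                    if (content.size() == total) kept.add(ngram);
                    break;
                case STRIP_STOPWORDS:
                    // May produce duplicates, e.g. "a sample" and "sample" both become "sample".
                    if (!content.isEmpty()) kept.add(String.join(" ", content));
                    break;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("this", "is", "a", "sample", "document");
        Set<String> stopwords = new HashSet<>(Arrays.asList("this", "is", "a"));
        List<String> grams = generate(tokens, 3, 3);
        System.out.println(grams);
        System.out.println(filter(grams, stopwords, StopwordPolicy.DROP_IF_ALL_STOPWORDS));
    }
}
```

With a maximum length of 3 and a window of 3, the sketch produces the same fifteen n-grams as the listing above, although grouped by start position rather than by length. Inside Lucene the same enumeration would presumably sit in a TokenFilter that buffers the last few tokens of the stream, as suggested earlier in the thread.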


Sebastian has kindly provided his code, which does the job.

roxana
