Dear all,

I am using Lucene for indexing documents.

I would like to include phrases (up to a maximum length given as a parameter) in the index. I know this is non-standard: for searching, for example, a PhraseQuery can be built that makes use of the term positions. However, I am not interested in searching, but rather in using the index terms for some statistics.

What would be an efficient way to do this? Is it possible to build phrases in a filter after tokenization?

Roxana, could you give us a concrete example of what you want to do?

A TokenFilter could certainly be used to aggregate multiple terms into a single term that represents a phrase. This would happen during the analysis process, which occurs along with tokenization.

Hi Erik, thanks for the answer.
I would like to index the following document:

This is a sample document.

as something like:
"this"
"is"
"a"
"sample"
"document"
"this is"
"is a"
"a sample"
"this a"
"is sample"
"a document"
"sample document"
"this is a"
"is a sample"
"a sample document"

In this example the maximum length of an n-gram is 3, and the length of the window moving across the text is also 3, which is why non-adjacent pairs such as "this a" appear. In fact, I would like a full analyzer to do the job, i.e. one that defines a strategy for filtering out or cleaning spurious n-grams, e.g. removing n-grams made up only or partially of stopwords, or stripping just the stopwords from an n-gram.
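
For readers following along, the enumeration described above can be sketched in plain Java. This is only an illustration of the logic. It is not the code Sebastian provided, and it does not use any Lucene API. The class name WindowedNGrams, the method generate, and the policy of dropping n-grams that consist solely of stopwords are all assumptions made for the example. Inside Lucene, the same logic would presumably sit in a TokenFilter applied after tokenization.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
class WindowedNGrams {

    /**
     * Returns every n-gram of 1..maxN tokens (in original order) whose
     * tokens all lie within `window` consecutive positions of the input.
     * Assumed policy: n-grams made up only of stopwords are dropped.
     */
    static List<String> generate(List<String> tokens, int maxN, int window,
                                 Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            // Exclusive end of the window anchored at `start`.
            int end = Math.min(tokens.size(), start + window);
            List<String> current = new ArrayList<>();
            current.add(tokens.get(start));
            extend(tokens, start + 1, end, maxN, current, stopwords, out);
        }
        return out;
    }

    // Emits `current`, then grows it with each later token in the window.
    private static void extend(List<String> tokens, int next, int end, int maxN,
                               List<String> current, Set<String> stopwords,
                               List<String> out) {
        // Keep the n-gram unless every one of its tokens is a stopword.
        if (!stopwords.containsAll(current)) {
            out.add(String.join(" ", current));
        }
        if (current.size() == maxN) {
            return;
        }
        for (int i = next; i < end; i++) {
            current.add(tokens.get(i));
            extend(tokens, i + 1, end, maxN, current, stopwords, out);
            current.remove(current.size() - 1); // backtrack
        }
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("this", "is", "a", "sample", "document");
        // Empty stopword set: nothing is filtered out.
        System.out.println(generate(tokens, 3, 3, Set.of()));
    }
}
```

With an empty stopword set, maxN = 3 and window = 3, this yields the same 15 terms as the list above, though in a different order. Because each n-gram is anchored at its first token, every combination of positions is emitted exactly once even though the windows overlap. Other cleaning strategies, such as dropping n-grams that contain any stopword, would only change the test in extend.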

Sebastian has kindly provided his code, which does the job.

roxana
