Dear all,
I am using Lucene for indexing documents.
I would like to include phrases (of a certain maximum length given
as a parameter) in the index. I know this is non-standard for e.g.
searching, where a PhraseQuery can be built which makes use of the
terms positions. However, I am not interested in searching, but
rather in using the indexing terms for some statistics.
What would be an efficient way to do this? Is it possible to build
phrases in a filter after tokenization?
Roxana- could you give us a concrete example of what you're wanting
to do?
A TokenFilter could certainly be used to aggregate multiple terms
into a single term that represents a phrase. This would happen
during the analysis process, which occurs along with tokenization.
Hi Erik, thanks for the answer.
I would like to index the following document:
This is a sample document.
something like:
"this"
"is"
"a"
"sample"
"document"
"this is"
"is a"
"a sample"
"this a"
"is sample"
"a document"
"sample document"
"this is a"
"is a sample"
"a sample document"
In this example the maximum length of an n-gram is 3 and the length of
the moving window accross text is also 3.
In fact I would like a full analyzer to do the job, i.e. define a
strategy to filter out/clean spurious n-grams: e.g. remove n-grams made
out only/partially of stopwords, eliminate just stopwords from the n-gram.
Sebastian has kindly provided his code, which does the job.
roxana
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]