On Tue, Jun 21, 2005 at 04:01:41PM +0200, Roxana Angheluta wrote:
> I would like to include phrases (of a certain maximum length given as a
> parameter) in the index. I know this is non-standard for e.g. searching,
> where a PhraseQuery can be built which makes use of the terms positions.
> However, I am not interested in searching, but rather in using the
> indexing terms for some statistics.

I built a filter for exactly this purpose (statistics over word
combinations); I will send it to you directly. It uses some containers
from org.apache.commons.collections, so I guess it is not yet suitable
for inclusion in the Lucene distribution.

The filter handles position increments > 1 by inserting fillers. For
example, if the words "at" and "my" are filtered out of the phrase
"look at my car" and "car" is assigned a position increment of 3, the
resulting phrase is "look _ _ car". However, since StopFilter does not
currently produce position increments > 1, you will hardly ever see
such fillers in practice.

The filter does not handle incoming position increments == 0. (It will
produce them itself, though, for ngrams that start at the same
position.)

I think there is also some stuff for ngram indexing in Nutch.

Regards, Sebastian

-- 
Sebastian Kirsch <[EMAIL PROTECTED]> [http://www.sebastian-kirsch.org/]
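
To illustrate the filler behaviour described above, here is a minimal
sketch in plain Python. It is not the filter mentioned in this message
and does not use the Lucene API; the names FILLER, expand_positions and
ngrams are hypothetical. It also assumes that ngrams beginning or ending
with a filler are skipped, and that the first ngram emitted at a start
position carries the distance to the previous start while the others
get an increment of 0. The message does not specify either detail.

```python
FILLER = "_"  # hypothetical filler symbol, as in the "look _ _ car" example


def expand_positions(tokens):
    """Turn (term, position_increment) pairs into one slot per position.

    An increment of k > 1 leaves k - 1 empty positions before the term;
    each of them is filled with FILLER.
    """
    slots = []
    for term, increment in tokens:
        if increment < 1:
            # Mirrors the stated limitation: increments of 0 are not handled.
            raise ValueError("position increments < 1 are not handled")
        slots.extend([FILLER] * (increment - 1))
        slots.append(term)
    return slots


def ngrams(tokens, max_len):
    """Return (phrase, position_increment) pairs for ngrams up to max_len.

    Assumption: ngrams that start or end with a filler are skipped.
    """
    slots = expand_positions(tokens)
    result = []
    last_start = -1  # so the very first ngram gets increment 1
    for start, first in enumerate(slots):
        if first == FILLER:
            continue
        emitted_here = False
        for n in range(1, max_len + 1):
            end = start + n
            if end > len(slots):
                break
            if slots[end - 1] == FILLER:
                continue
            # First ngram at this position advances; the rest share it.
            increment = 0 if emitted_here else start - last_start
            result.append((" ".join(slots[start:end]), increment))
            emitted_here = True
        if emitted_here:
            last_start = start
    return result


if __name__ == "__main__":
    # "at" and "my" were removed upstream; "car" carries increment 3.
    for phrase, increment in ngrams([("look", 1), ("car", 3)], max_len=4):
        print(increment, phrase)
```

Calling ngrams([("look", 1), ("car", 3)], max_len=4) yields the unigrams
"look" and "car" plus the filled phrase "look _ _ car", the latter with
an increment of 0 because it starts at the same position as "look".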