On Fri, Jan 27, 2012 at 10:41 AM, Saurabh Gokhale <saurabhgokh...@gmail.com> wrote:
> I wanted to check whether n-gramming the document contents (space is not
> the issue) would do any good for better matching. Currently I see n-grams
> mostly used for autocomplete or spell checking, but are they useful for
> similarity search?
I think this might actually make things worse. We find that normal tokenisation works well enough (we even stem; people say you shouldn't stem for similarity, but it doesn't seem to hurt much). The "ideal" way is probably to do straight tokenisation and then shingle the tokens that come out, so that you get three(?) words per token.

TX
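To make the shingling step concrete, here is a minimal sketch of turning a token stream into three-word shingles. This is a plain illustration of the idea, not Lucene code; in Lucene itself you would chain a `ShingleFilter` onto your tokenizer, and the function name and window size below are just for demonstration.

```python
def shingles(tokens, n=3):
    # Slide a window of n tokens over the stream;
    # each window becomes one "shingle" token.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Straight whitespace tokenisation, then shingle:
tokens = "the quick brown fox jumps over the lazy dog".split()
print(shingles(tokens))
```

Each shingle preserves local word order, which is what makes shingled tokens a better similarity signal than character n-grams for whole-document matching.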