On Fri, Jan 27, 2012 at 10:41 AM, Saurabh Gokhale <saurabhgokh...@gmail.com> wrote:
> I wanted to check whether n-gramming the document contents (space is not
> the issue) would do any good for better matching. Currently I see n-grams
> mostly used for autocomplete or spell checking, but are they useful for
> similarity search?
I think this might actually make things worse. We find that normal tokenisation works well enough (we even stem; people say you shouldn't stem for similarity, but it doesn't seem to hurt much). The "ideal" way is probably to do straight tokenisation and then shingle the tokens that come out, so that you get three(?) words per token.

TX
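To make the shingling step concrete, here is a minimal sketch of turning a token stream into three-word shingles. This is a plain illustration of the idea, not Lucene code; in Lucene itself you would chain a `ShingleFilter` onto your tokenizer, and the function name and window size below are just for demonstration.

```python
def shingles(tokens, n=3):
    # Slide a window of n tokens over the stream;
    # each window becomes one "shingle" token.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Straight whitespace tokenisation, then shingle:
tokens = "the quick brown fox jumps over the lazy dog".split()
print(shingles(tokens))
```

Each shingle preserves local word order, which is what makes shingled tokens a better similarity signal than character n-grams for whole-document matching.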