Re: FastSSFuzzy for faster fuzzy queries in Lucene

Robert Muir Mon, 05 Jan 2009 20:53:08 -0800

hi,

although i've been trying to get my code into shape to upload to jira
(holidays got in the way a bit), I guess i think there might be some issues
making my implementation work for general use.

i based my design on certain assumptions, such as the fact I don't update
indexes. once my index is built i enumerate all the terms and build the
auxiliary fastssWc index, a lucene index with deletion neighborhoods.

if you do index updates though, then you would have to have some hook to
index any new terms that are added as a result of these documents. so maybe
its a good idea if we can think of a better approach rather than having a
secondary lucene index.

there's also some practical stuff about my implementation i don't like, like
for one i'm not able to reuse the existing fuzzy/wildcard expansion
infrastructure (FilteredTermEnum) because, well, its entire design revolves
around sequential enumeration.

im still willing to put my stuff in jira though, although i'm still trying
to think of some cleaner implementation... just would hate to spend a lot of
time doing it if there are some better ideas...

to answer your question the additional space requirements for k=1 are pretty
small as they are based on number of unique tokens and length of unique
tokens, and creating the additional data structures (auxiliary index in my
case) is very fast. for example i have one index that is about 1GB and the
additional space for FastSS is 50MB, but this will vary. its just the
logistics of making a clean contrib module that is easy to use and maintain.

On Mon, Jan 5, 2009 at 8:17 PM, Jason Rutherglen <jason.rutherg...@gmail.com
> wrote:

> Hello,
>
> I'm interested in getting FastSSFuzzy into Lucene, perhaps as a contrib
> module.  One question is how much would the index grow?  We've got a list of
> people's names we want to do spellchecking on for example.
>
> -J
>
>
>

-- 
Robert Muir
rcm...@gmail.com

Re: FastSSFuzzy for faster fuzzy queries in Lucene

Reply via email to