hi, although i've been trying to get my code into shape to upload to jira (holidays got in the way a bit), I guess i think there might be some issues making my implementation work for general use.
i based my design on certain assumptions, such as the fact I don't update indexes. once my index is built i enumerate all the terms and build the auxiliary fastssWc index, a lucene index with deletion neighborhoods. if you do index updates though, then you would have to have some hook to index any new terms that are added as a result of these documents. so maybe its a good idea if we can think of a better approach rather than having a secondary lucene index. there's also some practical stuff about my implementation i don't like, like for one i'm not able to reuse the existing fuzzy/wildcard expansion infrastructure (FilteredTermEnum) because, well, its entire design revolves around sequential enumeration. im still willing to put my stuff in jira though, although i'm still trying to think of some cleaner implementation... just would hate to spend a lot of time doing it if there are some better ideas... to answer your question the additional space requirements for k=1 are pretty small as they are based on number of unique tokens and length of unique tokens, and creating the additional data structures (auxiliary index in my case) is very fast. for example i have one index that is about 1GB and the additional space for FastSS is 50MB, but this will vary. its just the logistics of making a clean contrib module that is easy to use and maintain. On Mon, Jan 5, 2009 at 8:17 PM, Jason Rutherglen <jason.rutherg...@gmail.com > wrote: > Hello, > > I'm interested in getting FastSSFuzzy into Lucene, perhaps as a contrib > module. One question is how much would the index grow? We've got a list of > people's names we want to do spellchecking on for example. > > -J > > > -- Robert Muir rcm...@gmail.com