The WhitespaceAnalyzer breaks up text on spaces, tabs, and newlines. After that, you can use wildcard queries. This approach will use very little space. I believe leading and trailing wildcards are supported now, right?
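To make the idea concrete, here is a rough plain-Java sketch of whitespace tokenization followed by wildcard matching on each token. It is only an illustration: `WildcardSketch`, `tokenize`, `compile` and `search` are made-up names, not Lucene APIs, and splitting on `\s+` only approximates what a whitespace tokenizer does. It assumes `*` stands for any run of characters and `?` for exactly one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration in plain Java; none of these names are Lucene APIs. */
final class WildcardSketch {

    /** Splits on runs of whitespace (spaces, tabs, newlines). */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t); // skip the empty piece produced by leading whitespace
            }
        }
        return tokens;
    }

    /** Translates '*' (any run of characters) and '?' (one character) into a regex. */
    static Pattern compile(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            if (c == '*' || c == '?') {
                flush(literal, regex);
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        flush(literal, regex);
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** Appends the pending literal text to the regex, quoted so it matches verbatim. */
    private static void flush(StringBuilder literal, StringBuilder regex) {
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
            literal.setLength(0);
        }
    }

    /** Returns every token of the text that matches the whole wildcard pattern. */
    static List<String> search(String text, String wildcard) {
        Pattern pattern = compile(wildcard);
        List<String> hits = new ArrayList<>();
        for (String token : tokenize(text)) {
            if (pattern.matcher(token).matches()) {
                hits.add(token);
            }
        }
        return hits;
    }
}
```

With this, a hotlist word `w` would be looked up as `WildcardSketch.search(text, "*" + w + "*")`. Note that a match can never span two tokens, since the wildcard is applied to one whitespace-delimited token at a time.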
On Sun, Aug 26, 2012 at 11:29 AM, Ilya Zavorin <izavo...@caci.com> wrote:
> The user uploads a set of text files, either all of them at once or one at a
> time, and then they will be searched locally on the phone against a set of
> "hotlist" words. This assumes no connection to any sort of server so
> everything must be done locally.
>
> I already have Lucene integrated so I might want to try the n-gram approach.
> But I just want to double-check first that it will work with any Unicode
> string, be it an English word, a foreign word, a sequence of digits or any
> random sequence of Unicode characters. In other words, this is not in any way
> language-dependent/-specific.
>
> Thanks,
>
> Ilya
>
> -----Original Message-----
> From: Dawid Weiss [mailto:dawid.we...@gmail.com]
> Sent: Sunday, August 26, 2012 3:55 AM
> To: java-user@lucene.apache.org
> Subject: Re: Efficient string lookup using Lucene
>
>> Does Lucene support this type of structure, or do I need to somehow
>> implement it outside Lucene?
>
> You'd have to implement it separately but it'd be much, much smaller than
> Lucene itself (even obfuscated).
>
>> By the way, I need this to run on an Android phone so size of memory might
>> be an issue...
>
> How large is your input? Do you need to index on the android or just read the
> index on it? These are all factors to take into account. I mentioned suffix
> trees and suffix arrays because these two are "canonical" data structures to
> perform any substring lookups in constant time (in fact, the lookup takes the
> number of elements of the matched input string, building the suffix tree/
> array is O(n), at least in theory).
>
> If you already have Lucene integrated in your pipeline then that n-gram
> approach will also work. If you know your minimum match substring length to
> be p then index p-sized shingles. For strings longer than p you can create a
> query which will search for all n-gram occurrences and take into account
> positional information to remove false matches.
>
> Dawid
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

--
Lance Norskog
goks...@gmail.com
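
For reference, the p-sized shingle approach Dawid describes in the quoted message can be sketched in plain Java, outside Lucene. This is a minimal illustration under stated assumptions, not how Lucene stores or queries n-grams: `NGramSketch` and `find` are hypothetical names, and the "index" is just an in-memory map from each p-character shingle to its start offsets. A query longer than p is split into its own shingles, and a hit is kept only if every shingle occurs at the expected consecutive offset, which is the positional check that removes false matches.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of a positional p-gram index; not a Lucene API. */
final class NGramSketch {
    private final int p; // assumed minimum match length
    private final Map<String, List<Integer>> postings = new HashMap<>();

    NGramSketch(String text, int p) {
        if (p < 1) {
            throw new IllegalArgumentException("p must be at least 1");
        }
        this.p = p;
        // Index every p-sized shingle together with its start offset.
        for (int i = 0; i + p <= text.length(); i++) {
            postings.computeIfAbsent(text.substring(i, i + p), k -> new ArrayList<>()).add(i);
        }
    }

    /** Returns the start offsets of the query; the query must be at least p long. */
    List<Integer> find(String query) {
        if (query.length() < p) {
            throw new IllegalArgumentException("query shorter than p");
        }
        List<Integer> hits = new ArrayList<>();
        List<Integer> first = postings.get(query.substring(0, p));
        if (first == null) {
            return hits;
        }
        for (int start : first) {
            boolean ok = true;
            // Every later shingle of the query must occur exactly k positions further on.
            for (int k = 1; ok && k + p <= query.length(); k++) {
                List<Integer> next = postings.get(query.substring(k, k + p));
                ok = next != null && next.contains(start + k);
            }
            if (ok) {
                hits.add(start);
            }
        }
        return hits;
    }
}
```

For example, with p = 3 the text "abcXbcd" contains both shingles of the query "abcd" ("abc" and "bcd"), but not at adjacent offsets, so `find("abcd")` returns no hits. The sketch compares Java `char` values only, so nothing in it is language-specific; a character outside the Basic Multilingual Plane simply counts as two units.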