Wait, I didn't mean to pad the entire string. If the string is broken on _ already, then NGramFilter already receives the individual terms and you can put a Filter in front that will pass through a padded token?
Shai On Fri, Jul 19, 2013 at 3:45 PM, Becker, Thomas <thomas.bec...@netapp.com>wrote: > In general the data for this field is that simple, but additional > characters are allowed beyond [a-z_]. Do I need to tokenize on whitespace? > I really don't know. Essentially, the question is whether we expect > "quota tom" to match quota_tom or not. I spoke to some colleagues and they > thought it should since both "quota" and "tom" are partial matches that > would AND together. Tokenizing the entire input whitespace and all > precludes this match. I'd appreciate some input from anyone on what the > best user experience would be here; I'm trying to operate on principle of > least surprise ;) > > With regard to the padding suggestion, I'm still not sure this will work. > Because again at indexing time there is typically no whitespace. So > padding "quota_tommy_1234" to "## quota_tommy_1234##" before trigramming is > not going to produce a to# token that I would need in order for "quota to" > to match. > > -----Original Message----- > From: Allison, Timothy B. [mailto:talli...@mitre.org] > Sent: Friday, July 19, 2013 7:58 AM > To: java-user@lucene.apache.org > Subject: RE: Partial word match using n-grams > > Got it...almost. > > Y. You're right. FuzzyQuery is not at all what you want. > > Don't know if your data is actually as simple as this example. Do you > need to tokenize on whitespace? Would it make sense to replace spaces in > the query with underscores and then trigramify the whole query as if it > were a single term? > > ________________________________________ > From: Becker, Thomas [thomas.bec...@netapp.com] > Sent: Thursday, July 18, 2013 8:59 PM > To: java-user@lucene.apache.org > Subject: RE: Partial word match using n-grams > > Thanks for the reply Tim. I really should have been clearer. Let's say I > have an object named "quota_tommy_1234". I'd like to match that object > with any 3 character (or more) substring of that name. So for example: > > quo > tom > 234 > quota > etc. > > Further, at search time I'm splitting input on whitespace before > tokenizing into PhraseQueries and then ANDing them together. So using the > example above I also want the following queries to match: > > quo tom > quo 234 > quota to <- this is the problem because there are no trigrams of "to" > > That said, in response to your points: > > 1) Not sure FuzzyQuery is what I need; I'm not trying to match via > misspellings, which is the main function of FuzzyQuery is it not? > > 2) The original names are all going to be > 3 characters, so there are no > 1 or 2 letter terms at indexing time. So generating the bigram "to" at > search time will never match anything, unless I switch to bigrams at > indexing time also, which is what I'm asking about. > > 3) Again the names are all > 3 characters so I don't need to pad at > indexing time. > > 4) Hopefully my explanation above clarifies. > > I should point out that I'm a Lucene novice and am not at all sure that > what I'm doing is optimal. But I have been impressed with how easy it is > to get something working very quickly! > > ________________________________________ > From: Allison, Timothy B. [talli...@mitre.org] > Sent: Thursday, July 18, 2013 7:49 PM > To: java-user@lucene.apache.org > Subject: RE: Partial word match using n-grams > > Tommy, > I'm sure that I don't fully understand your use case and your data. > Some thoughts: > > 1) I assume that fuzzy term search (edit distance <= 2) isn't meeting your > needs or else you wouldn't have gone the ngram route. If fuzzy term search > + phrase/proximity search would meet your needs, see if > ComplexPhraseQueryParser would work (although it looks like you're already > building your own queries). > > 2) Would it make sense to modify NGramFilter so that it outputs a bigram > for a two letter term and a unigram for a one letter term? Might be > messy...and "ab" in this scenario would never match "abc" > > 3) Would it make sense to pad your terms behind the scenes with > "##"...this would add bloat, but not nearly as much as variable gram sizes > with 1<= n <=3 > > ab -> ##ab## yields trigrams ##a, #ab, ab#, b## > > 4) How partial and what types of partial do you need? This is related to > 1). If minimum edit distance is sufficient; use it, especially with the > blazing fast automaton (thank you, Robert Muir). If you have a smallish > dataset you might consider allowing leading wildcards so that you could > easily find all words, for example, containing abc with *abc*. If your > dataset is larger, you might consider something like > ReversedWildcardFilterFactory (Solr) to speed this type of matching. > > I look forward to other opinions from the list. > > -----Original Message----- > From: Becker, Thomas [mailto:thomas.bec...@netapp.com] > Sent: Thursday, July 18, 2013 3:55 PM > To: java-user@lucene.apache.org > Subject: Partial word match using n-grams > > One of our main use-cases for search is to find objects based on partial > name matches. I've implemented this using n-grams and it works pretty > well. However we're currently using trigrams and that causes an > interesting problem when searching for things like "abc ab" since we first > split on whitespace and then construct PhraseQuerys containing each trigram > yielded by the "word". Obviously we cannot get a trigram out of "ab". So > our choices would seem to be either discard this part of the search term > which seems unwise, or to reduce the minimum n-gram size. But I'm slightly > concerned about the resulting bloat in both the of number of Terms stored > in the index as well as contained in queries. Is this something I should > be concerned about? It just "feels" like a query for the word "abcdef" > shouldn't require a PhraseQuery of 15 terms (assuming n-grams 1,3). Is > this the best way to do partial word matches? Thanks in advance. > > -Tommy > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >