In our case (very similar to the "Netflix movie titles" use case) the AnalyzingSuggester's FST grows by a factor of ~5 when we generate the token graph.
Looking up and joining individual "postings lists" for the individual tokens would certainly work, but is certainly more work than injecting a token graph generator into the index analyzer's token filter chain :) (or modifying TokenStreamToAutomaton to generate the additional transitions, but that may be too low-level).

Cheers, Oli

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Wednesday, January 16, 2013 5:38 PM
To: java-user@lucene.apache.org
Subject: Re: Suggesters: circumfix suggestions

Netflix also does this, eg type transla (you need an account).

I think it'd be good to somehow support this (Lucene's suggesters don't today).

The first two approaches should conceptually work, but both will bloat the FST (I'd be curious to know how much!).

Maybe another approach would be ... to index only single tokens into the suggester? And then, from the user's query, run the suggester on each token separately, and then do a second search (against a "normal" Lucene index) to find all documents containing those tokens?

Eg, you'd index only "boston", "red", "sox", "rumor" into the FST, and then have a separate search index with "boston red sox rumor" indexed as a document. If the user types "red so", then you run suggest on "red" and on "so", and then run a hmm MultiPhraseQuery for (red|redmond|reddit) (so|sox|sophomore|...) against the index?

How to score/sort the resulting hits will be interesting ... if you have strong priors / boost (e.g. you have a good source of "popularity" or something) then you could sort by that ...

Mike McCandless

http://blog.mikemccandless.com

On Wed, Jan 16, 2013 at 4:27 PM, Oliver Christ <ochr...@ebscohost.com> wrote:
> Hi,
>
> Has anyone tried to implement circumfix suggesters, where the suggestion is a circumfix of the lookup string?
>
> E.g. "sox rumor" suggests "boston red sox rumors" (try it on google.com).
>
> I think there are several ways to implement this:
>
> * Given some multiword term, add all word subsequences to the suggester individually ("boston red sox rumors" also adds "red sox rumors", "sox rumors", "rumors") - that can be achieved using a special TermFreqIterator. This turns the lookup problem into a standard prefix search. While this works, it effectively modifies the surface form, and the "full term" needs to be indexed and looked up elsewhere.
>
> * Constructing a token graph with appropriate substring arcs from the (hopefully linear) token sequence, using a special TokenFilter. The benefit is that the surface form is always the same, but the automaton may become large (at least if you are using an AnalyzingSuggester).
>
> * DIY, using suffix arrays or something similar.
>
> But I'm sure there are other ways and/or tradeoffs I haven't thought about :) I'd be interested in your feedback.
>
> Cheers, Oli

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
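
For illustration only, here is a minimal plain-Java sketch of the first approach from the original message: every word-level suffix of a term becomes a lookup key, so the circumfix lookup reduces to a prefix search. It deliberately uses no Lucene classes. The class name WordSuffixSuggestSketch and its methods are hypothetical, a TreeMap stands in for the FST, and each key maps back to the full surface form (the "looked up elsewhere" part). It is a sketch under those assumptions, not how any Lucene suggester is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical stand-in for the "add all word suffixes" idea; not a Lucene API. */
class WordSuffixSuggestSketch {

    // Word-suffix key -> full surface forms that end with that key.
    private final NavigableMap<String, Set<String>> index = new TreeMap<>();

    /** Registers every word-level suffix of the term as a key for the full term. */
    void add(String surfaceForm) {
        String normalized = surfaceForm.trim().toLowerCase(Locale.ROOT);
        if (normalized.isEmpty()) {
            return;
        }
        String[] words = normalized.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            String key = String.join(" ", Arrays.copyOfRange(words, start, words.length));
            index.computeIfAbsent(key, k -> new TreeSet<>()).add(surfaceForm);
        }
    }

    /** Standard prefix search over the suffix keys; returns full surface forms, sorted. */
    List<String> lookup(String typed) {
        String prefix = typed.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
        Set<String> hits = new TreeSet<>();
        // Keys that start with the prefix lie in [prefix, prefix + '\uffff').
        String upper = prefix + Character.MAX_VALUE;
        for (Set<String> forms : index.subMap(prefix, true, upper, false).values()) {
            hits.addAll(forms);
        }
        return new ArrayList<>(hits);
    }

    /** Number of keys stored; grows with the number of words per term. */
    int keyCount() {
        return index.size();
    }

    public static void main(String[] args) {
        WordSuffixSuggestSketch suggester = new WordSuffixSuggestSketch();
        suggester.add("boston red sox rumors");
        System.out.println(suggester.lookup("sox rumor"));
        System.out.println(suggester.keyCount());
    }
}
```

Note that one four-word term already produces four keys, which is the same kind of growth discussed above for the FST. Lookups only match at word boundaries, so "ed sox" finds nothing.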