> > How much do you know about the frequency of synonym updates for this
> > synonym source API?
The "synonyms", or phrases alternatives, are built from a dynamic DB that is updated along with content updates. I suspect that after e.g. the initial crawl it will already contain most word alternatives, and then most likely to update at a very low rate. I think I'll go with index-time synonyms because I'm not sure running queries with even few hundred terms is going to perform well. And if the synonyms could be divided to Equivalence Classes, I'll index a class ID instead of all the words. Thanks guys for the comments! Shai On Thu, Jul 18, 2013 at 9:18 PM, SUJIT PAL <[email protected]> wrote: > Hi Shai, > > We have a setup where we annotate terms in documents with concepts. Each > concept has a number of synonyms. During indexing we annotate terms in the > document (can be multi-word) with the concept ID at the same offset as the > matched term. For us, we just need to place a single additional term at the > same offset, but you could extend this to placing all the synonyms your > service returns instead. I got the idea for this from the Synonym filter > described in the LIA2 book. > > Advantage of this is that you dont run into either the URL size limit or > the boolean clauses limit, and more importantly, you factor in the value of > context since you are dealing with full sentences rather than terms when > deciding synonyms - for example, the word "lark" could be a bird or a prank > depending on the context. > > -sujit > > On Jul 18, 2013, at 6:12 AM, Jack Krupansky wrote: > > > Maybe a custom search component would be in order, to “enrich” the > incoming query. Again, preprocessing the query for synonym expansion before > Solr parses it. It could call the external synonym API and cache synonyms > as well. > > > > But, I’d still lean towards preprocessing in an application layer. > Although, for hundreds or thousands of synonyms it would probably hit the > 2048 common limit for URLs in some containers, which would need to be > raised. > > > > -- Jack Krupansky > > > > From: Shai Erera > > Sent: Thursday, July 18, 2013 8:54 AM > > To: [email protected] > > Subject: Re: Programmatic Synonyms Filter (Lucene and/or Solr) > > > > The examples I've seen so far are single words. But I learned today > something new .. the number of "synonyms" returned for a word may be in the > range of hundreds, sometimes even thousands. > > So I'm not sure query-time synonyms may work at all .. what do you think? > > > > Shai > > > > > > On Thu, Jul 18, 2013 at 3:21 PM, Jack Krupansky <[email protected]> > wrote: > > Your best bet is to preprocess queries and expand synonyms in your own > application layer. The Lucene/Solr synonym implementation, design, and > architecture is fairly lightweight (although FST is a big improvement) and > not architected for large and dynamic synonym sets. > > > > Do you need multi-word phrase synonyms as well, or is this strictly > single-word synonyms? > > > > -- Jack Krupansky > > > > From: Shai Erera > > Sent: Thursday, July 18, 2013 1:36 AM > > To: [email protected] > > Subject: Programmatic Synonyms Filter (Lucene and/or Solr) > > > > Hi > > > > I was asked to integrate with a system which provides synonyms for words > through API. I checked the existing synonym filters in Lucene and Solr > and they all seem to take a synonyms map up front. > > > > E.g. Lucene's SynonymFilter takes a SynonymMap which exposes an FST, so > it's not really programmatic in the sense that I can provide an impl which > will pull the synonyms through the other system's API. 
> > Solr's SynonymFilterFactory just loads the synonyms from a file into a SynonymMap, and then uses Lucene's SynonymFilter, so it doesn't look like I can extend that one either.
> >
> > The problem is that the synonyms DB I should integrate with is HUGE and will probably not fit in RAM (SynonymMap). Nor is it currently possible to pull all available synonyms from it in one go. The API I have is something like String[] getSynonyms(String word).
> >
> > So I have a few questions:
> >
> > 1) Did I miss a Filter which does take a programmatic syn-map to which I can provide my own impl?
> >
> > 2) If not, would it make sense to modify SynonymMap to offer a getSynonyms(word) API (using BytesRef / CharsRef of course), with an FSTSynonymMap default impl, so that users can provide their own impl, e.g. one not requiring everything to be in RAM?
> >
> > 2.1) A side-effect benefit, I think, is that we won't require everyone to deal with the FST API that way, though I'll admit I cannot think of many use cases for not using SynonymFilter as-is ...
> >
> > 3) If the answer to (1) and (2) is NO, I guess my only option is to implement my own SynonymFilter, copying most of the code from Lucene's ... right?
> >
> > Shai
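For illustration, here is a rough sketch of the index-time direction discussed in this thread (Sujit's concept annotation and Shai's equivalence-class idea, i.e. essentially option 3): a custom TokenFilter that asks the external service for a class ID and injects it at the same position as the original token. SynonymService and getClassId are hypothetical stand-ins for the remote API, and the Lucene 4.x TokenStream API is assumed.

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.util.AttributeSource;

    /**
     * Sketch only: injects the equivalence-class ID returned by an external
     * synonym service as an extra token at the same position as the original
     * term. SynonymService / getClassId are hypothetical.
     */
    public final class ClassIdInjectFilter extends TokenFilter {

      /** Hypothetical wrapper around the remote synonyms API. */
      public interface SynonymService {
        /** Returns an equivalence-class ID for the word, or null if there is none. */
        String getClassId(String word);
      }

      private final SynonymService service;
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);

      private String pendingClassId;            // class ID waiting to be emitted
      private AttributeSource.State savedState; // attributes of the token it belongs to

      public ClassIdInjectFilter(TokenStream input, SynonymService service) {
        super(input);
        this.service = service;
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (pendingClassId != null) {
          // Emit the class ID at the same position/offsets as the original token.
          restoreState(savedState);
          termAtt.setEmpty().append(pendingClassId);
          posIncAtt.setPositionIncrement(0);
          pendingClassId = null;
          return true;
        }
        if (!input.incrementToken()) {
          return false;
        }
        String classId = service.getClassId(termAtt.toString());
        if (classId != null) {
          pendingClassId = classId;   // remember it and emit it on the next call
          savedState = captureState();
        }
        return true;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        pendingClassId = null;
        savedState = null;
      }
    }

At query time the analyzer would map the user's terms to the same class IDs (or the application layer could expand the query, as Jack suggests), so a single indexed term stands in for the whole equivalence class and the remote lookups happen only once, at index time.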
