With index-time synonyms you bloat the index with a lot of new postings, most of which are just duplicates of each other. And in my case, because every synonym carries a weight, I cannot even consider postings deduplication...

There's a tradeoff here (as usual). Both approaches have pros and cons. I think index time is better in the end, because a larger index can be solved by throwing more hardware at it, but queries with thousands of terms are a real issue. One thing I can look at is whether the synonym sets can be "grouped", so that instead of all the terms I index a group ID or something, and then during search I resolve a term to all the groups it may belong to... I'll need to think about it more.
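A minimal sketch of that grouping idea, assuming a hypothetical GroupLookup service (the interface and the "syn:42" group-ID term convention are invented here; nothing in Lucene or Solr provides them):

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/**
 * Sketch: instead of injecting every synonym of a term, inject the IDs of
 * the synonym groups the term belongs to. Running the same filter at query
 * time makes a query term match any document that contains another member
 * of one of its groups.
 */
public final class GroupIdSynonymFilter extends TokenFilter {

  /** Hypothetical lookup service; not part of Lucene. */
  public interface GroupLookup {
    List<String> getGroupIds(String term); // e.g. "syn:42"
  }

  private final GroupLookup lookup;
  private final Deque<String> pendingGroups = new ArrayDeque<String>();
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);
  private State savedState;

  public GroupIdSynonymFilter(TokenStream input, GroupLookup lookup) {
    super(input);
    this.lookup = lookup;
  }

  @Override
  public boolean incrementToken() throws IOException {
    // Emit group IDs queued for the previous token, stacked at the same
    // position (posInc = 0) so they behave like synonyms of that token.
    if (!pendingGroups.isEmpty()) {
      restoreState(savedState);
      termAtt.setEmpty().append(pendingGroups.pop());
      posIncAtt.setPositionIncrement(0);
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    List<String> groups = lookup.getGroupIds(termAtt.toString());
    if (!groups.isEmpty()) {
      pendingGroups.addAll(groups);
      savedState = captureState();
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pendingGroups.clear();
    savedState = null;
  }
}

That still leaves the per-synonym weights open (payloads on the group-ID tokens, perhaps).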
On Jul 18, 2013 7:49 PM, "Walter Underwood" <[email protected]> wrote:

> There are two serious issues with query-time synonyms: speed and
> correctness.
>
> 1. Expanding a term to 1000 synonyms at query time means 1000 term
> lookups. This will not be fast. Expanding the term at index time means
> 1000 posting list entries, but only one term lookup at query time.
>
> 2. Query-time expansion will give higher scores to the rarer synonyms.
> This is almost never what you want. If I make "TV" and "television"
> synonyms, I want them both to score the same. But if TV is 10X more
> common than television, then documents with the rare term (television)
> will score better.
>
> wunder
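(To put rough numbers on point 2: Lucene's classic DefaultSimilarity computes idf = 1 + ln(numDocs / (docFreq + 1)), so the rarer synonym always gets the larger weight. With made-up counts:

public class IdfExample {
  public static void main(String[] args) {
    // Classic Lucene idf: 1 + ln(numDocs / (docFreq + 1)).
    // Hypothetical counts: 1,000,000 docs; "tv" in 100,000, "television" in 10,000.
    double idfTv  = 1 + Math.log(1000000.0 / (100000 + 1)); // ~3.30
    double idfTel = 1 + Math.log(1000000.0 / (10000 + 1));  // ~5.61
    System.out.printf("tv=%.2f television=%.2f ratio^2=%.1f%n",
        idfTv, idfTel, Math.pow(idfTel / idfTv, 2));        // ~2.9
  }
}

Since idf appears squared in the practical scoring function, a document matching only "television" scores roughly 2.9x higher on that term.)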
> On Jul 18, 2013, at 5:54 AM, Shai Erera wrote:
>
> The examples I've seen so far are single words. But I learned today
> something new: the number of "synonyms" returned for a word may be in
> the range of hundreds, sometimes even thousands. So I'm not sure
> query-time synonyms can work at all... what do you think?
>
> Shai
>
> On Thu, Jul 18, 2013 at 3:21 PM, Jack Krupansky
> <[email protected]> wrote:
>
>> Your best bet is to preprocess queries and expand synonyms in your own
>> application layer. The Lucene/Solr synonym implementation, design, and
>> architecture is fairly lightweight (although FST is a big improvement)
>> and not architected for large and dynamic synonym sets.
>>
>> Do you need multi-word phrase synonyms as well, or is this strictly
>> single-word synonyms?
>>
>> -- Jack Krupansky
>>
>> *From:* Shai Erera <[email protected]>
>> *Sent:* Thursday, July 18, 2013 1:36 AM
>> *To:* [email protected]
>> *Subject:* Programmatic Synonyms Filter (Lucene and/or Solr)
>>
>> Hi,
>>
>> I was asked to integrate with a system which provides synonyms for
>> words through an API. I checked the existing synonym filters in Lucene
>> and Solr, and they all seem to take a synonyms map up front.
>>
>> E.g., Lucene's SynonymFilter takes a SynonymMap which exposes an FST,
>> so it's not really programmatic in the sense that I can provide an
>> impl which pulls the synonyms through the other system's API.
>>
>> Solr's SynonymFilterFactory just loads the synonyms from a file into a
>> SynonymMap, and then uses Lucene's SynonymFilter, so it doesn't look
>> like I can extend that one either.
>>
>> The problem is that the synonyms DB I should integrate with is HUGE
>> and will probably not fit in RAM (SynonymMap). Nor is it currently
>> possible to pull all available synonyms from it in one go. The API I
>> have is something like String[] getSynonyms(String word).
>>
>> So I have a few questions:
>>
>> 1) Did I miss a Filter which does take a programmatic syn-map that I
>> can provide my own impl to?
>>
>> 2) If not, would it make sense to modify SynonymMap to offer a
>> getSynonyms(word) API (using BytesRef / CharsRef of course), with an
>> FSTSynonymMap default impl, so that users can provide their own impl,
>> e.g. one not requiring everything to be in RAM?
>>
>> 2.1) A side-effect benefit, I think, is that we won't require everyone
>> to deal with the FST API that way, though I'll admit I cannot think of
>> many use cases for not using SynonymFilter as-is ...
>>
>> 3) If the answer to (1) and (2) is NO, I guess my only option is to
>> implement my own SynonymFilter, copying most of the code from
>> Lucene's ... right?
>>
>> Shai
>
> --
> Walter Underwood
> [email protected]
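For comparison, the application-layer expansion Jack suggests requires no custom TokenFilter at all. A minimal sketch, where SynonymService stands in for the external getSynonyms(word) API described above and the BooleanQuery usage is Lucene 4.x style:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryTimeSynonymExpander {

  /** Stands in for the external synonyms API; not a Lucene interface. */
  public interface SynonymService {
    String[] getSynonyms(String word);
  }

  private final SynonymService service;

  public QueryTimeSynonymExpander(SynonymService service) {
    this.service = service;
  }

  /** Expands one term into a disjunction of the term and its synonyms. */
  public Query expand(String field, String word) {
    BooleanQuery q = new BooleanQuery(); // later Lucene versions use BooleanQuery.Builder
    q.add(new TermQuery(new Term(field, word)), Occur.SHOULD);
    for (String syn : service.getSynonyms(word)) {
      q.add(new TermQuery(new Term(field, syn)), Occur.SHOULD);
    }
    return q;
  }
}

Walter's speed and scoring caveats apply to this route, of course, and with thousands of synonyms per term the expanded query can also hit BooleanQuery's default 1024-clause limit.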
