With index-time synonyms you bloat the index with a lot of new postings,
most of which are just duplicates of each other. And in my case, because
every synonym carries a weight, I cannot even consider postings
deduplication...
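A minimal sketch of what I mean by weighted expansion (the names, weights, and synonym source below are all made up for illustration): each synonym is emitted at the token's position with its own weight, so two postings for the same position are still distinct and cannot simply be merged.

```python
# Toy stand-in for the external weighted-synonym API (made up for this sketch).
SYNS = {"tv": [("television", 0.9), ("telly", 0.4)]}

def expand_for_indexing(token, get_weighted_synonyms):
    """Yield (term, weight) pairs to index at the token's position.

    Because every synonym keeps its own weight, each pair needs its own
    posting (or payload) -- this is why deduplication is off the table."""
    yield (token, 1.0)  # the original term, full weight
    for syn, weight in get_weighted_synonyms(token):
        yield (syn, weight)

print(list(expand_for_indexing("tv", lambda t: SYNS.get(t, []))))
# [('tv', 1.0), ('television', 0.9), ('telly', 0.4)]
```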

There's a tradeoff here (as usual); both approaches have pros and cons. I
think index time is better in the end, because a larger index is a problem
you can solve by throwing more hardware at it. But queries with thousands
of terms are a real issue.
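To make that cost concrete, here is a toy sketch (the synonym source is faked) of how a short query balloons under query-time expansion: every term becomes an OR over all of its synonyms, and each clause is another term lookup.

```python
def expand_query(terms, get_synonyms):
    """Expand each query term into a clause of (term OR syn1 OR syn2 ...)."""
    return [[term] + list(get_synonyms(term)) for term in terms]

# Fake API returning 1000 synonyms per word, in the range mentioned below.
fake_syns = lambda t: [f"{t}_syn{i}" for i in range(1000)]

clauses = expand_query(["quick", "brown", "fox"], fake_syns)
# A 3-word query turns into 3 * (1 + 1000) = 3003 terms to look up.
print(sum(len(clause) for clause in clauses))  # 3003
```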

One thing I can look at is whether the synonym sets can be 'grouped', so
that instead of all the terms I index a group ID or something, and then
during search I resolve a term to all the groups it may belong to... I'll
need to think about it more.
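Roughly this (group names and synonym sets below are made up): assign each synonym set a group ID, index the group IDs instead of the expanded terms, and resolve a query term to all the groups it belongs to.

```python
# Toy stand-in for synonym sets coming from the external API.
GROUPS = {
    "g1": {"tv", "television", "telly"},
    "g2": {"car", "automobile"},
}

# Inverted lookup: term -> the group IDs it belongs to.
TERM_TO_GROUPS = {}
for gid, terms in GROUPS.items():
    for term in terms:
        TERM_TO_GROUPS.setdefault(term, set()).add(gid)

def resolve(token):
    """Map a token to its group IDs (used at both index and query time).

    Terms with no synonyms pass through unchanged. The index stores one
    posting per group instead of one per synonym, and the query stays
    small: one clause per group rather than thousands of terms."""
    return sorted(TERM_TO_GROUPS.get(token, {token}))

print(resolve("television"))  # ['g1'] -- same token as 'tv' at search time
print(resolve("fox"))         # ['fox'] -- no synonyms, passes through
```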
On Jul 18, 2013 7:49 PM, "Walter Underwood" <[email protected]> wrote:

> There are two serious issues with query-time synonyms, speed and
> correctness.
>
> 1. Expanding a term to 1000 synonyms at query time means 1000 term
> lookups. This will not be fast. Expanding the term at index time means 1000
> posting list entries, but only one term lookup at query time.
>
> 2. Query time expansion will give higher scores to the more rare synonyms.
> This is almost never what you want. If I make "TV" and "television"
> synonyms, I want them both to score the same. But if TV is 10X more common
> than television, then documents with the rare term (television) will score
> better.
>
> wunder
>
> On Jul 18, 2013, at 5:54 AM, Shai Erera wrote:
>
> The examples I've seen so far are single words. But I learned something
> new today .. the number of "synonyms" returned for a word may be in the
> range of hundreds, sometimes even thousands.
> So I'm not sure query-time synonyms will work at all .. what do you think?
>
> Shai
>
>
> On Thu, Jul 18, 2013 at 3:21 PM, Jack Krupansky
> <[email protected]> wrote:
>
>>   Your best bet is to preprocess queries and expand synonyms in your own
>> application layer. The Lucene/Solr synonym implementation, design, and
>> architecture is fairly lightweight (although FST is a big improvement) and
>> not architected for large and dynamic synonym sets.
>>
>> Do you need multi-word phrase synonyms as well, or is this strictly
>> single-word synonyms?
>>
>> -- Jack Krupansky
>>
>>  *From:* Shai Erera <[email protected]>
>> *Sent:* Thursday, July 18, 2013 1:36 AM
>> *To:* [email protected]
>> *Subject:* Programmatic Synonyms Filter (Lucene and/or Solr)
>>
>>     Hi
>>
>> I was asked to integrate with a system which provides synonyms for words
>> through API. I checked the existing synonym filters in Lucene and Solr and
>> they all seem to take a synonyms map up front.
>>
>> E.g. Lucene's SynonymFilter takes a SynonymMap which exposes an FST, so
>> it's not really programmatic in the sense that I can provide an impl which
>> will pull the synonyms through the other system's API.
>>
>> Solr SynonymFilterFactory just loads the synonyms from a file into a
>> SynonymMap, and then uses Lucene's SynonymFilter, so it doesn't look like I
>> can extend that one either.
>>
>> The problem is that the synonyms DB I should integrate with is HUGE and
>> will probably not fit in RAM (SynonymMap). Nor is it currently possible to
>> pull all available synonyms from it in one go. The API I have is something
>> like String[] getSynonyms(String word).
>>
>> So I have a few questions:
>>
>> 1) Did I miss a Filter which does take a programmatic syn-map which I can
>> provide my own impl to?
>>
>> 2) If not, would it make sense to modify SynonymMap to offer
>> getSynonyms(word) API (using BytesRef / CharsRef of course), with an
>> FSTSynonymMap default impl so that users can provide their own impl, e.g.
>> not requiring everything to be in RAM?
>>
>> 2.1) Side-effect benefit, I think, is that we won't require everyone to
>> deal with the FST API that way, though I'll admit I cannot think of many use
>> cases for not using SynonymFilter as-is ...
>>
>> 3) If the answer to (1) and (2) is NO, I guess my only option is to
>> implement my own SynonymFilter, copying most of the code from Lucene's ...
>> right?
>>
>> Shai
>>
>
>
> --
> Walter Underwood
> [email protected]
>
>
>
>
