Hi Shai,

We have a setup where we annotate terms in documents with concepts. Each 
concept has a number of synonyms. During indexing we annotate terms in the 
document (can be multi-word) with the concept ID at the same offset as the 
matched term. For us, we just need to place a single additional term at the 
same offset, but you could extend this to placing all the synonyms your service 
returns instead. I got the idea for this from the Synonym filter described in 
the LIA2 book.
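
A minimal sketch of the idea, in plain Java rather than against the Lucene 
API, so it can be read on its own. The names Tok and ConceptAnnotator, the 
dictionary contents, and the concept ID format are all made up for 
illustration. A position increment of 0 stands for "same position as the 
previous token", which is how a Lucene TokenFilter would stack the extra term 
on top of the matched one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** A token and its position increment; 0 means "same position as the previous token". */
final class Tok {
    final String text;
    final int posInc;

    Tok(String text, int posInc) {
        this.text = text;
        this.posInc = posInc;
    }

    @Override
    public String toString() {
        return text + "(+" + posInc + ")";
    }
}

/** Hypothetical annotator: stacks a concept ID on the first word of each matched phrase. */
final class ConceptAnnotator {
    private final Map<String, String> conceptIds; // lower-cased phrase -> concept ID
    private final int maxWords;                   // longest phrase (in words) to try

    ConceptAnnotator(Map<String, String> conceptIds, int maxWords) {
        this.conceptIds = conceptIds;
        this.maxWords = maxWords;
    }

    List<Tok> annotate(List<String> words) {
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            out.add(new Tok(words.get(i), 1));
            // Try the longest phrase starting here first, so a multi-word
            // match wins over a shorter one beginning at the same word.
            for (int len = Math.min(maxWords, words.size() - i); len >= 1; len--) {
                String phrase = String.join(" ", words.subList(i, i + len))
                        .toLowerCase(Locale.ROOT);
                String id = conceptIds.get(phrase);
                if (id != null) {
                    out.add(new Tok(id, 0)); // extra term at the same position
                    break;
                }
            }
        }
        return out;
    }
}
```

With a dictionary that maps "heart attack" to a concept ID, annotating the 
words "Acute heart attack risk" keeps the four original tokens and adds the 
concept ID at the position of "heart". To index all synonyms instead, you 
would emit one zero-increment token per synonym at that point.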

The advantage of this is that you don't run into either the URL size limit or 
the boolean clauses limit and, more importantly, you factor in the value of 
context, since you are dealing with full sentences rather than isolated terms 
when deciding on synonyms. For example, the word "lark" could be a bird or a 
prank depending on the context.
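
To make the context point concrete, here is a deliberately naive sketch. It 
is not how our annotator works; the class name, the concept IDs, and the cue 
words are all hypothetical. It picks the sense of an ambiguous term by 
counting how many cue words for each sense appear in the same sentence, and 
declines to annotate when there is no evidence either way.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical cue-word disambiguator for terms with more than one concept. */
final class ContextDisambiguator {
    // lower-cased term -> (concept ID -> lower-cased cue words for that sense)
    private final Map<String, Map<String, Set<String>>> cuesByTerm;

    ContextDisambiguator(Map<String, Map<String, Set<String>>> cuesByTerm) {
        this.cuesByTerm = cuesByTerm;
    }

    /** Returns the concept ID whose cue words overlap the sentence most, if any overlap at all. */
    Optional<String> resolve(String term, List<String> sentence) {
        Map<String, Set<String>> senses = cuesByTerm.get(term.toLowerCase(Locale.ROOT));
        if (senses == null) {
            return Optional.empty(); // term is not in the dictionary
        }
        String best = null;
        int bestScore = 0;
        for (Map.Entry<String, Set<String>> sense : senses.entrySet()) {
            int score = 0;
            for (String word : sentence) {
                if (sense.getValue().contains(word.toLowerCase(Locale.ROOT))) {
                    score++;
                }
            }
            if (score > bestScore) { // strict ">" so a zero score never wins
                bestScore = score;
                best = sense.getKey();
            }
        }
        return Optional.ofNullable(best);
    }
}
```

A term-at-a-time call such as getSynonyms("lark") never sees the rest of the 
sentence, so it cannot make this choice at all.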

-sujit

On Jul 18, 2013, at 6:12 AM, Jack Krupansky wrote:

> Maybe a custom search component would be in order, to “enrich” the incoming 
> query. Again, preprocessing the query for synonym expansion before Solr 
> parses it. It could call the external synonym API and cache synonyms as well.
>  
> But I’d still lean towards preprocessing in an application layer, although 
> for hundreds or thousands of synonyms it would probably hit the common 
> 2048-character URL limit in some containers, which would need to be raised.
> 
> -- Jack Krupansky
>  
> From: Shai Erera
> Sent: Thursday, July 18, 2013 8:54 AM
> To: [email protected]
> Subject: Re: Programmatic Synonyms Filter (Lucene and/or Solr)
>  
> The examples I've seen so far are single words. But I learned something new 
> today: the number of "synonyms" returned for a word may be in the range of 
> hundreds, sometimes even thousands.
> So I'm not sure query-time synonyms would work at all. What do you think?
> 
> Shai
> 
> 
> On Thu, Jul 18, 2013 at 3:21 PM, Jack Krupansky <[email protected]> 
> wrote:
> Your best bet is to preprocess queries and expand synonyms in your own 
> application layer. The Lucene/Solr synonym implementation, design, and 
> architecture are fairly lightweight (although FST is a big improvement) and 
> not architected for large and dynamic synonym sets.
>  
> Do you need multi-word phrase synonyms as well, or is this strictly 
> single-word synonyms?
> 
> -- Jack Krupansky
>  
> From: Shai Erera
> Sent: Thursday, July 18, 2013 1:36 AM
> To: [email protected]
> Subject: Programmatic Synonyms Filter (Lucene and/or Solr)
>  
> Hi
> 
> I was asked to integrate with a system which provides synonyms for words 
> through an API. I checked the existing synonym filters in Lucene and Solr and 
> they all seem to take a synonyms map up front. 
> 
> E.g. Lucene's SynonymFilter takes a SynonymMap which exposes an FST, so it's 
> not really programmatic in the sense that I can provide an impl which will 
> pull the synonyms through the other system's API.
> 
> Solr SynonymFilterFactory just loads the synonyms from a file into a 
> SynonymMap, and then uses Lucene's SynonymFilter, so it doesn't look like I 
> can extend that one either.
> 
> The problem is that the synonyms DB I should integrate with is HUGE and will 
> probably not fit in RAM (SynonymMap). Nor is it currently possible to pull 
> all available synonyms from it in one go. The API I have is something like 
> String[] getSynonyms(String word).
> 
> So I have a few questions:
> 
> 1) Did I miss a Filter which does take a programmatic syn-map which I can 
> provide my own impl to?
> 
> 2) If not, would it make sense to modify SynonymMap to offer a 
> getSynonyms(word) API (using BytesRef / CharsRef of course), with an 
> FSTSynonymMap default impl, so that users can provide their own impl, e.g. 
> one not requiring everything to be in RAM?
> 
> 2.1) A side-effect benefit, I think, is that we won't require everyone to 
> deal with the FST API that way, though I'll admit I cannot think of many use 
> cases for not using SynonymFilter as-is ...
>  
> 3) If the answer to (1) and (2) is NO, I guess my only option is to implement 
> my own SynonymFilter, copying most of the code from Lucene's ... right?
> 
> Shai
>  

