> > How much do you know about the frequency of synonym updates for this
> > synonym source API?
The "synonyms", or phrases alternatives, are built from a dynamic DB that is updated along with content updates. I suspect that after e.g. the initial crawl it will already contain most word alternatives, and then most likely to update at a very low rate. I think I'll go with index-time synonyms because I'm not sure running queries with even few hundred terms is going to perform well. And if the synonyms could be divided to Equivalence Classes, I'll index a class ID instead of all the words. Thanks guys for the comments! Shai On Thu, Jul 18, 2013 at 9:18 PM, SUJIT PAL <[email protected]> wrote: > Hi Shai, > > We have a setup where we annotate terms in documents with concepts. Each > concept has a number of synonyms. During indexing we annotate terms in the > document (can be multi-word) with the concept ID at the same offset as the > matched term. For us, we just need to place a single additional term at the > same offset, but you could extend this to placing all the synonyms your > service returns instead. I got the idea for this from the Synonym filter > described in the LIA2 book. > > Advantage of this is that you dont run into either the URL size limit or > the boolean clauses limit, and more importantly, you factor in the value of > context since you are dealing with full sentences rather than terms when > deciding synonyms - for example, the word "lark" could be a bird or a prank > depending on the context. > > -sujit > > On Jul 18, 2013, at 6:12 AM, Jack Krupansky wrote: > > > Maybe a custom search component would be in order, to “enrich” the > incoming query. Again, preprocessing the query for synonym expansion before > Solr parses it. It could call the external synonym API and cache synonyms > as well. > > > > But, I’d still lean towards preprocessing in an application layer. > Although, for hundreds or thousands of synonyms it would probably hit the > 2048 common limit for URLs in some containers, which would need to be > raised. > > > > -- Jack Krupansky > > > > From: Shai Erera > > Sent: Thursday, July 18, 2013 8:54 AM > > To: [email protected] > > Subject: Re: Programmatic Synonyms Filter (Lucene and/or Solr) > > > > The examples I've seen so far are single words. But I learned today > something new .. the number of "synonyms" returned for a word may be in the > range of hundreds, sometimes even thousands. > > So I'm not sure query-time synonyms may work at all .. what do you think? > > > > Shai > > > > > > On Thu, Jul 18, 2013 at 3:21 PM, Jack Krupansky <[email protected]> > wrote: > > Your best bet is to preprocess queries and expand synonyms in your own > application layer. The Lucene/Solr synonym implementation, design, and > architecture is fairly lightweight (although FST is a big improvement) and > not architected for large and dynamic synonym sets. > > > > Do you need multi-word phrase synonyms as well, or is this strictly > single-word synonyms? > > > > -- Jack Krupansky > > > > From: Shai Erera > > Sent: Thursday, July 18, 2013 1:36 AM > > To: [email protected] > > Subject: Programmatic Synonyms Filter (Lucene and/or Solr) > > > > Hi > > > > I was asked to integrate with a system which provides synonyms for words > through API. I checked the existing synonym filters in Lucene and Solr > and they all seem to take a synonyms map up front. > > > > E.g. Lucene's SynonymFilter takes a SynonymMap which exposes an FST, so > it's not really programmatic in the sense that I can provide an impl which > will pull the synonyms through the other system's API. 
> > Solr's SynonymFilterFactory just loads the synonyms from a file into a SynonymMap, and then uses Lucene's SynonymFilter, so it doesn't look like I can extend that one either.
> >
> > The problem is that the synonyms DB I should integrate with is HUGE and will probably not fit in RAM (SynonymMap). Nor is it currently possible to pull all available synonyms from it in one go. The API I have is something like String[] getSynonyms(String word).
> >
> > So I have a few questions:
> >
> > 1) Did I miss a Filter which does take a programmatic syn-map to which I can provide my own impl?
> >
> > 2) If not, would it make sense to modify SynonymMap to offer a getSynonyms(word) API (using BytesRef / CharsRef of course), with an FSTSynonymMap default impl, so that users can provide their own impl, e.g. one not requiring everything to be in RAM?
> >
> > 2.1) A side-effect benefit, I think, is that we won't require everyone to deal with the FST API that way, though I'll admit I cannot think of many use cases for not using SynonymFilter as-is ...
> >
> > 3) If the answer to (1) and (2) is NO, I guess my only option is to implement my own SynonymFilter, copying most of the code from Lucene's ... right?
> >
> > Shai
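For illustration, here is a rough sketch of the index-time direction discussed in this thread (Sujit's concept annotation and Shai's equivalence-class idea, i.e. essentially option 3): a custom TokenFilter that asks the external service for a class ID and injects it at the same position as the original token. SynonymService and getClassId are hypothetical stand-ins for the remote API, and the Lucene 4.x TokenStream API is assumed.

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.util.AttributeSource;

    /**
     * Sketch only: injects the equivalence-class ID returned by an external
     * synonym service as an extra token at the same position as the original
     * term. SynonymService / getClassId are hypothetical.
     */
    public final class ClassIdInjectFilter extends TokenFilter {

      /** Hypothetical wrapper around the remote synonyms API. */
      public interface SynonymService {
        /** Returns an equivalence-class ID for the word, or null if there is none. */
        String getClassId(String word);
      }

      private final SynonymService service;
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);

      private String pendingClassId;            // class ID waiting to be emitted
      private AttributeSource.State savedState; // attributes of the token it belongs to

      public ClassIdInjectFilter(TokenStream input, SynonymService service) {
        super(input);
        this.service = service;
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (pendingClassId != null) {
          // Emit the class ID at the same position/offsets as the original token.
          restoreState(savedState);
          termAtt.setEmpty().append(pendingClassId);
          posIncAtt.setPositionIncrement(0);
          pendingClassId = null;
          return true;
        }
        if (!input.incrementToken()) {
          return false;
        }
        String classId = service.getClassId(termAtt.toString());
        if (classId != null) {
          pendingClassId = classId;   // remember it and emit it on the next call
          savedState = captureState();
        }
        return true;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        pendingClassId = null;
        savedState = null;
      }
    }

At query time the analyzer would map the user's terms to the same class IDs (or the application layer could expand the query, as Jack suggests), so a single indexed term stands in for the whole equivalence class and the remote lookups happen only once, at index time.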
