Re: multi word synonyms

Paul Elschot Sun, 18 May 2008 10:18:23 -0700

On Sunday 18 May 2008 16:30:26, Karl Wettin wrote:
> On 18 May 2008 at 00:01, Paul Elschot wrote:
> > On Saturday 17 May 2008 20:28:40, Karl Wettin wrote:
> >> As far as I know Lucene only handles single word synonyms at index
> >> time. My life would be much simpler if it was possible to add
> >> synonyms that spanned over multiple tokens, such as "lucene in
> >> action"="lia". I have a couple of workarounds that are OK but it
> >> really isn't the same thing when it comes down to the scoring.
> >
> > The simplest solution is to index such synonyms at the first or
> > last or middle position of the source tokens, using a zero position
> > increment for the synonym. Was this one of the workarounds?
>
> I get sloppyFreq problems with that.
>
> > The advantage of the zero position increment is that the original
> > token positions are not affected, so at least there is no influence
> > on scoring because of changes in the original token positions.
>
> I copy a number of fields to a single one. Each such field can be
> represented in a number of languages or aliases in the same language.
>
> [a, b, c, d, e, f], [g, h, i],    [j, k, l, m]
>                      [o, p]        [u, v]
>                      [q, r, s, t]
>
> It would be great if the phrase query on [f, o, p, u, v] could yield
> a 0 distance.
>
> If I'd been using the same synonyms for the same phrases in all
> documents at all times the edit distance would be static when
> scoring, but I don't.
>
> The terms of these synonyms are not really compatible with each
> other. For instance [f, g, s, t, j] should not be allowed or at least
> be heavily penalised compared to [f, o, p, j].
>
> Searching a combination of languages should be allowed but preferably
> only one per field copied to the big field. (Disjunction is not
> applicable.)
>
> It is OK the way I have it running now, but more dimensions as
> described above really increase the score quality. I confirmed that
> using permutations of documents and filtering out the "duplicates".
> Now I'm thinking it could be solved using token payloads and a brand
> new MultiDimensionalSpanQuery. Not too different from what you
> suggested way back in
> http://www.nabble.com/Using-Lucene-for-searching-tokens%2C-not-storing-them.-to3918462.html#a3944016


That would mean a tag extending each term, to indicate that the term
is on an alternative path?

>
> There are some other issues too, but I'm not at liberty to disclose
> too much. I hope it still makes sense?

Yes. I suppose the payload would indicate how much the alternative
path length differs from the original path?
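
To make the zero position increment approach concrete, here is a minimal
sketch in plain Java. It does not use Lucene classes: Tok, positions and
distance are hypothetical names for illustration only, and the trailing
token "rocks" is invented. The code only shows how positions follow from
position increments, and why a multi word synonym indexed at the first
source position ends up further from the next token than the phrase it
replaces, which is presumably the kind of effect behind the sloppyFreq
problem.

```java
import java.util.ArrayList;
import java.util.List;

public class SynonymPositions {
    // Hypothetical stand-in for an analyzed token: term text plus position increment.
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    // Resolve absolute positions: each token's position is the previous
    // position plus its increment, so an increment of 0 stacks the token
    // on the same position as the one before it.
    static List<String> positions(List<Tok> stream) {
        List<String> out = new ArrayList<>();
        int pos = -1;
        for (Tok t : stream) {
            pos += t.posInc;
            out.add(t.term + "@" + pos);
        }
        return out;
    }

    // Position distance between the first occurrences of two terms,
    // or -1 if either term is missing from the stream.
    static int distance(List<Tok> stream, String a, String b) {
        int pos = -1, pa = -1, pb = -1;
        for (Tok t : stream) {
            pos += t.posInc;
            if (pa < 0 && t.term.equals(a)) pa = pos;
            if (pb < 0 && t.term.equals(b)) pb = pos;
        }
        return (pa < 0 || pb < 0) ? -1 : Math.abs(pb - pa);
    }

    public static void main(String[] args) {
        // "lucene in action" with the synonym "lia" at the first position.
        List<Tok> stream = List.of(
            new Tok("lucene", 1),
            new Tok("lia", 0),      // zero increment: same position as "lucene"
            new Tok("in", 1),
            new Tok("action", 1),
            new Tok("rocks", 1));   // invented following token
        System.out.println(positions(stream));
        System.out.println(distance(stream, "action", "rocks")); // adjacent
        System.out.println(distance(stream, "lia", "rocks"));    // not adjacent
    }
}
```

The original positions are unchanged by the synonym, but "lia" sits three
positions before "rocks" while "action" sits one position before it, so a
phrase query through the synonym needs extra slop.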

In case you can't disclose more, no answer would of course be ok, too.

Regards,
Paul Elschot

