Re: SynonymFilterFactory deprecated since 6.4.0

Michael McCandless Tue, 14 Feb 2017 08:19:04 -0800

Here's the new blog post I mentioned earlier in the thread, trying to
explain the recent changes to make multi-token synonyms work ... it
just went out today:
https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch


Mike McCandless

http://blog.mikemccandless.com


On Tue, Feb 14, 2017 at 6:19 AM, Michael McCandless
<luc...@mikemccandless.com> wrote:
> Hi Bernd,
>
> Actually, pos (which is just the accumulation of
> PositionIncrementAttribute, starting with -1) is the *start* node.
>
> The end node is then pos + PositionLengthAttribute.
>
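To make the node arithmetic concrete, here is a minimal sketch in plain Python. It is not the Lucene API: the Token type and to_arcs helper are hypothetical names, and the position increments are an assumption inferred from the "pos:" values in the test-case output quoted further down in this thread.

```python
from typing import NamedTuple


class Token(NamedTuple):
    term: str
    pos_inc: int  # stands in for PositionIncrementAttribute
    pos_len: int  # stands in for PositionLengthAttribute


def to_arcs(tokens):
    """Return (term, start_node, end_node) for each token."""
    pos = -1  # the accumulation starts at -1
    arcs = []
    for tok in tokens:
        pos += tok.pos_inc                               # pos is the start node
        arcs.append((tok.term, pos, pos + tok.pos_len))  # end = pos + posLen
    return arcs


# Assumed increments/lengths, chosen to match the quoted test-case output.
TOKENS = [
    Token("naturwald", 1, 4),
    Token("forêt", 0, 1),
    Token("natürlicher", 0, 2),
    Token("natural", 0, 3),
    Token("naturelle", 1, 3),
    Token("wald", 1, 2),
    Token("forest", 1, 1),
]

ARCS = to_arcs(TOKENS)
for term, start, end in ARCS:
    print(f"{term}: {start}-{end}")
```

With these inputs the printed start-end pairs agree with the "pos:" lines of the quoted test case (for example naturwald 0-4 and forest 3-4).
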
> As far as I know, ShingleFilter is not yet graph friendly: it does not
> set PositionLengthAttribute.  But you could visualize how it should be
> setting it ...
>
> Also, note that synonym filter cannot handle an incoming graph
> properly, so if you run ShingleFilter before it, it's not going to do
> the right thing.  For that we need something like
> https://issues.apache.org/jira/browse/LUCENE-5012 ... the branch for
> that issue already has a SynonymFilter that accepts incoming graphs,
> but it's a biggish change.
>
> The docs are indeed out-dated; I'll repair them.  Thank you!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Feb 14, 2017 at 2:41 AM, Bernd Fehling
> <bernd.fehl...@uni-bielefeld.de> wrote:
>> Now it is getting more clear.
>>
>> "pos" (aka position) starts at "-1" and its highest number is the
>> last "node id" of the graph.
>>
>> "pos" minus "positionLength" is the starting "node id" of the arc.
>>
>> Is the tokenStream after each filter always a valid graph?
>>
>> E.g. ShingleFilter with query "natural forest":
>> SF        text start  end  positionLength  type    position
>>        natural 0      7    1               word    1
>> natural forest 0      14   2               shingle 1
>>         forest 8      14   1               word    2
>>
>> (0)--- natural --->(1)--- forest --->(2)
>> But how to insert the shingle into this graph?
>>
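One way to picture the shingle in this graph is as a single arc that spans both words. The sketch below is plain Python, not Lucene code, and it assumes a hypothetical graph-aware shingle filter that sets the position length to the number of words the shingle covers (per the reply above, ShingleFilter does not yet do this).

```python
def shingle_arcs(words, size=2):
    """Return unigram arcs plus shingle arcs as (term, start_node, end_node)."""
    arcs = []
    for i, word in enumerate(words):
        arcs.append((word, i, i + 1))  # each single word spans one position
        if i + size <= len(words):
            # the shingle starts at the same node and spans `size` positions
            arcs.append((" ".join(words[i:i + size]), i, i + size))
    return arcs


SHINGLE_ARCS = shingle_arcs(["natural", "forest"])
for term, start, end in SHINGLE_ARCS:
    print(f"({start})--- {term} --->({end})")
```

Under that assumption the shingle "natural forest" becomes an arc from node 0 to node 2, running parallel to the two single-word arcs.
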
>> This is why I added a SynonymPreFilter to correct the graph between
>> ShingleFilter and SynonymGraphFilter. But I had the wrong understanding
>> of pos, positionIncrement, positionLength,...
>>
>>
>> Another question, the API docs say "...Injecting synonyms – here,
>> synonyms of a token should be added after that token..."
>> But as I already mentioned the synonyms are added before the token.
>> Are the docs outdated?
>>
>>
>> Regards
>> Bernd
>>
>>
>> On 13.02.2017 at 17:31, Michael McCandless wrote:
>>> On Mon, Feb 13, 2017 at 9:04 AM, Bernd Fehling
>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>
>>>> Am I confused by the naming of pos, positionIncrement, offset,
>>>> positionLength, start and end between Lucene and Solr?
>>>
>>> "pos" is just accumulating the positionIncrement values, starting from
>>> -1.  I don't think Solr's analysis UI would change the meaning of
>>> these attributes.
>>>
>>>> OK, the SynonymGraphFilter is ONLY for Lucene, right?
>>>
>>> No, it's also for Solr and Elasticsearch and any other search servers
>>> on top of Lucene as well.
>>>
>>>> But how are you going to build the multi-word synonym query
>>>> "natürlicher wald" from "natural forest"?
>>>
>>> Lucene's and Elasticsearch's query parsers have already been fixed to
>>> correctly handle token graphs by default; Solr has a fork of Lucene's
>>> query parser I think ... I'm not sure if it's been fixed yet to
>>> interpret graphs.
>>>
>>> See e.g. https://issues.apache.org/jira/browse/LUCENE-7603 and
>>> https://issues.apache.org/jira/browse/LUCENE-7638
>>>
>>>> And how are you going to highlight a synonym hit for "natürlicher wald"
>>>> when start and end is set to 0-14 and not to 0-18?
>>>> Or is start and end not used for highlighting?
>>>
>>> This start/end offset, at query time, is not normally used.  If you
>>> have a document in the index that has "natürlicher wald" then it would
>>> have offsets X to X+18, stored in the index ideally as postings
>>> offsets, and should highlight correctly?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>> On 13.02.2017 at 14:24, Michael McCandless wrote:
>>>>> Unfortunately, I cannot reproduce the problem with a straight Lucene
>>>>> test case.  I added this test case to TestSynonymGraphFilter.java:
>>>>>
>>>>>     https://gist.github.com/mikemccand/318459ca507742052688e2fe800a10dd
>>>>>
>>>>> And when I run it, it produces the correct token graph:
>>>>>
>>>>> TOKEN: naturwald
>>>>>   offset: 0-14
>>>>>   pos: 0-4
>>>>>   type: SYNONYM
>>>>>
>>>>> TOKEN: forêt
>>>>>   offset: 0-14
>>>>>   pos: 0-1
>>>>>   type: SYNONYM
>>>>>
>>>>> TOKEN: natürlicher
>>>>>   offset: 0-14
>>>>>   pos: 0-2
>>>>>   type: SYNONYM
>>>>>
>>>>> TOKEN: natural
>>>>>   offset: 0-7
>>>>>   pos: 0-3
>>>>>   type: word
>>>>>
>>>>> TOKEN: naturelle
>>>>>   offset: 0-14
>>>>>   pos: 1-4
>>>>>   type: SYNONYM
>>>>>
>>>>> TOKEN: wald
>>>>>   offset: 0-14
>>>>>   pos: 2-4
>>>>>   type: SYNONYM
>>>>>
>>>>> TOKEN: forest
>>>>>   offset: 8-14
>>>>>   pos: 3-4
>>>>>   type: word
>>>>>
>>>>> Remember that the "pos: " output above is really "node IDs" and you
>>>>> can see the inserted side paths are correct.  The offsets are
>>>>> necessarily always 0-14 for inserted tokens because that is the span
>>>>> of the two original tokens.
>>>>>
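A quick way to check that the side paths are correct is to walk every path from the first node to the last and read off the terms. This is a minimal sketch in plain Python, not Lucene code; the arcs are copied from the "pos:" output above and the all_paths helper is a hypothetical name.

```python
from collections import defaultdict

# (term, start_node, end_node), copied from the "pos:" output above
ARCS = [
    ("naturwald", 0, 4),
    ("forêt", 0, 1),
    ("natürlicher", 0, 2),
    ("natural", 0, 3),
    ("naturelle", 1, 4),
    ("wald", 2, 4),
    ("forest", 3, 4),
]


def all_paths(arcs, start, end):
    """Enumerate every term sequence leading from node `start` to node `end`."""
    outgoing = defaultdict(list)
    for term, src, dst in arcs:
        outgoing[src].append((term, dst))

    paths = []

    def walk(node, terms):
        if node == end:
            paths.append(" ".join(terms))
            return
        for term, dst in outgoing[node]:
            walk(dst, terms + [term])

    walk(start, [])
    return paths


PATHS = all_paths(ARCS, 0, 4)
for path in PATHS:
    print(path)
```

The four paths it prints are exactly the four alternatives in the one-line synonym rule quoted further down in the thread: naturwald, forêt naturelle, natürlicher wald and natural forest.
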
>>>>> Can you try removing the SPF filters in your test?  Or otherwise
>>>>> simplify your test so it's closer to what my test case is doing?
>>>>>
>>>>> Mike McCandless
>>>>>
>>>>> http://blog.mikemccandless.com
>>>>>
>>>>> On Mon, Feb 13, 2017 at 7:52 AM, Michael McCandless
>>>>> <luc...@mikemccandless.com> wrote:
>>>>>> Thanks Bernd; I'll see if I can make a test case from this.
>>>>>>
>>>>>> Mike McCandless
>>>>>>
>>>>>> http://blog.mikemccandless.com
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 13, 2017 at 5:00 AM, Bernd Fehling
>>>>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>>>>> My very simple and small sysonym_test.txt has only one line:
>>>>>>> naturwald, natural\ forest, forêt\ naturelle, natürlicher\ wald
>>>>>>>
>>>>>>> If I only use WT (WhitespaceTokenizer) and SGF (with
>>>>>>> WhitespaceTokenizer) the result is:
>>>>>>>
>>>>>>> WT      text     start  end  positionLength  type  position
>>>>>>>      natural     0      7    1               word  1
>>>>>>>       forest     8      14   1               word  2
>>>>>>>
>>>>>>> SGF     text     start  end  positionLength  type     position
>>>>>>>      natural     0      7    3               word     1
>>>>>>>    naturelle     0      14   3               SYNONYM  2
>>>>>>>         wald     0      14   2               SYNONYM  3
>>>>>>>    naturwald     0      14   4               SYNONYM  1
>>>>>>>        forêt     0      14   1               SYNONYM  1
>>>>>>>  natürlicher     0      14   2               SYNONYM  1
>>>>>>>
>>>>>>>       forest     8      14   1               word     4
>>>>>>>
>>>>>>> The result is some kind of rubbish.
>>>>>>> Also note the empty line between "natürlicher" and "forest".
>>>>>>>
>>>>>>> Anything else I should try, maybe with KeywordTokenizer?
>>>>>>>
>>>>>>> p.s. You might have noticed the SPF filters in my setup.
>>>>>>>      First is SynonymPreFilter to set all attributes to the right value,
>>>>>>>      second is SynonymPostFilter to again fix all attribute settings but
>>>>>>>      also set multi-word synonyms as phrase and also cleanup the result
>>>>>>>      of SGF.
>>>>>>>
>>>>>>> Regards
>>>>>>> Bernd
>>>>>>>
>>>>>>> On 11.02.2017 at 00:45, Michael McCandless wrote:
>>>>>>>> Yeah, those tokens should have position length 2.
>>>>>>>>
>>>>>>>> Can you reduce to a small set of synonyms and text?  If you use only
>>>>>>>> whitespace tokenizer and SGF does the issue reproduce?
>>>>>>>>
>>>>>>>> Mike McCandless
>>>>>>>>
>>>>>>>> http://blog.mikemccandless.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Feb 10, 2017 at 10:07 AM, Bernd Fehling
>>>>>>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>>>>>>> Example for position end and positionLength of SGF.
>>>>>>>>>
>>>>>>>>> query: natural forest
>>>>>>>>>
>>>>>>>>> WT      text     start  end  positionLength  type  position
>>>>>>>>>         natural  0      7    1               word  1
>>>>>>>>>         forest   8      14   1               word  2
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>> SPF     text     start  end  positionLength  type     position
>>>>>>>>>         natural  0      7    1               word     1
>>>>>>>>>  natural forest  0      14   2               shingle  2
>>>>>>>>>         forest   8      14   1               word     3
>>>>>>>>>
>>>>>>>>> SGF     text     start  end  positionLength  type     position
>>>>>>>>>         natural  0      7    1               word     1
>>>>>>>>>       naturwald  0      14   1               SYNONYM  2
>>>>>>>>> forêt naturelle  0      14   1               SYNONYM  2
>>>>>>>>> natürlicher wald 0      14   1               SYNONYM  2
>>>>>>>>>  natural forest  0      14   1               shingle  2
>>>>>>>>>          forest  8      14   1               word     3
>>>>>>>>>
>>>>>>>>> SPF     text     start  end  positionLength  type     position
>>>>>>>>>         natural  0      7    1               word     1
>>>>>>>>>       naturwald  0      9    1               SYNONYM  2
>>>>>>>>> "forêt naturelle"  0    17   2               SYNONYM  2
>>>>>>>>> "natürlicher wald" 0    18   2               SYNONYM  2
>>>>>>>>> "natural forest" 0      16   2               shingle  2
>>>>>>>>>          forest  8      14   1               word     3
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> SGF (SynonymGraphFilter) gives all SYNONYM tokens the same end
>>>>>>>>> position and positionLength.
>>>>>>>>> I suppose that is not correct?
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Bernd
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 09.02.2017 at 18:39, Michael McCandless wrote:
>>>>>>>>>> On Thu, Feb 9, 2017 at 2:40 AM, Bernd Fehling
>>>>>>>>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>>>>>>>>> I tried SynonymGraphFilter with my setup and it works right away.
>>>>>>>>>>> It paid off that I did some modifications on my filters while
>>>>>>>>>>> testing 6.3 with my setup.
>>>>>>>>>>
>>>>>>>>>> Good!
>>>>>>>>>>
>>>>>>>>>>> I only replaced SynonymFilter with SynonymGraphFilter and did not
>>>>>>>>>>> use FlattenGraphFilter, pretty simple. So I can confirm that, up
>>>>>>>>>>> to this point, SynonymGraphFilter is a full replacement for
>>>>>>>>>>> SynonymFilter. At least for search-time synonym handling.
>>>>>>>>>>>
>>>>>>>>>>> But this also means there is still some work with the attributes,
>>>>>>>>>>> right? Position looks good, type and start are no problem anyway,
>>>>>>>>>>> but the end position is still wrong, and so is the positionLength
>>>>>>>>>>> for multi-word synonyms.
>>>>>>>>>>
>>>>>>>>>> Can you give an example or make a small test case?
>>>>>>>>>> PositionLengthAttribute is supposed to be correct coming out of
>>>>>>>>>> SynonymGraphFilter.
>>>>>>>>>>
>>>>>>>>>>> One thing I noticed was that the originating token which "produces"
>>>>>>>>>>> synonyms comes out last from SynonymGraphFilter, after the
>>>>>>>>>>> "produced" synonyms.
>>>>>>>>>>> I will have a look inside with debugger but I guess this is due
>>>>>>>>>>> to output buffering of SynonymGraphFilter?
>>>>>>>>>>
>>>>>>>>>> Yeah they do come out in a different order, which token filters are
>>>>>>>>>> allowed to do in general for all tokens leaving from the same
>>>>>>>>>> position ...
>>>>>>>>>>
>>>>>>>>>> Mike McCandless
>>>>>>>>>>
>>>>>>>>>> http://blog.mikemccandless.com
>>>>>>>>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
