Here's the new blog post I mentioned earlier in the thread, trying to explain the recent changes to make multi-token synonyms work ... it just went out today: https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch
Mike McCandless

http://blog.mikemccandless.com

On Tue, Feb 14, 2017 at 6:19 AM, Michael McCandless
<luc...@mikemccandless.com> wrote:
> Hi Bernd,
>
> Actually, pos (which is just the accumulation of
> PositionIncrementAttribute, starting with -1) is the *start* node.
>
> The end node is then pos + PositionLengthAttribute.
>
> As far as I know, ShingleFilter is not yet graph friendly: it does not
> set PositionLengthAttribute. But you could visualize how it should be
> setting it ...
>
> Also, note that synonym filter cannot handle an incoming graph
> properly, so if you run ShingleFilter before it, it's not going to do
> the right thing. For that we need something like
> https://issues.apache.org/jira/browse/LUCENE-5012 ... the branch for
> that issue already has a SynonymFilter that accepts incoming graphs,
> but it's a biggish change.
>
> The docs are indeed outdated; I'll repair them. Thank you!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Feb 14, 2017 at 2:41 AM, Bernd Fehling
> <bernd.fehl...@uni-bielefeld.de> wrote:
>> Now it is getting clearer.
>>
>> "pos" (aka position) starts at "-1" and its highest number is the
>> last "node id" of the graph.
>>
>> "pos" minus "positionLength" is the starting "node id" of the arc.
>>
>> Is the tokenStream after each filter always a valid graph?
>>
>> E.g. ShingleFilter with query "natural forest":
>> SF  text            start  end  positionLength  type     position
>>     natural         0      7    1               word     1
>>     natural forest  0      14   2               shingle  1
>>     forest          8      14   1               word     2
>>
>> (0)--- natural --->(1)--- forest --->(2)
>> But how to insert the shingle into this graph?
>>
>> This is why I added a SynonymPreFilter to correct the graph between
>> ShingleFilter and SynonymGraphFilter. But I had the wrong understanding
>> of pos, positionIncrement, positionLength, ...
>>
>>
>> Another question, the API docs say "...Injecting synonyms – here,
>> synonyms of a token should be added after that token..."
>> But as I already mentioned, the synonyms are added before the token.
>> Are the docs outdated?
>>
>>
>> Regards
>> Bernd
>>
>>
>> On 13.02.2017 at 17:31, Michael McCandless wrote:
>>> On Mon, Feb 13, 2017 at 9:04 AM, Bernd Fehling
>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>
>>>> Am I confused by the naming of pos, positionIncrement, offset,
>>>> positionLength, start and end between Lucene and Solr?
>>>
>>> "pos" is just accumulating the positionIncrement values, starting from
>>> -1. I don't think Solr's analysis UI would change the meaning of
>>> these attributes.
>>>
>>>> OK, the SynonymGraphFilter is ONLY for Lucene, right?
>>>
>>> No, it's also for Solr and Elasticsearch and any other search servers
>>> on top of Lucene.
>>>
>>>> But how are you going to build the multi-word synonym query
>>>> "natürlicher wald" from "natural forest"?
>>>
>>> Lucene's and Elasticsearch's query parsers have already been fixed to
>>> correctly handle token graphs by default; Solr has a fork of Lucene's
>>> query parser I think ... I'm not sure if it's been fixed yet to
>>> interpret graphs.
>>>
>>> See e.g. https://issues.apache.org/jira/browse/LUCENE-7603 and
>>> https://issues.apache.org/jira/browse/LUCENE-7638
>>>
>>>> And how are you going to highlight a synonym hit for "natürlicher wald"
>>>> when start and end are set to 0-14 and not to 0-18?
>>>> Or are start and end not used for highlighting?
>>>
>>> This start/end offset, at query time, is not normally used. If you
>>> have a document in the index that has "natürlicher wald" then it would
>>> have offsets X to X+18, stored in the index ideally as postings
>>> offsets, and should highlight correctly?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>> On 13.02.2017 at 14:24, Michael McCandless wrote:
>>>>> Unfortunately, I cannot reproduce the problem with a straight Lucene
>>>>> test case.
>>>>> I added this test case to TestSynonymGraphFilter.java:
>>>>>
>>>>> https://gist.github.com/mikemccand/318459ca507742052688e2fe800a10dd
>>>>>
>>>>> And when I run it, it produces the correct token graph:
>>>>>
>>>>> TOKEN: naturwald
>>>>>   offset: 0-14
>>>>>   pos: 0-4
>>>>>   type: SYNONYM
>>>>>
>>>>> TOKEN: forêt
>>>>>   offset: 0-14
>>>>>   pos: 0-1
>>>>>   type: SYNONYM
>>>>>
>>>>> TOKEN: natürlicher
>>>>>   offset: 0-14
>>>>>   pos: 0-2
>>>>>   type: SYNONYM
>>>>>
>>>>> TOKEN: natural
>>>>>   offset: 0-7
>>>>>   pos: 0-3
>>>>>   type: word
>>>>>
>>>>> TOKEN: naturelle
>>>>>   offset: 0-14
>>>>>   pos: 1-4
>>>>>   type: SYNONYM
>>>>>
>>>>> TOKEN: wald
>>>>>   offset: 0-14
>>>>>   pos: 2-4
>>>>>   type: SYNONYM
>>>>>
>>>>> TOKEN: forest
>>>>>   offset: 8-14
>>>>>   pos: 3-4
>>>>>   type: word
>>>>>
>>>>> Remember that the "pos: " output above is really "node IDs" and you
>>>>> can see the inserted side paths are correct. The offsets are
>>>>> necessarily always 0-14 for inserted tokens because that is the span
>>>>> of the two original tokens.
>>>>>
>>>>> Can you try removing the SPF filters in your test? Or otherwise
>>>>> simplify your test so it's closer to what my test case is doing?
>>>>>
>>>>> Mike McCandless
>>>>>
>>>>> http://blog.mikemccandless.com
>>>>>
>>>>> On Mon, Feb 13, 2017 at 7:52 AM, Michael McCandless
>>>>> <luc...@mikemccandless.com> wrote:
>>>>>> Thanks Bernd; I'll see if I can make a test case from this.
>>>>>>
>>>>>> Mike McCandless
>>>>>>
>>>>>> http://blog.mikemccandless.com
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 13, 2017 at 5:00 AM, Bernd Fehling
>>>>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>>>>> My very simple and small sysonym_test.txt has only one line:
>>>>>>> naturwald, natural\ forest, forêt\ naturelle, natürlicher\ wald
>>>>>>>
>>>>>>> If I only use WT (WhitespaceTokenizer) and SGF (with
>>>>>>> WhitespaceTokenizer) the result is:
>>>>>>>
>>>>>>> WT   text         start  end  positionLength  type     position
>>>>>>>      natural      0      7    1               word     1
>>>>>>>      forest       8      14   1               word     2
>>>>>>>
>>>>>>> SGF  text         start  end  positionLength  type     position
>>>>>>>      natural      0      7    3               word     1
>>>>>>>      naturelle    0      14   3               SYNONYM  2
>>>>>>>      wald         0      14   2               SYNONYM  3
>>>>>>>      naturwald    0      14   4               SYNONYM  1
>>>>>>>      forêt        0      14   1               SYNONYM  1
>>>>>>>      natürlicher  0      14   2               SYNONYM  1
>>>>>>>
>>>>>>>      forest       8      14   1               word     4
>>>>>>>
>>>>>>> The result is some kind of rubbish.
>>>>>>> Also note the empty line between "natürlicher" and "forest".
>>>>>>>
>>>>>>> Anything else I should try, maybe with KeywordTokenizer?
>>>>>>>
>>>>>>> P.S. You might have noticed the SPF filters in my setup.
>>>>>>> The first is SynonymPreFilter, which sets all attributes to the right
>>>>>>> value; the second is SynonymPostFilter, which fixes the attribute
>>>>>>> settings again, marks multi-word synonyms as phrases, and cleans up
>>>>>>> the result of SGF.
>>>>>>>
>>>>>>> Regards
>>>>>>> Bernd
>>>>>>>
>>>>>>> On 11.02.2017 at 00:45, Michael McCandless wrote:
>>>>>>>> Yeah, those tokens should have position length 2.
>>>>>>>>
>>>>>>>> Can you reduce to a small set of synonyms and text? If you use only
>>>>>>>> whitespace tokenizer and SGF, does the issue reproduce?
>>>>>>>>
>>>>>>>> Mike McCandless
>>>>>>>>
>>>>>>>> http://blog.mikemccandless.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Feb 10, 2017 at 10:07 AM, Bernd Fehling
>>>>>>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>>>>>>> Example for position end and positionLength of SGF.
>>>>>>>>>
>>>>>>>>> query: natural forest
>>>>>>>>>
>>>>>>>>> WT   text                start  end  positionLength  type     position
>>>>>>>>>      natural             0      7    1               word     1
>>>>>>>>>      forest              8      14   1               word     2
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>> SPF  text                start  end  positionLength  type     position
>>>>>>>>>      natural             0      7    1               word     1
>>>>>>>>>      natural forest      0      14   2               shingle  2
>>>>>>>>>      forest              8      14   1               word     3
>>>>>>>>>
>>>>>>>>> SGF  text                start  end  positionLength  type     position
>>>>>>>>>      natural             0      7    1               word     1
>>>>>>>>>      naturwald           0      14   1               SYNONYM  2
>>>>>>>>>      forêt naturelle     0      14   1               SYNONYM  2
>>>>>>>>>      natürlicher wald    0      14   1               SYNONYM  2
>>>>>>>>>      natural forest      0      14   1               shingle  2
>>>>>>>>>      forest              8      14   1               word     3
>>>>>>>>>
>>>>>>>>> SPF  text                start  end  positionLength  type     position
>>>>>>>>>      natural             0      7    1               word     1
>>>>>>>>>      naturwald           0      9    1               SYNONYM  2
>>>>>>>>>      "forêt naturelle"   0      17   2               SYNONYM  2
>>>>>>>>>      "natürlicher wald"  0      18   2               SYNONYM  2
>>>>>>>>>      "natural forest"    0      16   2               shingle  2
>>>>>>>>>      forest              8      14   1               word     3
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> SGF (SynonymGraphFilter) gives all SYNONYM tokens the same end
>>>>>>>>> position and positionLength.
>>>>>>>>> I suppose that is not correct?
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Bernd
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 09.02.2017 at 18:39, Michael McCandless wrote:
>>>>>>>>>> On Thu, Feb 9, 2017 at 2:40 AM, Bernd Fehling
>>>>>>>>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>>>>>>>>> I tried SynonymGraphFilter with my setup and it worked right away.
>>>>>>>>>>> It paid off that I made some modifications to my filters while
>>>>>>>>>>> testing 6.3 with my setup.
>>>>>>>>>>
>>>>>>>>>> Good!
>>>>>>>>>>
>>>>>>>>>>> I only replaced SynonymFilter with SynonymGraphFilter and did not
>>>>>>>>>>> use FlattenGraphFilter, pretty simple. So I can confirm that, up
>>>>>>>>>>> to this point, SynonymGraphFilter is a full replacement for
>>>>>>>>>>> SynonymFilter, at least for search-time synonym handling.
>>>>>>>>>>>
>>>>>>>>>>> But this also means there is still some work with the attributes,
>>>>>>>>>>> right?
>>>>>>>>>>> Position looks good, type and start are no problem anyway, but
>>>>>>>>>>> the end position and the positionLength are still wrong for
>>>>>>>>>>> multi-word synonyms.
>>>>>>>>>>
>>>>>>>>>> Can you give an example or make a small test case?
>>>>>>>>>> PositionLengthAttribute is supposed to be correct coming out of
>>>>>>>>>> SynonymGraphFilter.
>>>>>>>>>>
>>>>>>>>>>> One thing I noticed was that the originating token which "produces"
>>>>>>>>>>> synonyms comes out last from SynonymGraphFilter, after the
>>>>>>>>>>> "produced" synonyms.
>>>>>>>>>>> I will have a look inside with the debugger, but I guess this is
>>>>>>>>>>> due to output buffering of SynonymGraphFilter?
>>>>>>>>>>
>>>>>>>>>> Yeah, they do come out in a different order, which token filters are
>>>>>>>>>> allowed to do in general for all tokens leaving from the same
>>>>>>>>>> position ...
>>>>>>>>>>
>>>>>>>>>> Mike McCandless
>>>>>>>>>>
>>>>>>>>>> http://blog.mikemccandless.com
>>>>>>>>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
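
The node arithmetic described at the top of this thread (pos accumulates PositionIncrementAttribute starting from -1 and is the start node; the end node is pos + PositionLengthAttribute) can be sketched in a few lines of plain Python. This is only an illustration and not Lucene code: the Token class and the to_arcs and paths helpers are hypothetical names, and the position increments are back-derived from the "pos: start-end" values in the test output quoted above, not read from a real TokenStream.

```python
from dataclasses import dataclass


@dataclass
class Token:
    term: str
    pos_inc: int  # stands in for PositionIncrementAttribute
    pos_len: int  # stands in for PositionLengthAttribute


def to_arcs(tokens):
    """Turn a token list into (term, start_node, end_node) arcs."""
    arcs = []
    pos = -1  # pos starts at -1, so the first increment of 1 lands on node 0
    for tok in tokens:
        pos += tok.pos_inc
        arcs.append((tok.term, pos, pos + tok.pos_len))
    return arcs


def paths(arcs, node, end):
    """List every sequence of terms leading from node to end."""
    if node == end:
        return [[]]
    found = []
    for term, start, stop in arcs:
        if start == node:
            for rest in paths(arcs, stop, end):
                found.append([term] + rest)
    return found


# Assumed increments/lengths, reconstructed from the quoted test output.
TOKENS = [
    Token("naturwald", 1, 4),
    Token("forêt", 0, 1),
    Token("natürlicher", 0, 2),
    Token("natural", 0, 3),
    Token("naturelle", 1, 3),
    Token("wald", 1, 2),
    Token("forest", 1, 1),
]

arcs = to_arcs(TOKENS)
last_node = max(stop for _, _, stop in arcs)
phrases = [" ".join(p) for p in paths(arcs, 0, last_node)]

for term, start, stop in arcs:
    print(f"{term}: pos {start}-{stop}")
for phrase in phrases:
    print(phrase)
```

Under those assumptions the arcs match the "pos:" ranges in the test output, and walking the graph from node 0 to node 4 gives four routes: naturwald, forêt naturelle, natürlicher wald, and natural forest. That is presumably the set of alternatives a graph-aware query parser has to consider for the query "natural forest".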