My very simple and small sysonym_test.txt has only one line:

naturwald, natural\ forest, forêt\ naturelle, natürlicher\ wald

If I use only WT (WhitespaceTokenizer) and SGF (SynonymGraphFilter, with WhitespaceTokenizer), the result is:

WT
text         start  end  positionLength  type     position
natural      0      7    1               word     1
forest       8      14   1               word     2

SGF
text         start  end  positionLength  type     position
natural      0      7    3               word     1
naturelle    0      14   3               SYNONYM  2
wald         0      14   2               SYNONYM  3
naturwald    0      14   4               SYNONYM  1
forêt        0      14   1               SYNONYM  1
natürlicher  0      14   2               SYNONYM  1

forest       8      14   1               word     4

The result is some kind of rubbish.
Also note the empty line between "natürlicher" and "forest".

Is there anything else I should try, maybe with KeywordTokenizer?

P.S. You might have noticed the SPF filters in my setup.
The first is SynonymPreFilter, which sets all attributes to the right values.
The second is SynonymPostFilter, which fixes all attribute settings again,
sets multi-word synonyms as phrases, and cleans up the result of SGF.

Regards
Bernd


On 11.02.2017 at 00:45, Michael McCandless wrote:
> Yeah, those tokens should have position length 2.
>
> Can you reduce to a small set of synonyms and text? If you use only
> whitespace tokenizer and SGF does the issue reproduce?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Fri, Feb 10, 2017 at 10:07 AM, Bernd Fehling
> <bernd.fehl...@uni-bielefeld.de> wrote:
>> Example for position end and positionLength of SGF.
>>
>> query: natural forest
>>
>> WT
>> text                start  end  positionLength  type     position
>> natural             0      7    1               word     1
>> forest              8      14   1               word     2
>> ...
>>
>> SPF
>> text                start  end  positionLength  type     position
>> natural             0      7    1               word     1
>> natural forest      0      14   2               shingle  2
>> forest              8      14   1               word     3
>>
>> SGF
>> text                start  end  positionLength  type     position
>> natural             0      7    1               word     1
>> naturwald           0      14   1               SYNONYM  2
>> forêt naturelle     0      14   1               SYNONYM  2
>> natürlicher wald    0      14   1               SYNONYM  2
>> natural forest      0      14   1               shingle  2
>> forest              8      14   1               word     3
>>
>> SPF
>> text                start  end  positionLength  type     position
>> natural             0      7    1               word     1
>> naturwald           0      9    1               SYNONYM  2
>> "forêt naturelle"   0      17   2               SYNONYM  2
>> "natürlicher wald"  0      18   2               SYNONYM  2
>> "natural forest"    0      16   2               shingle  2
>> forest              8      14   1               word     3
>>
>>
>> SGF (SynonymGraphFilter) reports the same end position and
>> positionLength for all SYNONYMs.
>> I suppose that is not correct?
>>
>> Regards
>> Bernd
>>
>>
>> On 09.02.2017 at 18:39, Michael McCandless wrote:
>>> On Thu, Feb 9, 2017 at 2:40 AM, Bernd Fehling
>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>> I tried SynonymGraphFilter with my setup and it works right away.
>>>> It paid off that I did some modifications on my filters while
>>>> testing 6.3 with my setup.
>>>
>>> Good!
>>>
>>>> I only replaced SynonymFilter with SynonymGraphFilter and did not
>>>> use FlattenGraphFilter, pretty simple. So I can confirm that, up
>>>> to this point, SynonymGraphFilter is a full replacement for
>>>> SynonymFilter. At least for search-time synonym handling.
>>>>
>>>> But this also means there is still some work with the attributes, right?
>>>> Position looks good, type and start are no problem anyway, but
>>>> the end position is still wrong, and so is the positionLength for
>>>> multi-word synonyms.
>>>
>>> Can you give an example or make a small test case?
>>> PositionLengthAttribute is supposed to be correct coming out of
>>> SynonymGraphFilter.
>>>
>>>> One thing I noticed was that the originating token which "produces"
>>>> synonyms comes out last from SynonymGraphFilter, after the
>>>> "produced" synonyms.
>>>> I will have a look inside with debugger but I guess this is due
>>>> to output buffering of SynonymGraphFilter?
>>>
>>> Yeah they do come out in a different order, which token filters are
>>> allowed to do in general for all tokens leaving from the same position
>>> ...
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>
>> --
>> *************************************************************
>> Bernd Fehling                    Bielefeld University Library
>> Dipl.-Inform. (FH)                LibTec - Library Technology
>> Universitätsstr. 25                  and Knowledge Management
>> 33615 Bielefeld
>> Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de
>>
>> BASE - Bielefeld Academic Search Engine - www.base-search.net
>> *************************************************************
>

--
*************************************************************
Bernd Fehling                    Bielefeld University Library
Dipl.-Inform. (FH)                LibTec - Library Technology
Universitätsstr. 25                  and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************
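
One way to sanity-check the SGF rows from the one-line synonym test at the top of this message is to read each token as an arc in a graph: it leaves the node given by its position and arrives at position + positionLength, which is how PositionLengthAttribute is meant to be interpreted. The following is only a minimal sketch in plain Java with no Lucene dependency. The names TokenGraphCheck, Tok, sgfRows and paths are hypothetical and exist only for this illustration; the rows are copied by hand from the SGF table (text, position, positionLength), and offsets and types are ignored.

```java
import java.util.ArrayList;
import java.util.List;

class TokenGraphCheck {

    // Hypothetical stand-in for one row of the analysis output.
    static final class Tok {
        final String text;
        final int position;
        final int positionLength;

        Tok(String text, int position, int positionLength) {
            this.text = text;
            this.position = position;
            this.positionLength = positionLength;
        }

        // Node the token arrives at when it is read as an arc in the token graph.
        int endNode() {
            return position + positionLength;
        }
    }

    // SGF rows from the one-line synonym test, in the order they were emitted.
    static List<Tok> sgfRows() {
        List<Tok> rows = new ArrayList<>();
        rows.add(new Tok("natural", 1, 3));
        rows.add(new Tok("naturelle", 2, 3));
        rows.add(new Tok("wald", 3, 2));
        rows.add(new Tok("naturwald", 1, 4));
        rows.add(new Tok("forêt", 1, 1));
        rows.add(new Tok("natürlicher", 1, 2));
        rows.add(new Tok("forest", 4, 1));
        return rows;
    }

    // Collects every token sequence that leads from node `from` to node `to`.
    static List<String> paths(List<Tok> toks, int from, int to) {
        List<String> out = new ArrayList<>();
        walk(toks, from, to, "", out);
        return out;
    }

    private static void walk(List<Tok> toks, int node, int to, String soFar, List<String> out) {
        if (node == to) {
            out.add(soFar.trim());
            return;
        }
        for (Tok t : toks) {
            // Only follow arcs that move forward, so the walk always terminates.
            if (t.position == node && t.positionLength > 0) {
                walk(toks, t.endNode(), to, soFar + " " + t.text, out);
            }
        }
    }

    public static void main(String[] args) {
        // Positions in the table are 1-based: "natural" starts at node 1
        // and "forest" (position 4, positionLength 1) arrives at node 5.
        for (String p : paths(sgfRows(), 1, 5)) {
            System.out.println(p);
        }
    }
}
```

If every printed path is one complete variant from the synonym line, then position and positionLength are at least consistent with each other as a graph, and what remains to look at would be the end offsets and the order in which the tokens are emitted.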