Re: SynonymFilterFactory deprecated since 6.4.0

2017-02-14 Thread Michael McCandless
Here's the new blog post I mentioned earlier in the thread, trying to explain the recent changes to make multi-token synonyms work ... it just went out today: https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch Mike McCandless http://blog.mikemccandless.com On Tue

Re: SynonymFilterFactory deprecated since 6.4.0

2017-02-14 Thread Michael McCandless
Hi Bernd, Actually, pos (which is just the accumulation of PositionIncrementAttribute, starting with -1) is the *start* node. The end node is then pos + PositionLengthAttribute. As far as I know, ShingleFilter is not yet graph friendly: it does not set PositionLengthAttribute. But you could vis
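A minimal sketch of the rule described above: pos accumulates the position increments starting from -1 and is the start node of each token's arc, and the end node is pos plus the position length. This is plain Python, not Lucene code. The `Token` tuple, the `token_arcs` helper, and the attribute values for "naturwald" / "natural forest" are illustrative assumptions and are not taken from the thread's analysis output.

```python
from typing import NamedTuple


class Token(NamedTuple):
    term: str
    pos_inc: int  # stands in for PositionIncrementAttribute
    pos_len: int  # stands in for PositionLengthAttribute


def token_arcs(tokens):
    """Return (term, start_node, end_node) for each token in stream order."""
    pos = -1  # accumulation starts at -1
    arcs = []
    for tok in tokens:
        pos += tok.pos_inc                                # pos is the start node
        arcs.append((tok.term, pos, pos + tok.pos_len))   # end = pos + length
    return arcs


# Hypothetical stream: a one-token synonym spanning a two-token phrase.
tokens = [
    Token("naturwald", pos_inc=1, pos_len=2),
    Token("natural", pos_inc=0, pos_len=1),
    Token("forest", pos_inc=1, pos_len=1),
]

arcs = token_arcs(tokens)
for term, start, end in arcs:
    print(f"{term}: node {start} -> node {end}")
```

In this sketch "naturwald" and "natural" leave the same node because the second token has a position increment of 0, and "naturwald" and "forest" arrive at the same node.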

Re: SynonymFilterFactory deprecated since 6.4.0

2017-02-13 Thread Bernd Fehling
Now it is getting more clear. "pos" (aka position) starts at "-1" and its highest number is the last "node id" of the graph. "pos" minus "positionLength" is the starting "node id" of the arc. Is the tokenStream after each filter always a valid graph? E.g. ShingleFilter with query "natural fores

Re: SynonymFilterFactory deprecated since 6.4.0

2017-02-13 Thread Michael McCandless
On Mon, Feb 13, 2017 at 9:04 AM, Bernd Fehling wrote: > Am I confused by the naming of pos, positionIncrement, offset, positionLength, > start and end between Lucene and Solr? "pos" is just accumulating the positionIncrement values, starting from -1. I don't think Solr's analysis UI would chang

Re: SynonymFilterFactory deprecated since 6.4.0

2017-02-13 Thread Bernd Fehling
After drawing the graph I must admit it looks correct, including all values. Am I confused by the naming of pos, positionIncrement, offset, positionLength, start and end between Lucene and Solr? OK, the SynonymGraphFilter is ONLY for Lucene, right? But how are you going to build the multi-word s

Re: SynonymFilterFactory deprecated since 6.4.0

2017-02-13 Thread Michael McCandless
Unfortunately, I cannot reproduce the problem with a straight Lucene test case. I added this test case to TestSynonymGraphFilter.java: https://gist.github.com/mikemccand/318459ca507742052688e2fe800a10dd And when I run it, it produces the correct token graph: TOKEN: naturwald offset: 0-1

Re: SynonymFilterFactory deprecated since 6.4.0

2017-02-13 Thread Michael McCandless
Thanks Bernd; I'll see if I can make a test case from this. Mike McCandless http://blog.mikemccandless.com On Mon, Feb 13, 2017 at 5:00 AM, Bernd Fehling wrote: > My very simple and small sysonym_test.txt has only one line: > naturwald, natural\ forest, forêt\ naturelle, natürlicher\ wald > >

Re: SynonymFilterFactory deprecated since 6.4.0

2017-02-13 Thread Bernd Fehling
My very simple and small sysonym_test.txt has only one line: naturwald, natural\ forest, forêt\ naturelle, natürlicher\ wald If I only use WT (WhitespaceTokenizer) and SGF (with WhitespaceTokenizer) the result is: WT text start end positionLength type position natural 0

Re: SynonymFilterFactory deprecated since 6.4.0

2017-02-10 Thread Michael McCandless
Yeah, those tokens should have position length 2. Can you reduce to a small set of synonyms and text? If you use only whitespace tokenizer and SGF does the issue reproduce? Mike McCandless http://blog.mikemccandless.com On Fri, Feb 10, 2017 at 10:07 AM, Bernd Fehling wrote: > Example for pos
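A rough illustration of why the position length matters, assuming "those tokens" refers to a single-token synonym that stands for a two-token phrase (the quoted message is cut off, so this is a guess). The sketch enumerates every path through a small token graph in plain Python. The arcs and the `paths` helper are hypothetical and not part of Lucene.

```python
def paths(arcs, start, end):
    """List every term sequence leading from node `start` to node `end`.

    `arcs` is a list of (term, start_node, end_node) with end_node > start_node.
    """
    if start == end:
        return [[]]
    found = []
    for term, s, e in arcs:
        if s == start:
            for rest in paths(arcs, e, end):
                found.append([term] + rest)
    return found


# Hypothetical graph with position length 2 on the single-token synonym.
good = [("naturwald", 0, 2), ("natural", 0, 1), ("forest", 1, 2)]
# Same tokens, but the synonym only has position length 1.
bad = [("naturwald", 0, 1), ("natural", 0, 1), ("forest", 1, 2)]

good_paths = paths(good, 0, 2)
bad_paths = paths(bad, 0, 2)
print(good_paths)
print(bad_paths)
```

With position length 2 the synonym is an alternative to the whole phrase. With position length 1 it is only an alternative to the first word, so the graph also admits the unintended sequence "naturwald forest".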

Re: SynonymFilterFactory deprecated since 6.4.0

2017-02-10 Thread Bernd Fehling
Example for position end and positionLength of SGF. query: natural forest WT text start end positionLength type position natural 0 7 1 word 1 forest 8 14 1 word 2 ... SPF text start end positionLength type

Re: SynonymFilterFactory deprecated since 6.4.0

2017-02-09 Thread Michael McCandless
On Thu, Feb 9, 2017 at 2:40 AM, Bernd Fehling wrote: > I tried SynonymGraphFilter with my setup and it works right away. > It paid off that I did some modifications on my filters while > testing 6.3 with my setup. Good! > I only replaced SynonymFilter with SynonymGraphFilter and did not > use Fl

Re: SynonymFilterFactory deprecated since 6.4.0

2017-02-08 Thread Bernd Fehling
I tried SynonymGraphFilter with my setup and it works right away. It paid off that I did some modifications on my filters while testing 6.3 with my setup. I only replaced SynonymFilter with SynonymGraphFilter and did not use FlattenGraphFilter, pretty simple. So I can confirm that, up to this poin

Re: SynonymFilterFactory deprecated since 6.4.0

2017-02-07 Thread Michael McCandless
Thanks for sharing; it looks like a nice set of synonyms! It's good that you already apply them at search-time not index-time. In that case, you should not use the FlattenGraphFilter, because SynonymGraphFilter will produce a correct graph (unlike SynonymFilter) and the Lucene query parsers (not

Re: SynonymFilterFactory deprecated since 6.4.0

2017-02-07 Thread Bernd Fehling
Years ago (2007) I installed the Eurovoc Thesaurus to work with our Search Engine as a multilingual search (terms and phrases in 22 languages). http://www.ub.uni-bielefeld.de/~befehl/base/solr/InsideBase_eurovocThesaurus.html The synonyms.txt file is 8.8MB in size and gets as FST over 300.000 mappin

Re: SynonymFilterFactory deprecated since 6.4.0

2017-02-07 Thread Michael McCandless
That's great that multi-token synonyms are working for you; can you describe how you use them? This blog post describes some of the problems: http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html I'm working on another blog post to describe the recent changes ... should be out

SynonymFilterFactory deprecated since 6.4.0

2017-02-07 Thread Bernd Fehling
I just tried Solr 6.4.1 and noticed that SynonymFilterFactory is deprecated, as reported in the logs. I hope that this is just to note that there is also an alternative SynonymGraphFilterFactory now available. And _not_ that SynonymFilterFactory will disappear, because it runs my multi-word Synon