Here's the new blog post I mentioned earlier in the thread, trying to
explain the recent changes to make multi-token synonyms work ... it
just went out today:
https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch
Mike McCandless
http://blog.mikemccandless.com
On Tue
Hi Bernd,
Actually, pos (which is just the accumulation of
PositionIncrementAttribute, starting with -1) is the *start* node.
The end node is then pos + PositionLengthAttribute.
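To make the start/end node rule above concrete, here is a minimal sketch in plain Python (no Lucene involved). The token values are a hypothetical illustration, not actual SynonymGraphFilter output, and the helper names `token_arcs` and `all_paths` are made up for this example.

```python
from collections import defaultdict


def token_arcs(tokens):
    """Turn (term, positionIncrement, positionLength) triples into graph arcs.

    pos accumulates the position increments, starting from -1, and is the
    start node of each token; the end node is pos + positionLength.
    """
    pos = -1
    arcs = []
    for term, pos_inc, pos_len in tokens:
        pos += pos_inc
        arcs.append((pos, pos + pos_len, term))
    return arcs


def all_paths(arcs):
    """Enumerate every term sequence from the first node to the last node."""
    if not arcs:
        return []
    by_start = defaultdict(list)
    for start, end, term in arcs:
        by_start[start].append((end, term))
    first = min(start for start, _, _ in arcs)
    last = max(end for _, end, _ in arcs)
    paths = []

    def walk(node, prefix):
        if node == last:
            paths.append(prefix)
            return
        for end, term in by_start[node]:
            walk(end, prefix + [term])

    walk(first, [])
    return paths


# Hypothetical token stream: a single-token synonym spanning two positions,
# stacked on top of the two original tokens.
tokens = [
    ("naturwald", 1, 2),  # posInc=1, posLen=2
    ("natural", 0, 1),    # posInc=0: same start node as naturwald
    ("forest", 1, 1),
]
arcs = token_arcs(tokens)
paths = all_paths(arcs)
```

Under these assumed values, naturwald is one arc from node 0 to node 2, while natural and forest form the two-arc path 0 to 1 to 2, so the graph has exactly two paths.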
As far as I know, ShingleFilter is not yet graph friendly: it does not
set PositionLengthAttribute. But you could vis
Now it is getting more clear.
"pos" (aka position) starts at "-1" and its highest number is the
last "node id" of the graph.
"pos" minus "positionLength" is the starting "node id" of the arc.
Is the tokenStream after each filter always a valid graph?
E.g. ShingleFilter with query "natural fores
On Mon, Feb 13, 2017 at 9:04 AM, Bernd Fehling
wrote:
> Am I confused by the naming of pos, positionIncrement, offset, positionLength,
> start and end between Lucene and Solr?
"pos" is just accumulating the positionIncrement values, starting from
-1. I don't think Solr's analysis UI would chang
After drawing the graph I must admit it looks correct, including all values.
Am I confused by the naming of pos, positionIncrement, offset, positionLength,
start and end between Lucene and Solr?
OK, the SynonymGraphFilter is ONLY for Lucene, right?
But how are you going to build the multi-word s
Unfortunately, I cannot reproduce the problem with a straight Lucene
test case. I added this test case to TestSynonymGraphFilter.java:
https://gist.github.com/mikemccand/318459ca507742052688e2fe800a10dd
And when I run it, it produces the correct token graph:
TOKEN: naturwald
offset: 0-1
Thanks Bernd; I'll see if I can make a test case from this.
Mike McCandless
http://blog.mikemccandless.com
On Mon, Feb 13, 2017 at 5:00 AM, Bernd Fehling
wrote:
> My very simple and small sysonym_test.txt has only one line:
> naturwald, natural\ forest, forêt\ naturelle, natürlicher\ wald
>
>
My very simple and small sysonym_test.txt has only one line:
naturwald, natural\ forest, forêt\ naturelle, natürlicher\ wald
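As a rough illustration of how that one line breaks down into entries and tokens, here is a simplified sketch in plain Python. It is not Solr's synonym parser: it assumes a backslash only protects the character that follows it, that commas separate equivalent entries, and that each entry is then split on whitespace (as a whitespace tokenizer would). The function name `parse_synonym_line` is hypothetical.

```python
def parse_synonym_line(line):
    """Split one comma-separated synonym line into whitespace-tokenized entries.

    Simplified assumption: a backslash keeps the next character literally
    (so an escaped space stays inside its entry), and a bare comma ends
    the current entry.
    """
    entries = []
    current = []
    chars = iter(line)
    for ch in chars:
        if ch == "\\":
            # Keep the escaped character as-is; tolerate a trailing backslash.
            current.append(next(chars, ""))
        elif ch == ",":
            entries.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    entries.append("".join(current).strip())
    # Drop empty entries, then split each remaining entry into tokens.
    return [entry.split() for entry in entries if entry]


line = r"naturwald, natural\ forest, forêt\ naturelle, natürlicher\ wald"
entries = parse_synonym_line(line)
```

Read this way, the line holds one single-token entry and three two-token entries, which is why a match on any of the two-token forms has to span two positions.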
If I only use WT (WhitespaceTokenizer) and SGF (with WhitespaceTokenizer)
the result is:
WT text start end positionLength type position
natural 0
Yeah, those tokens should have position length 2.
Can you reduce to a small set of synonyms and text? If you use only
whitespace tokenizer and SGF does the issue reproduce?
Mike McCandless
http://blog.mikemccandless.com
On Fri, Feb 10, 2017 at 10:07 AM, Bernd Fehling
wrote:
> Example for pos
Example for position end and positionLength of SGF.
query: natural forest
WT text start end positionLength type position
natural 0 7 1 word 1
forest 8 14 1 word 2
...
SPF text start end positionLength type
On Thu, Feb 9, 2017 at 2:40 AM, Bernd Fehling
wrote:
> I tried SynonymGraphFilter with my setup and it works right away.
> It paid off that I did some modifications on my filters while
> testing 6.3 with my setup.
Good!
> I only replaced SynonymFilter with SynonymGraphFilter and did not
> use Fl
I tried SynonymGraphFilter with my setup and it works right away.
It paid off that I did some modifications on my filters while
testing 6.3 with my setup.
I only replaced SynonymFilter with SynonymGraphFilter and did not
use FlattenGraphFilter, pretty simple. So I can confirm that, up
to this poin
Thanks for sharing; it looks like a nice set of synonyms!
It's good that you already apply them at search-time not index-time.
In that case, you should not use the FlattenGraphFilter, because
SynonymGraphFilter will produce a correct graph (unlike SynonymFilter)
and the Lucene query parsers (not
Years ago (2007) I've installed Eurovoc Thesaurus to work with our
Search Engine as multilingual search (terms and phrases in 22 languages).
http://www.ub.uni-bielefeld.de/~befehl/base/solr/InsideBase_eurovocThesaurus.html
The synonyms.txt file is 8.8MB in size and gets as FST over 300.000 mappin
That's great that multi-token synonyms are working for you; can you
describe how you use them?
This blog post describes some of the problems:
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
I'm working on another blog post to describe the recent changes ...
should be out
I just tried Solr 6.4.1 and noticed that SynonymFilterFactory is
deprecated, as reported in the logs.
I hope that this is just to note that there is also an alternative
SynonymGraphFilterFactory now available.
And _not_ that SynonymFilterFactory will disappear, because it runs my
multi-word Synon