Re: SynonymFilterFactory deprecated since 6.4.0

2017-02-13 Thread Bernd Fehling
Now it is getting clearer. "pos" (aka position) starts at "-1" and its highest number is the last "node id" of the graph. "pos" minus "positionLength" is the starting "node id" of the arc. Is the tokenStream after each filter always a valid graph? E.g. ShingleFilter with query "natural fores

Re: Proper Use of SynonymGraphFilter

2017-02-13 Thread Corbin, J.D.
Hi Mike, Thanks for the response. Sounds like I was using it incorrectly by specifying the SynonymGraphFilter at query time AND SynonymGraphFilter followed by FlattenGraphFilter at index time. I need to specify one or the other. J.D. J.D. Corbin Senior Research Engineer Advanced Computing &

Re: Proper Use of SynonymGraphFilter

2017-02-13 Thread Michael McCandless
Hi J.D., First you need to decide if it's OK to do all your syns at search time. It results in slower queries, and different scoring, yet correct multi-token results, vs. index time. If that is OK, then you should not use any syn filter at index time, and use only SynonymGraphFilter at search ti

how do i improve Indexing and Searching performance of 2 billion documents over SolrCloud

2017-02-13 Thread yeshwanth kumar
Hi, we have 4 Solr instances running. We are using SolrCloud to index HBase table column names. Each column in HBase ends up as a document in Solr, which has resulted in over 2 billion documents in Solr. The primary goal is to search the column names. We have 4 shards for the collection, queries a

Proper Use of SynonymGraphFilter

2017-02-13 Thread Corbin, J.D.
Hi, I am looking for some guidance on the proper use of the SynonymGraphFilter in Lucene (6.4.1). Below is how I am implementing the analyzers for the index and query sides. I don't see a lot of examples on the proper usage of the SynonymGraphFilter so was hoping that someone (Michael McCandless?

RE: Unable to build Solr 5.5.3 from source

2017-02-13 Thread Uwe Schindler
Hi, I cannot reproduce this with Solr 5.5.4 (coming out soon). With a completely empty ~/.ivy/cache dir it builds and is able to download everything. This error is in most cases caused by stale lock files (*.lck) in the IVY Cache. Uwe - Uwe Schindler Achterdiek 19, D-28357 Bremen http://ww

Re: operators within quoted queries

2017-02-13 Thread Erick Erickson
Attach &debug=query to the URL when you fire this query and you'll see exactly how it parses which should help you diagnose the problem. Some places to look: 1> There are options that treat lower case operators as valid. Normally, Solr only treats 'AND' as an operator not 'and' but this can be ove

operators within quoted queries

2017-02-13 Thread Kameron Cole
I have noticed odd behavior in the query "Conceal and Carry". I have legal customers who need to find exactly this phrase because, as you know, it refers to a specific set of gun laws. However, this query is not behaving like a traditional quoted query - my assumption is that a quoted string is

Re: SynonymFilterFactory deprecated since 6.4.0

2017-02-13 Thread Michael McCandless
On Mon, Feb 13, 2017 at 9:04 AM, Bernd Fehling wrote: > Am I confused by the naming of pos, positionIncrement, offset, positionLength, > start and end between Lucene and Solr? "pos" is just accumulating the positionIncrement values, starting from -1. I don't think Solr's analysis UI would chang
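
The accumulation described in this reply can be sketched in a few lines. This is only an illustration, not Lucene or Solr code: the function name `token_arcs` and the three sample tokens are hypothetical and are not taken from the thread's test output. It assumes the usual Lucene reading, in which a token leaves the node at its accumulated position and arrives at that position plus its positionLength; an analysis UI may label these values differently.

```python
def token_arcs(tokens):
    """Turn (term, position_increment, position_length) triples into graph arcs.

    Returns a list of (term, from_node, to_node) tuples.
    """
    pos = -1  # positions accumulate from -1, so a first increment of 1 gives node 0
    arcs = []
    for term, increment, length in tokens:
        pos += increment  # "pos" is the running sum of positionIncrement values
        arcs.append((term, pos, pos + length))  # assumed: the arc spans positionLength nodes
    return arcs


# Hypothetical token stream: a single-token synonym stacked on a two-word phrase.
tokens = [
    ("natural", 1, 1),    # first token, advances to node 0
    ("naturwald", 0, 2),  # increment 0: same start node, spans two positions
    ("forest", 1, 1),     # advances to node 1
]
arcs = token_arcs(tokens)
for term, start, end in arcs:
    print(f"{term}: {start} -> {end}")
```

In this made-up stream, "naturwald" and the two-token path "natural forest" both run from node 0 to node 2, which is the shape a multi-token synonym takes in a token graph.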

Re: SynonymFilterFactory deprecated since 6.4.0

2017-02-13 Thread Bernd Fehling
After drawing the graph I must admit it looks correct, including all values. Am I confused by the naming of pos, positionIncrement, offset, positionLength, start and end between Lucene and Solr? OK, the SynonymGraphFilter is ONLY for Lucene, right? But how are you going to build the multi-word s

Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

2017-02-13 Thread Michael McCandless
On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion wrote: > I'd like to construct an Automaton to prefix match against a large set of > strings. I gather a RunAutomaton is immutable, thread safe and faster than > Automaton. That's correct. > Are there any other differences between the three Autom

Re: SynonymFilterFactory deprecated since 6.4.0

2017-02-13 Thread Michael McCandless
Unfortunately, I cannot reproduce the problem with a straight Lucene test case. I added this test case to TestSynonymGraphFilter.java: https://gist.github.com/mikemccand/318459ca507742052688e2fe800a10dd And when I run it, it produces the correct token graph: TOKEN: naturwald offset: 0-1

Re: SynonymFilterFactory deprecated since 6.4.0

2017-02-13 Thread Michael McCandless
Thanks Bernd; I'll see if I can make a test case from this. Mike McCandless http://blog.mikemccandless.com On Mon, Feb 13, 2017 at 5:00 AM, Bernd Fehling wrote: > My very simple and small sysonym_test.txt has only one line: > naturwald, natural\ forest, forêt\ naturelle, natürlicher\ wald > >

Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

2017-02-13 Thread Oliver Mannion
Hi there, I'd like to construct an Automaton to prefix match against a large set of strings. I gather a RunAutomaton is immutable, thread safe and faster than Automaton. Are there any other differences between the three Automaton classes, for example, in memory usage? Would the general approach

Re: SynonymFilterFactory deprecated since 6.4.0

2017-02-13 Thread Bernd Fehling
My very simple and small sysonym_test.txt has only one line: naturwald, natural\ forest, forêt\ naturelle, natürlicher\ wald If I only use WT (WhitespaceTokenizer) and SGF (with WhitespaceTokenizer) the result is: WT text start end positionLength type position natural 0