Re: https://issues.apache.org/jira/browse/LUCENE-8448

2020-11-13 Thread baris . kazar
Great answer, thanks Michael. Yes, the difference was too much, > 1G. Best regards > On Nov 13, 2020, at 1:49 PM, Michael Sokolov wrote: > > You can't directly compare disk usage across two indexes, even with > the same data. Try re-indexing one of your datasets, and you will see > that the disk s

Re: https://issues.apache.org/jira/browse/LUCENE-8448

2020-11-13 Thread Michael Sokolov
You can't directly compare disk usage across two indexes, even with the same data. Try re-indexing one of your datasets, and you will see that the disk size is not the same. Mostly this is due to the way segments are merged varying with some randomness from one run to another, although the size of

Re: best way (performance wise) to search for field without value?

2020-11-13 Thread Matt Davis
With Zulia we chose to rewrite fieldName:* queries to hiddenField:fieldName and add all field names that are present to a hidden field automatically as Uwe described as an alternative. It seems to work well. https://github.com/zuliaio/zuliasearch/blob/master/zulia-query-parser/src/main/java/io/zu
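The rewrite Matt describes can be sketched in plain Java. This is a toy model only, not Zulia's or Lucene's code: the class name, the map-based document, and the use of the literal name "hiddenField" are assumptions made for illustration. It shows the two halves of the idea: at index time, record the names of all fields that have a value in one hidden field; at query time, turn `fieldName:*` into a cheap single-term lookup on that hidden field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HiddenFieldRewrite {
    // Hypothetical name of the hidden field that lists every field present in a document.
    static final String HIDDEN_FIELD = "hiddenField";

    /** Index-time step: copy the document and add a hidden field naming all non-empty fields. */
    static Map<String, List<String>> withFieldNames(Map<String, List<String>> doc) {
        Map<String, List<String>> out = new LinkedHashMap<>(doc);
        List<String> present = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : doc.entrySet()) {
            if (e.getValue() != null && !e.getValue().isEmpty()) {
                present.add(e.getKey());
            }
        }
        out.put(HIDDEN_FIELD, present);
        return out;
    }

    /** Query-time step: rewrite "fieldName:*" to "hiddenField:fieldName"; anything else passes through. */
    static String rewrite(String query) {
        int colon = query.indexOf(':');
        if (colon <= 0) {
            return query;
        }
        String field = query.substring(0, colon);
        String value = query.substring(colon + 1);
        // "*:*" (match all) is left alone; only "<concrete field>:*" is rewritten.
        if (value.equals("*") && !field.equals("*")) {
            return HIDDEN_FIELD + ":" + field;
        }
        return query;
    }
}
```

A "field has no value" search then becomes a must-not clause on the rewritten term, rather than a scan over every term of the field.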

Re: https://issues.apache.org/jira/browse/LUCENE-8448

2020-11-13 Thread baris . kazar
Nothing changed between the two index generations except that the data changed a bit, as I described. The size I am reporting is that of the directory where all index files are stored, measured once Lucene is done generating the index. I don't know about deleted docs. How do you trace that? Yes, the queries
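For reference, a minimal sketch of how such a directory size can be measured with the Java standard library. The class and method names are hypothetical, and it assumes the index directory is flat (no subdirectories). As noted elsewhere in the thread, a number obtained this way depends on how segments happened to be merged and on deleted documents, so it is not directly comparable between two indexing runs.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

final class IndexDirSize {
    /** Sums the sizes, in bytes, of all regular files directly inside the given directory. */
    static long sizeInBytes(Path dir) {
        try (Stream<Path> files = Files.list(dir)) {
            return files
                .filter(Files::isRegularFile)
                .mapToLong(p -> {
                    try {
                        return Files.size(p);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                })
                .sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```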

Re: best way (performance wise) to search for field without value?

2020-11-13 Thread Uwe Schindler
Hi, Solr and Elasticsearch implement the exists query like this, which is fully in line with your investigation: if a field has docvalues it uses DocValuesFieldExistsQuery, if it is a tokenized field it uses the NormsFieldExistsQuery. The negative one is a must-not clause, which is perfectly f

Re: best way (performance wise) to search for field without value?

2020-11-13 Thread Michael McCandless
That's great Rob! Thanks for bringing closure. Mike McCandless http://blog.mikemccandless.com On Fri, Nov 13, 2020 at 9:13 AM Rob Audenaerde wrote: > To follow up, based on a quick JMH-test with 2M docs with some random data > I see a speedup of 70% :) > That is a nice friday-afternoon gift,

Fwd: best way (performance wise) to search for field without value?

2020-11-13 Thread Rob Audenaerde
To follow up, based on a quick JMH-test with 2M docs with some random data I see a speedup of 70% :) That is a nice Friday-afternoon gift, thanks! For ppl that are interested: I added a BinaryDocValues field like this: doc.add(new BinaryDocValuesField("GROUPS_ALLOWED_EMPTY", new BytesRef(0x01;

Re: best way (performance wise) to search for field without value?

2020-11-13 Thread Michael McCandless
Maybe NormsFieldExistsQuery as a MUST_NOT clause? Though, you must enable norms on your field to use that. TermRangeQuery is indeed a horribly costly way to execute this, but if you cache the result on each refresh, perhaps it is OK? You could also index a dedicated doc values field indicating t
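The idea of a dedicated marker for documents without a value can be illustrated with a toy in-memory model. This sketch does not use the Lucene API: the class name is hypothetical, and a `BitSet` stands in for a per-document doc values column. The point it shows is that the "empty" decision is made once at index time, so the query only reads a per-document flag instead of enumerating every term of the field.

```java
import java.util.BitSet;
import java.util.List;

final class EmptyFieldMarker {
    // One bit per document: set when the document was indexed with no allowed groups.
    private final BitSet groupsAllowedEmpty = new BitSet();
    private int nextDocId = 0;

    /** Index-time step: register a document and record whether its group list is empty. */
    int addDocument(List<String> groupsAllowed) {
        int docId = nextDocId++;
        if (groupsAllowed.isEmpty()) {
            groupsAllowedEmpty.set(docId);
        }
        return docId;
    }

    /** Query-time step: return the ids of documents without groups by reading the marker column. */
    BitSet docsWithoutGroups() {
        return (BitSet) groupsAllowedEmpty.clone();
    }
}
```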

best way (performance wise) to search for field without value?

2020-11-13 Thread Rob Audenaerde
Hi all, We have implemented some security on our index by adding a field 'groups_allowed' to documents, and wrap a boolean must query around the original query, that checks if one of the given user-groups matches at least one groups_allowed. We chose to leave the groups_allowed field empty when t

Re: https://issues.apache.org/jira/browse/LUCENE-8448

2020-11-13 Thread Erick Erickson
What does “final finished sizes” mean? After optimize or just after finishing all indexing? The former is what counts here. And you provided no information on the number of deleted docs in the two cases. Is the number of deletedDocs the same (or close)? And does the q=*:* query return the same