Re: What exactly returns IndexReader.numDeletedDocs()

2022-12-08 Thread András Péteri
IIRC, it's the number of documents marked with a "deleted" bit. They are purged during merges, because segments written by a merge operation no longer include deleted documents. So if you call forceMerge(1), for example, no previous segment is preserved and the deleted count drops to 0 as a result.

Re: Migration from Lucene 5.5 to 8.11.1

2022-01-13 Thread András Péteri
It looks like Sascha runs IndexUpgrader for all major versions, i.e. 6.6.6, 7.7.3 and 8.11.1. File "segments_91" is written by the 7.7.3 run immediately before the error. On Wed, Jan 12, 2022 at 3:44 PM Adrien Grand wrote: > The log says what the problem is: version 8.11.1 cannot read indices > c

Re: how to find out each score contribution from booleanquery components

2019-06-27 Thread András Péteri
Hi Baris, Explanation's output is hierarchical, and the leading "0.0" values you are seeing are the individual contributions of each boolean clause or any other nested query. Going from bottom to top: Term query on countryDFLT = 'states', but no term matched this value --> score is 0.0 for the t

Re: SQL OR in lucene : where ((term1=a and term2=b) OR (term3=a and term4=b)) and context in (2,3,4,5.....200)

2018-08-24 Thread András Péteri
> > But it can be workable, if I manage to apply context condition > separately. > > > > > > More probably using custom filtering through Collector interface > https://lucene.apache.org/core/7_3_1/core/org/apache/lucene/ > search/Collector.html. > > > > > > Any idea please. > > > > > > Regards, > > Khurram > > > > -- > Tomoko Uchida > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > -- András Péteri

Re: Lucene same search result for worlds with and without spaces

2018-06-20 Thread András Péteri
An n-gram tokenizer/filter might also work for you: http://lucene.apache.org/core/7_3_1/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenizer.html Regards, András On Wed, Jun 20, 2018 at 11:53 AM, Markus Jelsma wrote: > Hi Egorlex, > > Set the tokenSeparator to "" and ShingleFilter w
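To illustrate why character n-grams can help with the "with and without spaces" case, here is a minimal sketch in plain Java. It is not Lucene's NGramTokenizer; the class name `CharNGrams` and the helper `ngrams` are made up for this example, and the sample strings are arbitrary. It only shows that two spellings of the same text share most of their grams.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, NOT Lucene's NGramTokenizer: plain character n-grams.
public class CharNGrams {

    /** Returns every substring of length minGram..maxGram, ordered by start offset. */
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int n = minGram; n <= maxGram && start + n <= text.length(); n++) {
                out.add(text.substring(start, start + n));
            }
        }
        return out;
    }

    /** Grams that occur in both inputs, in the order they appear in the first one. */
    static Set<String> sharedGrams(String a, String b, int n) {
        Set<String> shared = new LinkedHashSet<>(ngrams(a, n, n));
        shared.retainAll(ngrams(b, n, n));
        return shared;
    }

    public static void main(String[] args) {
        // Arbitrary sample inputs: the same text with and without a space.
        System.out.println(ngrams("sky walker", 3, 3));
        System.out.println(ngrams("skywalker", 3, 3));
        System.out.println(sharedGrams("sky walker", "skywalker", 3));
    }
}
```

Grams that span the space differ between the two inputs, so this gives partial overlap rather than identical token streams; whether that is good enough depends on how the query is scored.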

Re: Encryption At Rest - Using CustomAnalyzer

2018-02-06 Thread András Péteri
Hi Avarinth, There is an open issue to encrypt index files using AES, don't know if that would fit your requirements: https://issues.apache.org/jira/browse/LUCENE-2228 Regards, András On Tue, Feb 6, 2018 at 8:32 AM, Michael Wilkowski wrote: > Hi, > sorry to say that, but your encryption is not

Re: Maintaining sorting order (stored fields vs DocValue fields) while upgrading Lucene version

2017-07-02 Thread András Péteri
Hi, Note that if you are using Lucene directly, 5.x introduced LUCENE-6064 [1] [2], which adds checks to ensure that the sort field has a corresponding DocValue of the expected type. Indexed fields can only be used for sorting via an UninvertingReader, at a cost of increased heap usage [3]. Solr h

Re: Non-index files under the search directory

2016-11-24 Thread András Péteri
ess the solution > should be explicitly use getCommitData for each sub-index, then set it into > new consolidated search database, right? > > Best, > > --Xiaolong > > > On Tue, Nov 22, 2016 at 12:10 PM, András Péteri > wrote: > >> Hi Xiaolong, >> >> A Map o

Re: Non-index files under the search directory

2016-11-22 Thread András Péteri
> I am wondering does indexwriter can also merge this non-index file while >> it >> > merging multiple search index? >> > >> > And if I am stepping back a little bit, what's is the best way t

Re: Are "position" and "position increment" actually the exact same concept?

2016-02-08 Thread András Péteri
ter all? > > > > TX > > > > > -- András Péteri

Re: Quiz question: Which Character.isSpaceChar but not isWhitespace?

2015-11-01 Thread András Péteri
rch. It’s caused all sorts of head-scratching > till we discovered what’s going on. > > Craziness. > > ~ David > -- > Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker > LinkedIn: http://linkedin.com/in/davidwsmiley | Book: > http://www.solrenterprisesearchserver.com > -- András Péteri
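The question in this thread's subject can be checked directly against the JDK. The sketch below scans every code point and lists those for which Character.isSpaceChar is true but Character.isWhitespace is false; the class and method names are made up for the example. The Character.isWhitespace javadoc excludes the non-breaking spaces U+00A0, U+2007 and U+202F, so those are the ones to expect in the output.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative scan over all Unicode code points; names are hypothetical.
public class SpaceCharQuiz {

    /** Code points that are "space characters" but not Java whitespace. */
    static List<Integer> spaceCharButNotWhitespace() {
        List<Integer> result = new ArrayList<>();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.isSpaceChar(cp) && !Character.isWhitespace(cp)) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (int cp : spaceCharButNotWhitespace()) {
            // Character.getName returns the Unicode name of the code point.
            System.out.printf("U+%04X %s%n", cp, Character.getName(cp));
        }
    }
}
```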

Re: ConjunctionScorer access

2015-10-22 Thread András Péteri
l.com] > > >>> Sent: Wednesday, October 21, 2015 7:03 PM > > >>> To: java-user@lucene.apache.org > > >>> Subject: ConjunctionScorer access > > >>> > > >>> It's a bummer Lucene makes the constructor of ConjunctionScorer non- > > >>> public. I wanted to extend from this class in order to tweak its > > >> behavior for > > >>> my use case. Is it possible to change it to protected in future > > releases > > >> ? > > >> > > >> > > > > > -- András Péteri

Re: IndexWriter is not closing the FDs (deleted files)

2015-09-01 Thread András Péteri
Hi Napoli, You could also create an instance of SearcherManager [1], and let it take care of tracking IndexSearchers; it can also be used to reopen the underlying readers, and close them when they are no longer in use. Calling maybeRefresh() or maybeRefreshBlocking() on the manager ensures that a r

Re: Mapping doc values back to doc ID (in decent time)

2015-08-09 Thread András Péteri
If I understand it correctly, the Zoie library [1][2] implements the "sledgehammer" approach by collecting docValues for all documents when a segment reader is opened. If you have some RAM to throw at the problem, this could indeed bring you an acceptable level of performance. [1] http://senseidb.

Re: ignore score and weight in lucene search

2015-07-30 Thread András Péteri
Collector's javadoc in Lucene 4.x includes a bare minimum example which only registers matching documents in a bitset: https://github.com/apache/lucene-solr/blob/lucene_solr_4_10_4/lucene/core/src/java/org/apache/lucene/search/Collector.java#L85 You'll have to adapt this if you want to use it in L

Re: Lucene 5: Wrapping Collector

2015-06-29 Thread András Péteri
Hi, IndexSearcher.search(Query, Collector) will iterate through all segments of the index, call getLeafCollector, and use the returned LeafCollector to collect result documents from that segment [1]. As LeafCollector's javadoc describes [2], there are cases when you want to take into account prec
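The per-segment flow described above can be modelled without Lucene on the classpath. In the sketch below every type (`Segment`, `CollectorSketch`, `LeafCollectorSketch`) is a hypothetical stand-in, not a Lucene class, and it assumes that hits are reported with segment-relative IDs which the collector rebases using the segment's doc base.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of per-segment collection; all types are stand-ins, not Lucene API.
public class PerSegmentCollectDemo {

    /** Stand-in for one segment: its doc base and the segment-relative IDs that match. */
    static final class Segment {
        final int docBase;
        final int[] matchingDocs;

        Segment(int docBase, int... matchingDocs) {
            this.docBase = docBase;
            this.matchingDocs = matchingDocs;
        }
    }

    interface LeafCollectorSketch {
        void collect(int doc);
    }

    interface CollectorSketch {
        LeafCollectorSketch getLeafCollector(Segment segment);
    }

    /** Visits each segment, asks for a leaf collector, and feeds it that segment's hits. */
    static void search(List<Segment> segments, CollectorSketch collector) {
        for (Segment segment : segments) {
            LeafCollectorSketch leaf = collector.getLeafCollector(segment);
            for (int doc : segment.matchingDocs) {
                leaf.collect(doc);
            }
        }
    }

    /** Example collector: converts segment-relative IDs to index-wide IDs. */
    static List<Integer> collectGlobalIds(List<Segment> segments) {
        List<Integer> hits = new ArrayList<>();
        search(segments, segment -> doc -> hits.add(segment.docBase + doc));
        return hits;
    }

    public static void main(String[] args) {
        List<Segment> segments = Arrays.asList(new Segment(0, 0, 2), new Segment(10, 1));
        System.out.println(collectGlobalIds(segments));
    }
}
```

A wrapping collector in this model would return a leaf collector that delegates to the wrapped one for the same segment, which is why the wrapper has to be created per segment rather than once per search.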

Re: BytesRef violates the principle of least astonishment

2015-05-20 Thread András Péteri
As Olivier wrote, multiple BytesRef instances can share the underlying byte array when representing slices of existing data, for performance reasons. BytesRef#clone()'s javadoc comment says that the result will be a shallow clone, sharing the backing array with the original instance, and points to
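The sharing behaviour can be reproduced with a few lines of plain Java. `Slice` below is a hypothetical stand-in for a (bytes, offset, length) triple, not Lucene's BytesRef, and `shallowCopy`/`deepCopy` are names chosen for this sketch only. It shows why a shallow copy observes later writes to the backing array while a deep copy does not.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Toy model of a byte slice that shares its backing array; not Lucene's BytesRef.
public class SliceSharingDemo {

    static final class Slice {
        final byte[] bytes;
        final int offset;
        final int length;

        Slice(byte[] bytes, int offset, int length) {
            this.bytes = bytes;
            this.offset = offset;
            this.length = length;
        }

        /** New object, same backing array. */
        Slice shallowCopy() {
            return new Slice(bytes, offset, length);
        }

        /** New object with its own copy of just the referenced range. */
        Slice deepCopy() {
            return new Slice(Arrays.copyOfRange(bytes, offset, offset + length), 0, length);
        }

        String utf8() {
            return new String(bytes, offset, length, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        byte[] buffer = "lucene".getBytes(StandardCharsets.UTF_8);
        Slice original = new Slice(buffer, 0, 3);
        Slice shallow = original.shallowCopy();
        Slice deep = original.deepCopy();

        buffer[0] = 'L'; // the buffer is reused or overwritten later

        System.out.println(shallow.utf8()); // reflects the write to the shared array
        System.out.println(deep.utf8());    // keeps the bytes as they were when copied
    }
}
```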

Re: understanding the norm encode and decode

2015-03-05 Thread András Péteri
Sorry, I also got it wrong in the previous message. :) It goes 0.89f -> 123 -> 0.875f. On Thu, Mar 5, 2015 at 10:08 AM, András Péteri wrote: > Hi Andrew, > > If you are using Lucene 3.6.1, you can take a look at the method which > creates a single byte value out of the receiv

Re: understanding the norm encode and decode

2015-03-05 Thread András Péteri
Hi Andrew, If you are using Lucene 3.6.1, you can take a look at the method which creates a single byte value out of the received float using bit manipulation at [1]. There is also a 256-element decoder table in Similarity, where each byte corresponds to a decoded float value computed by [2]. The
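The round trip from this thread (0.89f -> 123 -> 0.875f) can be reproduced with a small stand-alone sketch. The bit manipulation below is a re-implementation of what I believe the 3-bit-mantissa, zero-exponent-15 encoding looks like; it is written from memory as an assumption, not copied from the Lucene source, and the class and method names are made up for this example.

```java
// Stand-alone sketch of a float -> byte -> float norm encoding.
// Assumption: 3 mantissa bits, 5 exponent bits, zero exponent point 15.
public class NormByteDemo {

    private static final int ZERO_POINT = (63 - 15) << 3;

    /** Packs a float into one byte, dropping all but 3 mantissa bits. */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        if (small <= ZERO_POINT) {
            // Underflow: zero and negatives map to 0, tiny positives to 1.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ZERO_POINT + 0x100) {
            return (byte) -1; // overflow: largest representable value
        }
        return (byte) (small - ZERO_POINT);
    }

    /** Expands one byte back into the float it represents. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    /** 256-element lookup table, one decoded float per possible byte value. */
    static float[] decoderTable() {
        float[] table = new float[256];
        for (int i = 0; i < 256; i++) {
            table[i] = decode((byte) i);
        }
        return table;
    }

    public static void main(String[] args) {
        byte encoded = encode(0.89f);
        System.out.println(encoded & 0xff);                 // byte value as unsigned int
        System.out.println(decoderTable()[encoded & 0xff]); // decoded float
    }
}
```

Because only 3 mantissa bits survive, every float between two adjacent table entries collapses to the lower one, which is why 0.89f comes back as 0.875f.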

Throwing CollectionTerminatedException from Collector.getLeafCollector

2015-03-02 Thread András Péteri
Hi, According to IndexSearcher's code [1], if a Collector implementation is not interested in collecting document hits from a particular leaf reader, it can also throw CollectionTerminatedException from Collector.getLeafCollector(LeafReaderContext). This option is however not described in Collecto

Re: Lucene 4.x -> 5 : IllegalStateException while sorting

2015-02-23 Thread András Péteri
Hi Clemens, I think this part of the release notes [1] applies to your case: * FieldCache is gone (moved to a dedicated UninvertingReader in the misc module). This means when you intend to sort on a field, you should index that field using doc values, which is much faster and less heap consuming

Re: Query nested document

2014-10-20 Thread András Péteri
Hello Aurélien, I believe the approach you described is what Elasticsearch is taking with nested documents, in addition to indexing parent and child documents in a single block. See the "sidebar" at the bottom of [1] and the sections labeled "nested" of [2] for more details. Michael's blog post o

Merge policy for branching data model

2014-01-05 Thread András Péteri
Hello, Our application uses Lucene to index documents received from a back-end that supports storage of temporal data with branches, similar to revision control systems like SVN: when looking at a single object, one can choose to either retrieve the current state, go back to a previous point in ti