Re: Temporary vector file during merging

2025-06-27 Thread Adrien Grand
Mike, I believe that the answer to your question is in this PR review comment: https://github.com/apache/lucene/pull/601#discussion_r783711025. Merging is currently implemented by looping over fields once, and merging them. Writing the vec file first would require merging flat vectors for all fiel

Re: Perf comparision for MMapDirectory Vs NIOFSDirectory

2025-05-21 Thread Adrien Grand
Hello Ashwini, MMapDirectory will often perform a bit faster. While NIOFSDirectory needs to first copy data from the buffer cache to heap arrays, MMapDirectory can read directly from the buffer cache. Lucene's benchmark suite allows comparing these two directories. I haven't done so recently, but

Re: Question about ImpactsDISI for boolean queries

2025-04-21 Thread Adrien Grand
You are on the right track. It's easier to skip by score when there is a single scoring clause than when the score is the sum of the scores of two clauses. Well, actually in this case two clauses are not much harder since one of the clauses gives the same score to all documents, but the conjunctio

Re: Query on SoftUpdateDocument API

2025-02-21 Thread Adrien Grand
Hi Abhishek, Actually softUpdate is about doing an update where the deletion is performed via a soft delete rather than a hard delete. To perform doc-value updates, you need to use the updateNumericDocValue or updateBinaryDocValue APIs. Note that it doesn't actually update in-place, it needs to

Re: Looking for resources to understand query cost/complexity

2025-02-21 Thread Adrien Grand
This depends on many factors, but in my experience these two are good starting points: - Total number of matching docs of the query. - Number of segments times number of terms being looked up. This is a simplified model, some queries incur their own costs, e.g. phrase queries bottleneck on evalu

Re: How to retrieve vectors from the IndexReader

2025-02-10 Thread Adrien Grand
Hi Viliam, Your logic is mostly correct, here is a version that should be a bit simpler and correct (but beware, untested): IndexReader reader; // your multi-reader int docID; // top-level doc ID int readerID = ReaderUtil.subIndex(docID, reader.leaves()); LeafReaderContext leafContext = reader.le
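
The lookup that ReaderUtil.subIndex performs can be illustrated without Lucene on the classpath. This is a minimal sketch, assuming each leaf is described only by its doc base (the first top-level doc ID it covers) and that doc bases are strictly increasing and start at 0; `LeafLookupSketch` and its methods are hypothetical stand-ins, not Lucene's implementation.

```java
import java.util.Arrays;

final class LeafLookupSketch {
    /** Returns the index of the leaf that contains the given top-level doc ID. */
    static int subIndex(int docId, int[] docBases) {
        int idx = Arrays.binarySearch(docBases, docId);
        if (idx >= 0) {
            return idx; // docId is exactly the first doc of leaf idx
        }
        // binarySearch returned -(insertionPoint) - 1; the owning leaf is the one before it
        return -idx - 2;
    }

    /** Converts a top-level doc ID into a doc ID local to its leaf. */
    static int toLeafDocId(int docId, int[] docBases) {
        return docId - docBases[subIndex(docId, docBases)];
    }
}
```

With doc bases {0, 10, 25}, top-level doc 12 lives in the second leaf (index 1) as local doc 2. In real code the doc base comes from LeafReaderContext#docBase.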

Re: IndexFormatTooOldException

2024-12-19 Thread Adrien Grand
Hi Ian, Indeed Lucene has been maintaining read-only support for 8.x indices lately, see this method which lets you opt in for this: https://lucene.apache.org/core/10_0_0/core/org/apache/lucene/index/DirectoryReader.html#open(org.apache.lucene.index.IndexCommit,int,java.util.Comparator) . So if y

Re: Lucene Query Metrics

2024-12-03 Thread Adrien Grand
Lucene doesn't expose query metrics, it's up to the application that integrates Lucene to compute and expose metrics that are relevant to them. On Wed, Dec 4, 2024, 00:31, ashwini singh wrote: > Hey everyone, > > Does lucene provide any query metrics (perf) ? I am looking for something > very

Re: How to find RAM/disk usage of each vector field

2024-11-05 Thread Adrien Grand
I cannot think of good ways to do this. Why is it important to break down per field as opposed to scaling based on the total volume of vector data? On Tue, Nov 5, 2024 at 10:58 PM Tanmay Goel wrote: > Hi Rui > > Thanks for your response and the snippet that you shared is great but not > exactly

Re: Indexing multiple numeric ranges

2024-11-05 Thread Adrien Grand
Hello Siraj, You can do this by creating a Lucene document that has 3 org.apache.lucene.document.IntRange fields in it, one for each of the ranges that you would like to index. Lucene will then match the document if any of the ranges matches. On Tue, Nov 5, 2024 at 5:16 PM Siraj Haider wrote: >

Re: MaxScoreBulkScorer increased latency for a extreme test case (many SHOULD and each SHOULD clause matches all docs)

2024-10-14 Thread Adrien Grand
ht, the Max Conjunction scorer shows up in the flamegraph for >> 12 MUST clauses: >> https://htmlpreview.github.io/?https://github.com/wurui90/scratch/blob/main/flamegraphs/exists-with-limit-200-lucene911-must.html >> >> >> Thanks! >> >> On Fri, Sep 20, 2024

Re: MaxScoreBulkScorer increased latency for a extreme test case (many SHOULD and each SHOULD clause matches all docs)

2024-09-20 Thread Adrien Grand
> changed to 12 MUST clauses, the problem is the same: it collects 3.6M docs > on Lucene911 but 1001 docs on Lucene97. Does this data point align with how > MaxScoreBulkScorer works? > > Thanks! > > On Wed, Sep 18, 2024 at 1:51 AM Adrien Grand wrote: > >> Than

Re: MaxScoreBulkScorer increased latency for a extreme test case (many SHOULD and each SHOULD clause matches all docs)

2024-09-18 Thread Adrien Grand
Rui Wu wrote: > >> This query latency increased from 14.65 to 20.90ms. >> >> We use the `TopScoreDocCollector.createSharedManager(/*batchSize*/ 101, >> /*searchAfterFieldDoc*/ null, /*hitsThreshold*/ 1000); ` >> >> Thanks a lot! >> >> On Tue, Sep 17

Re: MaxScoreBulkScorer increased latency for a extreme test case (many SHOULD and each SHOULD clause matches all docs)

2024-09-17 Thread Adrien Grand
> > Thanks for looking into this! Here are more screenshots of the flamegraph. > The original flamegraph HTMLs have stack traces from our app so I don't > share it here. > [image: Screenshot 2024-09-17 at 1.13.07 AM.png][image: Screenshot > 2024-09-17 at 1.12.01 AM.png] > > O

Re: MaxScoreBulkScorer increased latency for a extreme test case (many SHOULD and each SHOULD clause matches all docs)

2024-09-17 Thread Adrien Grand
Hello Rui, We actually released a change that should make MaxScoreBulkScorer faster on dense disjunctions in 9.8: https://github.com/apache/lucene/pull/12444. Your benchmark case is quite specific though as all clauses match all docs and produce constant scores, so I would expect the scorer to qui

Re: KnnQueries and result discrepancy between indexes with the same data

2024-09-12 Thread Adrien Grand
Indeed, the load order can influence Lucene's approximate nearest neighbor search results. If your two indexes load data sequentially and in the same order, then I believe that you would get the same results. But we consider this an implementation detail rather than a guarantee that Lucene should

Re: Question about the performance of Lucene99PostingsFormat

2024-09-10 Thread Adrien Grand
Can you clarify what you refer to by match-all and match-many queries? Lucene's MatchAllDocsQuery should not be impacted since it doesn't use postings for evaluation. Since FOR is a bit less space-efficient than PFOR, I guess it could be a bit slower if your Directory abstraction was a bit slow at

Re: Slow HNSW creation times.

2024-04-28 Thread Adrien Grand
Hello Kannan, The fact that adding 10k docs to an empty HNSW graph is faster than adding 10k docs to a large HNSW graph sounds expected to me, but the 120x factor that you are reporting sounds high. Maybe your dataset is larger than the size of your page cache, forcing your OS to read vectors from

Re: Indexing time increase moving from Lucene 8 to 9

2024-04-17 Thread Adrien Grand
Hi Marc, Nothing jumps to mind as a potential cause for this 2x regression. It would be interesting to look at a profile. On Wed, Apr 17, 2024 at 9:32 PM Marc Davenport wrote: > Hello, > I'm finally migrating Lucene from 8.11.2 to 9.10.0 as our overall build can > now support Java 11. The quick

Re: Query Optimization in search/searchAfter

2024-04-12 Thread Adrien Grand
to filter out > documents but I specifically was talking about the query rewriting phase. > Is the query rewritten differently in search vs searchAfter? Looking at the > code I think no but would just like to confirm if there are any edge cases > here. > > On Fri, Apr 12, 2024 at

Re: Query Optimization in search/searchAfter

2024-04-12 Thread Adrien Grand
Hello Puneeth, When you pass an `after` doc, Lucene will filter out documents that compare better than this `after` document if it can. See e.g. what LongComparator does with its `topValue`, which is the value of the `after` doc. On Thu, Apr 11, 2024 at 4:34 PM Puneeth Bikkumanla wrote: > Hello

Re: Support of RRF (Reciprocal Rank Fusion) by Lucene?

2024-03-26 Thread Adrien Grand
iscuss in more detail > > https://github.com/apache/lucene/issues > > Thanks > > Michael > > On 26.03.24 at 14:56, Adrien Grand wrote: > > Hey Michael, > > > > I agree that it would be a nice addition. Plus it should be pretty easy > to > > impl

Re: Support of RRF (Reciprocal Rank Fusion) by Lucene?

2024-03-26 Thread Adrien Grand
Hey Michael, I agree that it would be a nice addition. Plus it should be pretty easy to implement. This sounds like a good fit for a utility method on the TopDocs class? On Tue, Mar 26, 2024 at 2:54 PM Michael Wechner wrote: > Hi > > IIUC Lucene does not contain a RRF implementation, for exampl
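
To show why this is considered easy to implement, here is a minimal sketch of Reciprocal Rank Fusion in plain Java. It assumes the usual RRF formula, where each document scores the sum of 1 / (k + rank) over the rankings it appears in, with ranks starting at 1. The class name `RrfSketch`, the use of string IDs, and the tie-break on ID are assumptions made for illustration; this is not an existing Lucene API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RrfSketch {
    /** Fuses several rankings (best hit first) into one list ordered by RRF score. */
    static List<String> fuse(List<List<String>> rankings, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                int rank = i + 1; // ranks are 1-based
                scores.merge(ranking.get(i), 1.0 / (k + rank), Double::sum);
            }
        }
        List<String> fused = new ArrayList<>(scores.keySet());
        // Highest fused score first; ties broken by ID to keep the output deterministic
        fused.sort(Comparator.comparingDouble((String id) -> -scores.get(id))
                .thenComparing(Comparator.naturalOrder()));
        return fused;
    }
}
```

A value of k = 60 is a common choice in the RRF literature; it is a tuning parameter, not something the thread specifies.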

[ANNOUNCE] Apache Lucene 9.10.0 released

2024-02-20 Thread Adrien Grand
The Lucene PMC is pleased to announce the release of Apache Lucene 9.10. Apache Lucene is a high-performance, full-featured search engine library written entirely in Java. It is a technology suitable for nearly any application that requires structured search, full-text search, faceting, nearest-ne

Re: Old codecs may only be used for reading

2024-01-11 Thread Adrien Grand
Hey Michael. Your understanding is correct. On Thu, Jan 11, 2024 at 10:46 AM Michael Wechner wrote: > Hi > > I recently upgraded from Lucene 9.8.0 to Lucene 9.9.1 and noticed that > Lucene95Codec got moved to > > org.apache.lucene.backward_codecs.lucene95.Lucene95Codec > > When testing my code I

Re: Assertion error with NumericDocValues.advanceExact

2024-01-01 Thread Adrien Grand
Hello, Can you check if you are running advanceExact on decreasing doc IDs or on doc IDs that are outside of the valid range [0, maxDoc)? If you have Lucene's test framework on your classpath, these checks can be added automatically by using AssertingIndexSearcher instead of IndexSearcher to run q

Re: migrate index from 6 to 9

2023-12-18 Thread Adrien Grand
Hi Vincent, Unfortunately, your assumption is incorrect, Lucene 9 is not able to search Lucene 6 indexes as Lucene only keeps read access to indexes created by the current (9) or previous major version (8). You will need to reindex your 6.x index with Lucene 8 or 9 (preferred) to be able to search

Re: When to use StringField and when to use FacetField for categorization?

2023-10-20 Thread Adrien Grand
FYI there is also KeywordField, which combines StringField and SortedSetDocValuesField. It supports filtering, sorting, faceting and retrieval. It's my go-to field for string values. On Fri, Oct 20, 2023, 12:20, Michael McCandless wrote: > There are some differences. > > StringField is indexe

Re: Exception from the codec layer during indexing

2023-09-28 Thread Adrien Grand
Hi Rahul, This exception complains that IndexingChain did not deduplicate terms as expected. I don't recall seeing this exception before (which doesn't mean it's not a real bug). What JVM are you running? Does this exception frequently occur or was it a one-off? On Thu, Sep 28, 2023 at 4:49 PM

Re: forceMerge(1) leads to ~10% perf gains

2023-09-22 Thread Adrien Grand
> Was wondering - are there any other techniques which can be used to speed up that work well when forceMerge works like this? Lucene 9.8 (to be released in a few days hopefully) will add support for recursive graph bisection, which is another thing that can be used to speed up querying on read-onl

Re: Relative cpu cost of fetching term frequency during scoring

2023-06-26 Thread Adrien Grand
ast 5x compared to old code. > Is there any thoughts on why term frequency calls on PostingsEnum are that > slow ? > > > > *Thanks and Regards,* > *Vimal Jain* > > > On Wed, Jun 21, 2023 at 1:43 PM Adrien Grand wrote: > > > As far as your performance problem i

[ANNOUNCE] Apache Lucene 9.7.0 released

2023-06-26 Thread Adrien Grand
The Lucene PMC is pleased to announce the release of Apache Lucene 9.7.0. Apache Lucene is a high-performance, full-featured search engine library written entirely in Java. It is a technology suitable for nearly any application that requires structured search, full-text search, faceting, nearest-n

Re: Relative cpu cost of fetching term frequency during scoring

2023-06-21 Thread Adrien Grand
. Looks like this > Scorer#getMaxScore was added in lucene 8.0 , i am using 7.7.3. > A side question , is there any resource to help migrate newer major version > , i see lot of api changed from v7 to v8. > > *Thanks and Regards,* > *Vimal Jain* > > > On Wed, Jun

Re: Relative cpu cost of fetching term frequency during scoring

2023-06-20 Thread Adrien Grand
rm query over merged field. > Can you please provide more details on what do you mean by dynamic pruning > in context of custom term query ? > > On Tue, 20 Jun, 2023, 9:45 pm Adrien Grand, wrote: > > > Intuitively replacing a disjunction across multiple fields with a sing

Re: Relative cpu cost of fetching term frequency during scoring

2023-06-20 Thread Adrien Grand
earlier implementation ( with > multiple term queries ). > > > *Thanks and Regards,* > *Vimal Jain* > > > On Tue, Jun 20, 2023 at 1:01 PM Adrien Grand wrote: > > > You say you observed a performance drop, what are you comparing against? > > > > L

Re: Relative cpu cost of fetching term frequency during scoring

2023-06-20 Thread Adrien Grand
You say you observed a performance drop, what are you comparing against? On Tue, Jun 20, 2023, 08:59, Vimal Jain wrote: > Note - i am using lucene 7.7.3 > > *Thanks and Regards,* > *Vimal Jain* > > > On Tue, Jun 20, 2023 at 12:26 PM Vimal Jain wrote: > > > Hi, > > I want to understand if fet

Re: Performance regression in getting doc by id in Lucene 8 vs Lucene 7

2023-06-07 Thread Adrien Grand
to large indexes this would be a legit regression, no? > > - Rahul > > On Tue, Jun 6, 2023 at 10:09 AM Adrien Grand wrote: > > > Yes, this changed in 8.x: > > - 8.0 moved the terms index off-heap for non-PK fields with > > MMapDirectory. https://github.com/apache/l

Re: Performance regression in getting doc by id in Lucene 8 vs Lucene 7

2023-06-06 Thread Adrien Grand
indows. I understand it is because of the Java bug which synchronizes > internally in the native call for NIOFs. > > -Rahul > > On Tue, Jun 6, 2023 at 9:32 AM Adrien Grand wrote: > > > +Alan Woodward helped me better understand what is going on here. > > BufferedIndexIn

Re: Performance regression in getting doc by id in Lucene 8 vs Lucene 7

2023-06-06 Thread Adrien Grand
just before what the buffer contains. On Tue, Jun 6, 2023 at 2:07 PM Adrien Grand wrote: > > My best guess based on your description of the issue is that > SimpleFSDirectory doesn't like the fact that the terms index now reads > data directly from the directory instead of load

Re: Performance regression in getting doc by id in Lucene 8 vs Lucene 7

2023-06-06 Thread Adrien Grand
My best guess based on your description of the issue is that SimpleFSDirectory doesn't like the fact that the terms index now reads data directly from the directory instead of loading the terms index in heap. Would you be able to run the same benchmark with MMapDirectory to check if it addresses th

Re: Mix of lucene50 and lucene70 codes

2023-04-08 Thread Adrien Grand
Hi, This is normal. Lucene usually names codecs and file formats after the first version that they were introduced in. But not all file formats change on every version, and the Lucene 7.7.3 default postings format was called Lucene50. On Sat, Apr 8, 2023 at 4:17 PM Vimal Jain wrote: > > Hi Guys,

Re: Change score with distance SortField

2023-02-06 Thread Adrien Grand
Hi Michal, The best way to do this would be to put a LatLonPoint#newDistanceFeatureQuery in a SHOULD clause. It's not as flexible as leveraging expressions, but it has the benefit of not disabling dynamic pruning. On Mon, Feb 6, 2023 at 10:33 AM Michal Hlavac wrote: > > Hi, > I would like to inf
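
For intuition about what such a SHOULD clause contributes, the javadocs of LatLonPoint#newDistanceFeatureQuery describe the score as weight * pivotDistanceMeters / (pivotDistanceMeters + distance). The sketch below only reproduces that arithmetic in plain Java; `DistanceFeatureSketch` is a hypothetical helper, not part of Lucene.

```java
final class DistanceFeatureSketch {
    /**
     * Score contribution of a distance feature: equals weight at distance 0,
     * weight / 2 at the pivot distance, and tends to 0 as distance grows.
     */
    static double score(double weight, double pivotDistanceMeters, double distanceMeters) {
        return weight * pivotDistanceMeters / (pivotDistanceMeters + distanceMeters);
    }
}
```

Because the contribution is bounded by the weight and decreases monotonically with distance, Lucene can compute upper bounds on it, which is why dynamic pruning stays enabled.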

Re: Other vector similarity metric than provided by VectorSimilarityFunction

2023-01-14 Thread Adrien Grand
Hi Michael, You could create a custom KNN vectors format that ignores the vector similarity configured on the field and uses its own. On Sat, Jan 14, 2023, 21:33, Michael Wechner wrote: > Hi > > IIUC Lucene currently supports > > VectorSimilarityFunction.COSINE > VectorSimilarityFunction.DO

Re: The current default similarity implementation of Lucene is BM25, right?

2022-11-23 Thread Adrien Grand
This is correct. See IndexSearcher#getDefaultSimilarity(). On Wed, Nov 23, 2022 at 10:53 AM Michael Wechner wrote: > > Hi > > On the Lucene FAQ there is no mentioning re tf-idf or bm25 and I would > like to add some notes, but to be sure I don't write anything wrong I > would like to ask > > whet

[ANNOUNCE] Apache Lucene 9.4.2 released

2022-11-23 Thread Adrien Grand
The Lucene PMC is pleased to announce the release of Apache Lucene 9.4.2 Apache Lucene is a high-performance, full-featured search engine library written entirely in Java. It is a technology suitable for nearly any application that requires structured search, full-text search, faceting, nearest-ne

Re: Sort by numeric field, order missing values before anything else

2022-11-21 Thread Adrien Grand
mail, keep in mind: When > you sort against the raw bytes (using NumericUtils) with SORTED_SET > docvalues type, there is a large overhead on indexing and sorting > performance, especially for the case where you have many different > values in your index (which is likely for num

Re: Sort by numeric field, order missing values before anything else

2022-11-16 Thread Adrien Grand
Hi Petko, Lucene's comparators for numerics have this limitation indeed. We haven't got many questions around that in the past, which I would guess is due to the fact that most numeric fields do not use the entire long range, specifically Long.MIN_VALUE and Long.MAX_VALUE, so using either of these

Re: Learning Lucene from ground up

2022-11-07 Thread Adrien Grand
+1 to MyCoy's suggestion. To answer your most immediate questions: - Lucene mostly loads metadata in memory at the time of opening a segment (dvm, tmd, fdm, vem, nvm, kdm files), other files are memory-mapped and Lucene relies on the filesystem cache to have their data efficiently available. This

Re: Efficient sort on SortedDocValues

2022-11-07 Thread Adrien Grand
Hi Andrei, The case that you are describing got optimized in Lucene 9.4.0 in the case when your field is also indexed with a StringField: https://github.com/apache/lucene/pull/1023. See annotation ER at http://people.apache.org/~mikemccand/lucenebench/TermMonthSort.html. The way it works is that

Re: Upgrading from 9.1.0. to 9.4.0: Old codecs may only be used for reading Lucene91HnswVectorsFormat.java

2022-10-01 Thread Adrien Grand
ould not forget again > during the next upgrade :-) > > Or what is the best practice re setting / handling the codec? > > Thanks > > Michael > > On 01.10.22 at 08:06, Adrien Grand wrote: > > I would guess that you are configuring your IndexWriterConfig with a >

Re: Upgrading from 9.1.0. to 9.4.0: Old codecs may only be used for reading Lucene91HnswVectorsFormat.java

2022-09-30 Thread Adrien Grand
I would guess that you are configuring your IndexWriterConfig with a "Lucene91Codec" instance. You need to replace it with a "Lucene94Codec" instance. On Sat, Oct 1, 2022, 06:12, Michael Wechner wrote: > Hi > > I have just upgraded from 9.1.0 to 9.4.0 and compiling works fine, but > when I ru

Re: Max Field Length

2022-09-23 Thread Adrien Grand
We have a TruncateTokenFilter in lucene/analysis/common. :) On Fri, Sep 23, 2022 at 4:39 PM Michael Sokolov wrote: > I wonder if it would make sense to provide a TruncationFilter in > addition to the LengthFilter. That way long tokens in source text > could be better supported, albeit with some

Re: Questions about Lucene source

2022-09-23 Thread Adrien Grand
On the 2nd question, we do not plan on leveraging this information to figure out the codec: the codec that should be used to read a segment is stored separately (also in segment infos). It is mostly useful for diagnostics purposes. E.g. if we see an interesting corruption case where checksums matc

Re: Max Field Length

2022-09-23 Thread Adrien Grand
Hi Scott, There is no way to lift this limit. The assumption is that a user would never type a 32kB keyword in a search bar, so indexing such long keywords is wasteful. Some tokenizers like StandardTokenizer can be configured to limit the length of the tokens that they produce, there is also a Len

Re: Lucene's LRU Query Cache - Deep Dive

2022-07-19 Thread Adrien Grand
0 > < > https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/core/src/java/org/apache/lucene/search/ScorerSupplier.java#L39-L40 > > > > > Regards, > Mohammad Sadiq > > > > On 11 Jul 2022, at 10:37, Adrien Grand wrote: >

Re: Lucene Disable scoring

2022-07-11 Thread Adrien Grand
Note that Lucene automatically disables scoring already when scores are not needed. E.g. queries that compute the top-k hits by score will definitely compute scores, but if you are just counting the number of matches of a query or aggregations, then Lucene skips scoring entirely already. Is there

Re: Lucene's LRU Query Cache - Deep Dive

2022-07-11 Thread Adrien Grand
Hey Shradha, This correctly describes the what, but I think it could add more color about why the cache behaves this way to be more useful, e.g. - Why doesn't the cache cache all queries? Lucene is relatively good at evaluating a subset of the matching documents, e.g. queries sorted by numeric fi

Re: Question about Benchmark

2022-05-16 Thread Adrien Grand
Hi Balmukund, What benchmark are you talking about? On Mon, May 16, 2022 at 4:35 PM balmukund mandal wrote: > > Hi All, > I was trying to run the benchmark and had a couple of questions. Indexing > takes a long time, so is there a way to configure the benchmark to use an > already existing index

Re: Index corruption and repair

2022-04-28 Thread Adrien Grand
Hi Anthony, This isn't something that you should try to fix programmatically, corruptions indicate that something is wrong with the environment, like a broken disk or corrupt RAM. I would suggest running a memtest to check your RAM and looking at system logs in case they have anything to tell abou

Re: How to propose a new feature

2022-04-01 Thread Adrien Grand
Just send an email with the problem that you want to solve and the approach that you are suggesting. On Fri, Apr 1, 2022 at 6:56 PM Baris Kazar wrote: > > Resent due to need for help. > Thanks > > From: Baris Kazar > Sent: Wednesday, March 30, 2022 2:30 PM > To: j

Re: TF in MoreLikeThis

2022-04-01 Thread Adrien Grand
From a quick look, your suggestion of passing the term frequency to TFIDFSimilarity#tf makes sense. Would you like to contribute this change? You can find contributing guidelines here: https://github.com/apache/lucene/blob/main/CONTRIBUTING.md. On Thu, Mar 31, 2022 at 11:46 PM Petko Minkov wrot

Re: Call for Presentations now open, ApacheCon North America 2022

2022-03-31 Thread Adrien Grand
Thanks Michael for helping spread the word about Lucene's new vector search capabilities! On Thu, Mar 31, 2022 at 7:36 AM Michael Wechner wrote: > > ok :-) thanks! > > Anyway, if somebody would like to join re a "vector search" proposal, > please let me know > > Michael > > Am 30.03.22 um 20:13 s

Re: Re: Custom scores and sort

2022-03-23 Thread Adrien Grand
nt > contains only one "only once score" field, > Lucene passes the CustomScoreProvider's customScore method twice, so the > score = 0 and it seems to me that this value is retained for the sort score. > > I did not find why a TopFieldDocs search (with Sort = SortFiel

Re: LongDistanceFeatureQuery for DoublePoint

2022-03-23 Thread Adrien Grand
Hi Puneeth, Doubles are always a bit more tricky due to rounding for arithmetic operations, but this should still be doable. Out of curiosity, what sort of data do your double fields store? This query had been added with the idea that it would be useful for timestamp fields in order to boost hits

[ANNOUNCE] Apache Lucene 9.1.0 released

2022-03-22 Thread Adrien Grand
The Lucene PMC is pleased to announce the release of Apache Lucene 9.1.0. Apache Lucene is a high-performance, full-featured search engine library written entirely in Java. It is a technology suitable for nearly any application that requires structured search, full-text search, faceting, nearest-n

Re: FacetsCollector ScoreMode

2022-03-21 Thread Adrien Grand
+1 to adjusting the ScoreMode based on keepScores. On Mon, Mar 21, 2022 at 5:47 PM Mike Drob wrote: > > Hey all, > > I was looking into some performance issues and was a little confused about > one aspect of FacetsCollector - why does it always specify > ScoreMode.COMPLETE? > > Especially for the

Re: Custom scores and sort

2022-03-14 Thread Adrien Grand
It's a bit hard for me to parse what you are trying to do, but it looks like you are making assumptions about how Lucene works internally that are not correct. Do I understand correctly that your scoring mechanism has dependencies on other documents, ie. the score of a document could depend on the

Re: DocValuesIterator: advance vs advanceExact

2022-02-03 Thread Adrien Grand
Hi Alexander, In general, advance(target) is best used to implement queries and advanceExact(target) for collectors. See javadocs for advanceExact(target), this method may only be called on doc IDs that are between 0 included and maxDoc excluded. On Thu, Feb 3, 2022 at 10:00 AM Alexander Buloich
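
The difference between the two calls can be sketched with a toy iterator over a sorted array of doc IDs that have a value. This is an illustration only: `SortedDocIdsSketch` is a hypothetical class, it moves forward only, and it does not reproduce every detail of Lucene's contracts.

```java
final class SortedDocIdsSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs; // sorted doc IDs that have a value
    private int pos = 0;      // position of the current candidate in docs

    SortedDocIdsSketch(int... docs) {
        this.docs = docs;
    }

    /** Moves forward to the first doc >= target that has a value. */
    private int seek(int target) {
        while (pos < docs.length && docs[pos] < target) {
            pos++;
        }
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    /** Query-style: "which is the next doc with a value at or after target?" */
    int advance(int target) {
        return seek(target);
    }

    /** Collector-style: "does this exact doc have a value?" */
    boolean advanceExact(int target) {
        return seek(target) == target;
    }
}
```

A query uses advance to leap to the next candidate, while a collector already knows which doc it is on and only needs a yes/no answer, which is what advanceExact returns. Targets are expected to be non-decreasing in both cases.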

Re: Lucene 6.5.1 source code

2022-02-01 Thread Adrien Grand
You can find the 6.5.1 source code on the old lucene-solr repository: https://github.com/apache/lucene-solr/tree/releases/lucene-solr%2F6.5.1 On Tue, Feb 1, 2022 at 2:54 PM Omri wrote: > > It seems that the old versions branches in github were deleted. > There is a way to see Lucene 6.5.1 source

Re: Migration from Lucene 5.5 to 8.11.1

2022-01-12 Thread Adrien Grand
The log says what the problem is: version 8.11.1 cannot read indices created by Lucene 5.5, you will need to reindex your data. On Wed, Jan 12, 2022 at 3:41 PM wrote: > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene

Re: Want explanation on lucene norms

2022-01-05 Thread Adrien Grand
Hi, Norms are inputs to the score that are independent of the query. They are typically computed as a function of the number of terms of a document: the more terms, the higher the normalization factor and the lower the score. Lucene computes and indexes length normalization factors automatically f
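
As a worked illustration of "the more terms, the lower the score", here is the term-frequency part of the standard BM25 formula in plain Java. It is a sketch under assumptions: it leaves out IDF and Lucene's lossy one-byte encoding of lengths, and `Bm25LengthNormSketch` is a hypothetical name. The values k1 = 1.2 and b = 0.75 are BM25Similarity's defaults.

```java
final class Bm25LengthNormSketch {
    /**
     * Term-frequency component of BM25: freq / (freq + k1 * (1 - b + b * docLength / avgDocLength)).
     * Longer documents get a larger normalization term and therefore a lower result.
     */
    static double tfComponent(double freq, double docLength, double avgDocLength,
                              double k1, double b) {
        double norm = k1 * (1 - b + b * docLength / avgDocLength);
        return freq / (freq + norm);
    }
}
```

For the same term frequency, a 1000-term document scores lower than a 10-term document when the average length is 100 terms, because its normalization term is larger.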

Re: Lucene 9.0.0 inconsistent index options

2021-12-14 Thread Adrien Grand
This looks related to the new changes around schema validation. Lucene now requires a field to either be absent from a document or be indexed with the exact same options (index options, points dimensions, norms, doc values type, etc.) as already indexed documents that also have this field. However

[ANNOUNCE] Apache Lucene 9.0.0 released

2021-12-07 Thread Adrien Grand
The Lucene PMC is pleased to announce the release of Apache Lucene 9.0. Apache Lucene is a high-performance, full-featured search engine library written entirely in Java. It is a technology suitable for nearly any application that requires structured search, full-text search, faceting, nearest-nei

Re: index file of lucene8.7 is larger than the 7.7

2021-12-07 Thread Adrien Grand
As a disclaimer, it can be misleading to draw conclusions on space efficiency based on such a small index. Can you compare file sizes by extension across 7.7 and 8.7? You might need to call IndexWriterConfig#setUseCompoundFile(false) to prevent the flush from wrapping your segment files in a compo

[ANNOUNCE] Apache Lucene 8.11.0 released

2021-11-16 Thread Adrien Grand
The Lucene PMC is pleased to announce the release of Apache Lucene 8.11. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. This r

Re: Need help on aggregation of nested documents

2021-11-16 Thread Adrien Grand
eader.document(int docID) right?. If that is the case won't getting all > the documents would be a costly operation and then finally doing the > aggregates. > > Is there any other way around this? > > Thanks > Gopal Sharma > > > > > > > > On Mon, N

Re: Need help on aggregation of nested documents

2021-11-15 Thread Adrien Grand
It's not straightforward as we don't provide high-level tooling to do this. You need to use the BitSetProducer that you pass to the ToParentBlockJoinQuery in order to resolve the range of child doc IDs for a given parent doc ID (see e.g. how ToChildBlockJoinQuery does it), and then aggregate over t
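
The range resolution described above can be sketched with java.util.BitSet standing in for the parent bit set. This assumes the usual block-join layout, in which children are indexed immediately before their parent, so the children of a parent are the docs after the previous parent and before the parent itself. `BlockJoinRangeSketch` is a hypothetical helper, not Lucene code.

```java
import java.util.BitSet;

final class BlockJoinRangeSketch {
    /**
     * Returns {firstChild, parentDoc}: the children of parentDoc occupy the
     * half-open doc ID range [firstChild, parentDoc).
     */
    static int[] childRange(BitSet parents, int parentDoc) {
        if (!parents.get(parentDoc)) {
            throw new IllegalArgumentException("doc " + parentDoc + " is not a parent");
        }
        // previousSetBit returns -1 when there is no earlier parent
        int previousParent = parents.previousSetBit(parentDoc - 1);
        return new int[] {previousParent + 1, parentDoc};
    }
}
```

Aggregating then means iterating over that range and reading doc values for each child, rather than loading stored documents.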

Re: Using setIndexSort on a binary field

2021-10-15 Thread Adrien Grand
Hi Alex, You need to use a BinaryDocValuesField so that the field is indexed with doc values. `Field` is not going to work because it only indexes the data while index sorting requires doc values. On Fri, Oct 15, 2021 at 6:40 PM Alex K wrote: > Hi all, > > Could someone point me to an example

Re: org.apache.lucene.search.BooleanWeight.bulkScorer() and BulkScorer.score()

2021-10-05 Thread Adrien Grand
> Next: BulkScorer.score() with its call tree and time spent: > > > > BulkScorer.score() > -->> Weight$DefaultBulkScorer.score() > -->>-->> Weight$DefaultBulkScorer.scoreAll() > -->>-->>-->> WANDScorer$1.nextDoc() > -->>-->>-->>

Re: org.apache.lucene.search.BooleanWeight.bulkScorer() and BulkScorer.score()

2021-10-01 Thread Adrien Grand
Is your profiler reporting inclusive or exclusive costs for each function? Ie. does it exclude time spent in functions that are called within a function? I'm asking because it makes total sense for IndexSearcher#search to spend most of its time in BulkScorer#score, which coordinates the whole match

Re: Querying into a Collector visits documents multiple times

2021-09-22 Thread Adrien Grand
Hi Steven, This collector looks correct to me. Resetting the counter to 0 on the first segment is indeed not necessary. We have plenty of collectors that are very similar to this one and we never observed any double-counting issue. I would suspect an issue in the code that calls this collector. M

Re: Adding vs multiplicating scores when implementing "recency"

2021-09-17 Thread Adrien Grand
ute a max score for a block? > > On Thu, Sep 16, 2021 at 12:41 PM Adrien Grand wrote: > > > > Hello, > > > > You are correct that the contribution would be additive in that case. We > > don't provide an easy way to make the contribution multiplicative. >

Re: Adding vs multiplicating scores when implementing "recency"

2021-09-16 Thread Adrien Grand
Hello, You are correct that the contribution would be additive in that case. We don't provide an easy way to make the contribution multiplicative. There is some debate about what is the best way to combine BM25 scores with query-independent features, though in the discussions I've seen contributi

Re: How exactly the normalized length of the documents are stored in the index

2021-07-13 Thread Adrien Grand
The BM25 similarity computes the normalized length as the number of tokens, ignoring synonyms (tokens at the same position). Then it encodes this length as an 8-bit integer in the index using this logic: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/SmallFl

Re: Need approach to store JSON data in Lucene index

2021-06-17 Thread Adrien Grand
In general, the preferred approach is denormalizing, but your description suggests that you want to be able to query anything: actions, tasks, test cases, etc. so I guess that the most natural approach would be to leverage Lucene's support for index-time joins, see the documentation of the join pac

Re: Is deleting with IndexReader still possible?

2021-06-17 Thread Adrien Grand
Good catch Michael, deleting documents through IndexReader was actually removed a long time ago. I just edited the FAQ to correct this. On Thu, Jun 17, 2021 at 10:08 AM Michael Wechner wrote: > Hi > > According to the FAQ one can delete documents using the IndexReader > > > https://cwiki.apache.org/conf

Re: Handling Archive Data Using Lucene 7.6

2021-06-14 Thread Adrien Grand
Hi Rashmi, This upgrade skips 3 major versions, so the simplest path will be to reindex your content. On Fri, Jun 11, 2021 at 10:40 AM Rashmi Bisanal wrote: > Hi Lucene Support Team , > > > > Objective : Upgrade Lucene 3.6 to 7.6 > > > > Description : We have huge data against version Lucene 3.6

Re: Potential bug

2021-06-14 Thread Adrien Grand
cene search around 20 hits i dont want > > >>>> thousands of hits. > > >>>> > > >>>> > > >>>> Best regards > > >>>> > > >>>> > > >>>> On 6/9/21 1:25 PM, Diego Ceccarelli (BLOO

Re: Monitoring decisions taken by IndexOrDocValuesQuery

2021-06-10 Thread Adrien Grand
can get away with storing this data only once and using one of > the queries. > > On Wed, Jun 9, 2021 at 10:39 PM Adrien Grand wrote: > > > FWIW a related PR was just merged that allows to introspect query > > execution: https://issues.apache.org/jira/browse/LUCENE-9965. It

Re: Monitoring decisions taken by IndexOrDocValuesQuery

2021-06-09 Thread Adrien Grand
FWIW a related PR was just merged that allows introspecting query execution: https://issues.apache.org/jira/browse/LUCENE-9965. It's different from your use-case though in that it is debugging information for a single query rather than statistical information across lots of user queries (and the ap

Re: Potential bug

2021-06-09 Thread Adrien Grand
Hi Baris, totalhitsThreshold is actually a minimum threshold, not a maximum threshold. The problem is that Lucene cannot directly identify the top matching documents for a given query. The strategy it adopts is to start collecting hits naively in doc ID order and to progressively raise the bar ab
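The strategy described above can be sketched in plain Java, without any Lucene classes. This is only a toy model under stated assumptions: hits arrive as an array of scores in doc ID order, `k` is at least 1, and hits are counted exactly until `totalHitsThreshold` of them have been seen, after which non-competitive hits may be skipped without being counted. The class and method names are hypothetical.

```java
import java.util.PriorityQueue;

// Toy model, not Lucene code: top-k collection in doc ID order where the bar
// (the weakest score kept so far) rises as better hits are found.
final class TopKSketch {

    // Returns {hits counted, hits skipped as non-competitive}. Assumes k >= 1.
    static int[] collect(double[] scoresInDocIdOrder, int k, int totalHitsThreshold) {
        PriorityQueue<Double> topK = new PriorityQueue<>(); // min-heap: head is the current bar
        int counted = 0;
        int skipped = 0;
        for (double score : scoresInDocIdOrder) {
            boolean queueFull = topK.size() == k;
            // Skipping is only allowed once enough hits have been counted.
            boolean maySkip = queueFull && counted >= totalHitsThreshold;
            if (maySkip && score <= topK.peek()) {
                skipped++;
                continue;
            }
            counted++;
            if (!queueFull) {
                topK.add(score);
            } else if (score > topK.peek()) {
                topK.poll();      // drop the weakest kept hit
                topK.add(score);  // the bar rises
            }
        }
        return new int[] {counted, skipped};
    }

    public static void main(String[] args) {
        int[] result = collect(new double[] {5, 4, 3, 2, 1, 6}, 2, 3);
        System.out.println("counted=" + result[0] + " skipped=" + result[1]);
    }
}
```

In this model the threshold acts as a floor: the reported count is exact up to that many hits and only a lower bound beyond it, which is one way to read "minimum threshold, not a maximum threshold".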

Re: An interesting case

2021-06-08 Thread Adrien Grand
e count as scoredocs already > > has that. > > > > But seeing totalhits high number, that worries me as i explained above. > > > > > > Best regards > > > > > > On 6/8/21 1:12 PM, Adrien Grand wrote: > >> If you don't need any info

Re: An interesting case

2021-06-08 Thread Adrien Grand
array is ok as i mentioned ie, it has size n. > > i will check count api. > > > > Best regards > > > > *From:* Adrien Grand > > *Sent:* Tuesday, June 8, 2021 2:46 AM > > *To:* Lucene U

Re: An interesting case

2021-06-07 Thread Adrien Grand
When you call IndexSearcher#search(Query query, int n), there are two cases: - either your query matches n hits or more, and the TopDocs object will have a ScoreDoc[] array that contains the n best scoring hits sorted by descending score, - or your query matches fewer than n hits and then the TopD
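The two cases can be mimicked with a small stand-alone sketch. It does not use Lucene: matches are modeled as a plain array of scores, and the class and method names are hypothetical. For the second case it assumes that all matches are returned when there are fewer than n of them.

```java
import java.util.Arrays;

// Toy model, not Lucene code: returns at most n scores, best first.
final class TopNSketch {

    static double[] topN(double[] matchScores, int n) {
        double[] sorted = matchScores.clone();
        Arrays.sort(sorted); // ascending order
        int size = Math.min(n, sorted.length);
        double[] top = new double[size];
        for (int i = 0; i < size; i++) {
            top[i] = sorted[sorted.length - 1 - i]; // read from the end: descending score
        }
        return top;
    }

    public static void main(String[] args) {
        double[] scores = {1.0, 3.0, 2.0};
        System.out.println(Arrays.toString(topN(scores, 2))); // n hits or more match
        System.out.println(Arrays.toString(topN(scores, 5))); // fewer than n hits match
    }
}
```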

Re: Changing Term Vectors for Query

2021-06-07 Thread Adrien Grand
Hi Marcel, You can make Lucene index custom frequencies using something like DelimitedTermFrequencyTokenFilter, which would be easier than writing a custom Query/
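To show the idea of carrying a frequency inside the token text, here is a stand-alone sketch that splits a token such as `lucene|5` into a term and a frequency. It does not call the Lucene filter: the class and method names are hypothetical, and both the `|` delimiter and the fallback frequency of 1 for tokens without a delimiter are assumptions made for the illustration.

```java
// Hypothetical sketch, not the Lucene filter itself: splits "term<delimiter>freq".
final class DelimitedFrequencySketch {

    // Returns the term part, i.e. everything before the last delimiter.
    static String term(String token, char delimiter) {
        int idx = token.lastIndexOf(delimiter);
        return idx < 0 ? token : token.substring(0, idx);
    }

    // Returns the frequency part, or 1 when the token carries no delimiter.
    static int frequency(String token, char delimiter) {
        int idx = token.lastIndexOf(delimiter);
        return idx < 0 ? 1 : Integer.parseInt(token.substring(idx + 1));
    }

    public static void main(String[] args) {
        String token = "lucene|5";
        System.out.println(term(token, '|') + " -> " + frequency(token, '|'));
    }
}
```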

Re: Performance decrease with NRT use-case in 8.8.x (coming from 8.3.0)

2021-05-19 Thread Adrien Grand
LUCENE-9115 certainly creates more files in the FSDirectory than in the ByteBuffersDirectory, e.g. stored fields are now always flushed to the FSDirectory since their size can't be known in advance, while they were always written to the ByteBuffersDirectory before (which was a big since these files

Re: How to ignore a match if a given keyword is before/after another given keyword?

2021-04-27 Thread Adrien Grand
Great to hear! On Tue, Apr 27, 2021 at 22:44, Jean Morissette wrote: > Using intervals worked, thank you for your help ! > > On Sun, 25 Apr 2021 at 13:52, Adrien Grand wrote: > > > Hi Jean, > > > > You should be able to do this with intervals, see > >

Re: NullPointerException in LongComparator.setTopValue

2021-04-26 Thread Adrien Grand
E michael.gr...@skidata.com | www.skidata.com > > -Original Message- > From: Adrien Grand > Sent: Thursday, March 18, 2021 12:12 > To: Lucene Users Mailing List > Subject: Re: NullPointerException in LongComparator.setTopValue > > Hi Michael, > > At first sig

Re: How to ignore a match if a given keyword is before/after another given keyword?

2021-04-25 Thread Adrien Grand
Hi Jean, You should be able to do this with intervals, see https://lucene.apache.org/core/8_8_1/queries/org/apache/lucene/queries/intervals/package-summary.html . On Sun, Apr 25, 2021 at 18:43, Jean Morissette wrote: > Thank you for your answer. > > The problem with this solution is that it e

Re: Backward compatibility of FST50 and UniformSplit formats

2021-04-19 Thread Adrien Grand
Hi Dmitry, These codecs are indeed not backward compatible. Only the default codec is guaranteed to be backward compatible. If you would like to bring your index to a snapshot of the main branch, one option would be to: 1. Use Lucene 8.5's IndexWriter#addIndexes in order to create a copy of your
