Re: Temporary vector file during merging

2025-07-22 Thread Viliam Ďurina
Hi all, some of the information above was incorrect. This is what happens: - the source "vec" files are indeed read twice, but for a different reason: once to calculate the checksum and once to copy the live vectors to the "vec_temp" file. - the "vec.tmp" file is then closed for writing and opened

Re: Temporary vector file during merging

2025-06-27 Thread Viliam Ďurina
I can confirm the temp file isn't renamed, but it's copied a second time. I'm on vacation next week. On Fri, Jun 27, 2025 at 21:24, Michael Sokolov wrote: > Right! Thanks for the pointer. It does seem like there is room for > improvement then, maybe Viliam wants to tackle it? > > On Fri, Jun 27,

Re: Temporary vector file during merging

2025-06-27 Thread Michael Sokolov
Right! Thanks for the pointer. It does seem like there is room for improvement then, maybe Viliam wants to tackle it? On Fri, Jun 27, 2025 at 12:57 PM Adrien Grand wrote: > > Mike, I believe that the answer to your question is in this PR review > comment: https://github.com/apache/lucene/pull/601

Re: Temporary vector file during merging

2025-06-27 Thread Adrien Grand
Mike, I believe that the answer to your question is in this PR review comment: https://github.com/apache/lucene/pull/601#discussion_r783711025. Merging is currently implemented by looping over fields once, and merging them. Writing the vec file first would require merging flat vectors for all fiel

Re: Temporary vector file during merging

2025-06-27 Thread Michael Sokolov
Without this temp file we would need to load the entire set of vectors for the new merged segment into RAM in order to support building an HNSW graph from it. This way we can read the vectors off the disk in the same way we would do during normal searches. I'm not sure, but I think the temp file s

Re: Point query on a LatLonPoint field

2025-06-09 Thread Ignacio Vera
I tried to move the API toward this direction ( https://github.com/apache/lucene/issues/10194) but I got pushed back. On Mon, Jun 9, 2025 at 8:22 PM Tomás Fernández Löbbe wrote: > Thanks a lot Ignacio, > This does seem to work. I'm wondering why this is not part of the query > processing itsel

Re: Point query on a LatLonPoint field

2025-06-09 Thread Tomás Fernández Löbbe
Thanks a lot Ignacio, This does seem to work. I'm wondering why this is not part of the query processing itself? Are there situations in which someone would not want this behavior? Tomas On Mon, Jun 9, 2025 at 11:02 AM Ignacio Vera wrote: > That is actually expected as the query is trying to ma

Re: Point query on a LatLonPoint field

2025-06-09 Thread Ignacio Vera
That is actually expected: the query is trying to match the original point against the encoded point in the index, and therefore it does not match. There are other cases where results are not as expected, for example if you index the points from a polygon and then you make a polygon query using that polyg

Re: Sub-Graphs in Hnsw

2025-06-05 Thread Michael Sokolov
Oh, thanks for pointing that out, I hadn't seen the issue: I think it's roughly the same idea, we were discussing off-line (Kaival joined our office in Boston recently). Maybe let's move the discussion to that issue and iterate there On Thu, Jun 5, 2025 at 2:44 PM Michael Froh wrote: > > I'm wond

Re: Sub-Graphs in Hnsw

2025-06-05 Thread Michael Froh
I'm wondering if this is the same idea that Kaival is proposing in https://github.com/apache/lucene/issues/14758 (Support multiple HNSW graphs backed by the same vectors). On Thu, Jun 5, 2025 at 11:32 AM Michael Sokolov wrote: > I do think there could be many interesting use cases for building >

Re: Sub-Graphs in Hnsw

2025-06-05 Thread Michael Sokolov
I do think there could be many interesting use cases for building multiple graphs from a single set of vectors. For example, one might want to sometimes search all the docs, sometimes search the one subset and other times another subset; baking the constraint into the graph construction would be l

Re: Sub-Graphs in Hnsw

2025-06-04 Thread Ravikumar Govindarajan
> > I wonder if you could influence the graph search by incorporating the > partition key (customer id?) to the vectors somehow? If this was done > well it should lead to a natural clustering of the graph. > I can explore further on this. Thanks for the pointers.. On Mon, Jun 2, 2025 at 11:14 PM

Re: Sub-Graphs in Hnsw

2025-06-02 Thread Michael Sokolov
I wonder if you could influence the graph search by incorporating the partition key (customer id?) to the vectors somehow? If this was done well it should lead to a natural clustering of the graph. On Mon, Jun 2, 2025 at 11:32 AM Ravikumar Govindarajan wrote: > > Hi Michael, > > The docs range co

Re: Sub-Graphs in Hnsw

2025-06-02 Thread Ravikumar Govindarajan
Hi Michael, The number of docs could vary widely, from a few tens to tens of thousands and, in very heavy usage cases, 100k and above, in a single segment. Filtered HNSW, like you said, uses a single graph, which could be better if designed as sub-graphs. On Mon, 2 Jun 2025 at 5:42 PM, Michael Sokolo

Re: Sub-Graphs in Hnsw

2025-06-02 Thread Michael Sokolov
How many documents do you anticipate in a typical sub range? If it's in the hundreds or even low thousands you would be better off without hnsw. Instead you can use a function score query based on the vector distance. For larger numbers where hnsw becomes useful, you could try using filtered hnsw,

Re: Suggestion needed for a case of Lucene Migration with TokenStream

2025-05-30 Thread Michael Sokolov
> Regards > Rajib > > -Original Message- > From: Saha, Rajib > Sent: 27 May 2025 11:52 > To: java-user@lucene.apache.org > Subject: RE: Suggestion needed for a case of Lucene Migration with TokenStream > > Hi Uwe, > > Thanks for your suggestions till now. We have be

RE: Suggestion needed for a case of Lucene Migration with TokenStream

2025-05-29 Thread Saha, Rajib
Dear Experts, Can somebody please help and guide me for the below queries? I have become a bit clueless now, after a good number of different attempts. Regards Rajib -Original Message- From: Saha, Rajib Sent: 27 May 2025 11:52 To: java-user@lucene.apache.org Subject: RE: Suggestion

RE: Suggestion needed for a case of Lucene Migration with TokenStream

2025-05-26 Thread Saha, Rajib
IndexEngine.java:981) = Regards Rajib -Original Message- From: Uwe Schindler Sent: 30 April 2025 02:03 To: java-user@lucene.apache.org Subject: Re: Suggestion needed for a case of Lucene Migration with TokenStream If this is Windows, the deletion may not wor

Re: Perf comparision for MMapDirectory Vs NIOFSDirectory

2025-05-23 Thread ashwini singh
Thanks! On Wed, 21 May 2025 at 23:35, Adrien Grand wrote: > Hello Ashwini, > > MMapDirectory will often perform a bit faster. While NIOFSDirectory needs > to first copy data from the buffer cache to heap arrays, MMapDirectory can > read directly from the buffer cache. > > Lucene's benchmark suit

Re: RAM-per-thread hard limit

2025-05-23 Thread Uwe Schindler
Hi, The segment size and this buffer parameter are unrelated to each other. Lucene builds smaller segments during indexing, but they are merged at a later stage anyway, so producing larger segments from the beginning and hitting limits like you see is not required for fast search. So raising th

Re: RAM-per-thread hard limit

2025-05-23 Thread Viliam Ďurina
My index was only vectors, plus a small string ID. That is probably the reason why it didn't hit any issue. When I added a larger text field to the document, I've hit this exception: Exception in thread "Thread-0" java.lang.RuntimeException: java.lang.ArithmeticException: integer overflow at com.d

Re: Perf comparision for MMapDirectory Vs NIOFSDirectory

2025-05-21 Thread Adrien Grand
Hello Ashwini, MMapDirectory will often perform a bit faster. While NIOFSDirectory needs to first copy data from the buffer cache to heap arrays, MMapDirectory can read directly from the buffer cache. Lucene's benchmark suite allows comparing these two directories. I haven't done so recently, but

Re: recommended index size

2025-05-15 Thread Anh Dũng Bùi
system. On Fri, Jan 5, 2024 at 3:51 Ralf Heyde wrote: > Hi Vincent, > > My 2 cents: > > We had a production environment with ~250g and ~1M docs with static + > dynamic fields in Solr (afair lucene 7) with a machine having 4GB for the > jvm and (afair) a little bit more mayb

Re: Regarding Clustering Support in Lucene

2025-05-14 Thread Arun Kumar Kalakanti
Dear all, My bad, KMeans is in 10.2 too. Are there any other clustering algos like DBSCAN (or HDBSCAN) or Agglomerative planned in future? Regards, Arun Kumar K On Tue, 6 May 2025 at 17:11, Arun Kumar Kalakanti wrote: > Dear all, > > Lucene 10.1 introduced "experimental" KMeans clustering of

Re: Crashes caused by high deleted .dvd file mmap counts

2025-05-13 Thread Uwe Schindler
Hi, you can try to further reduce sharedArenaMaxPermits down to 1 (which restores the old behaviour). I don't know what OpenSearch is doing with that many docvalues updates, but you're already very close to the limits. If you need so many indexes on one machine and if you don't care about "slow

Re: Crashes caused by high deleted .dvd file mmap counts

2025-05-12 Thread Justin Borromeo
Hi Uwe, Setting -Dorg.apache.lucene.store.MMapDirectory.sharedArenaMaxPermits=64 didn't seem to help and we're still seeing restarts. A question about your response: what are normal update ratios? Each of our machines is running 32 OpenSearch shards (Lucene indexes), each with about 52 segments.

Re: Expressions module, support of Strings

2025-05-11 Thread David Smiley
> https://lists.apache.org/thread/xdjt7rlsmoy7bx7ctffxvsmh91khlz6v Thanks for the detailed response, Uwe! On Wed, May 7, 2025 at 2:55 PM David Smiley wrote: > I've been looking at the Expressions module, a really impressive piece of > work! > > The "JavaScript" sub-package only appears to suppo

Re: Crashes caused by high deleted .dvd file mmap counts

2025-05-09 Thread Uwe Schindler
Hi, Did sharedArenaMaxPermits=64 help? Actually, sorry for my earlier answer; I did not realize that you were talking about doc values updates. I just saw "deleted". But basically the issue is the same: every update or delete will create a new file belonging to the same segment. As each segment by de

Re: Crashes caused by high deleted .dvd file mmap counts

2025-05-07 Thread Robert Muir
On Wed, May 7, 2025 at 3:48 PM Justin Borromeo wrote: > > > One thing I don't understand is why does the list of deleted mmapped > fields only include doc values files? If your theory is correct and this > is caused by deletes being updated over and over, wouldn't we expect only > .liv files to b

Re: Expressions module, support of Strings

2025-05-07 Thread Uwe Schindler
Hi, the expressions module is just made for calculating scores, therefore there's no need to call any function which takes strings. Also string bindings are not consumed, it only supports DoubleValues as function variables. If you want a full scripting language, check Elasticsearch Expression

Re: Crashes caused by high deleted .dvd file mmap counts

2025-05-07 Thread Justin Borromeo
Hi Uwe, Thanks for the response. We've tried setting sharedArenaMaxPermits to 64; I'll update this thread once we get some data. One thing I don't understand is why does the list of deleted mmapped fields only include doc values files? If your theory is correct and this is caused by deletes bei

Re: Crashes caused by high deleted .dvd file mmap counts

2025-05-07 Thread Uwe Schindler
Hi, this could be related to a bug or limitation of the following change: 1. GITHUB#13570, GITHUB#13574, GITHUB#13535: Avoid performance degradation

Re: Crashes caused by high deleted .dvd file mmap counts

2025-05-06 Thread Ankit Jain
Hi Justin, This question is better for the OS community, as some of the settings are specific to OpenSearch. We would really appreciate it if you could create an OpenSearch issue. We can always follow up with the Lucene community if it turns out

Re: Suggestion needed for a case of Lucene Migration with TokenStream

2025-04-29 Thread Uwe Schindler
:59 To: java-user@lucene.apache.org Subject: Re: Suggestion needed for a case of Lucene Migration with TokenStream Hi, what do you mean with: "But same content on rebuilding the index is not working"? How do you rebuild the index? It is not enough to just read all documents as stored

RE: Suggestion needed for a case of Lucene Migration with TokenStream

2025-04-29 Thread Saha, Rajib
Schindler Sent: 28 April 2025 17:59 To: java-user@lucene.apache.org Subject: Re: Suggestion needed for a case of Lucene Migration with TokenStream Hi, what do you mean with: "But same content on rebuilding the index is not working"? How do you rebuild the index? It is not enough to jus

Re: Suggestion needed for a case of Lucene Migration with TokenStream

2025-04-28 Thread Uwe Schindler
art now. Do you have any suggestion on the problem ? Regards Rajib -Original Message- From: Uwe Schindler Sent: 25 April 2025 18:19 To: java-user@lucene.apache.org Subject: Re: Suggestion needed for a case of Lucene Migration with TokenStream Hi, I'd like to mention the following: Yo

RE: Suggestion needed for a case of Lucene Migration with TokenStream

2025-04-28 Thread Saha, Rajib
. I am debugging this part now. Do you have any suggestion on the problem ? Regards Rajib -Original Message- From: Uwe Schindler Sent: 25 April 2025 18:19 To: java-user@lucene.apache.org Subject: Re: Suggestion needed for a case of Lucene Migration with TokenStream Hi, I'd li

Re: Suggestion needed for a case of Lucene Migration with TokenStream

2025-04-25 Thread Uwe Schindler
5) = Can you please suggest here too? Regards Rajib -Original Message- From: Mikhail Khludnev Sent: 24 April 2025 12:10 To: java-user@lucene.apache.org Subject: Re: Suggestion needed for a case of Lucene Migration with TokenStream Hi Use TextField.TYPE_STORED as the third ar

Re: Suggestion needed for a case of Lucene Migration with TokenStream

2025-04-24 Thread Mikhail Khludnev
ludnev > Sent: 24 April 2025 12:10 > To: java-user@lucene.apache.org > Subject: Re: Suggestion needed for a case of Lucene Migration with > TokenStream > > Hi > Use TextField.TYPE_STORED as the third argument in new Field() > see > > https://github.com/apache/lucene-solr/blo

RE: Suggestion needed for a case of Lucene Migration with TokenStream

2025-04-24 Thread Saha, Rajib
) = Can you please suggest here too? Regards Rajib -Original Message- From: Mikhail Khludnev Sent: 24 April 2025 12:10 To: java-user@lucene.apache.org Subject: Re: Suggestion needed for a case of Lucene Migration with TokenStream Hi Use TextField.TYPE_STORED as the third

Re: Suggestion needed for a case of Lucene Migration with TokenStream

2025-04-23 Thread Mikhail Khludnev
Hi Use TextField.TYPE_STORED as the third argument in new Field() see https://github.com/apache/lucene-solr/blob/e27f44e3d78dfcec230c97e0a1240e3751daeff9/lucene/core/src/java/org/apache/lucene/document/TextField.java#L35C33-L35C44 On Thu, Apr 24, 2025 at 8:37 AM Saha, Rajib wrote: > Hi Experts,

Re: Question about ImpactsDISI for boolean queries

2025-04-21 Thread Alfonsi, Peter
Hi Adrien, Thanks for the quick reply! This makes sense. I think BlockMaxConjunctionBulkScorer actually never calls setMinCompetitiveScore() at all, so there's no hope of skipping, while ConjunctionScorer does in the case that there's only one scorer (which happens when we move the range query

Re: Question about ImpactsDISI for boolean queries

2025-04-21 Thread Adrien Grand
You are on the right track. It's easier to skip by score when there is a single scoring clause than when the score is the sum of the scores of two clauses. Well, actually in this case two clauses are not much harder since one of the clauses gives the same score to all documents, but the conjunctio

Re: Synonyms and searching

2025-04-21 Thread Anh Dũng Bùi
[act as web server] generates: > > [work] > [act] > [like] > [as] > [internet] > [web] > [host] > [server] > > and the input: [act_as_web_server] generates: > > [work] > [act] > [act_] > [like] > [as] > [as_] > [internet] > [web] > [web_

Re: Does Lucene Vector Search support int8 and / or even binary?

2025-04-14 Thread Uwe Schindler
ave to quantize on your own. Uwe On 29.03.2024 at 08:28, Michael Wechner wrote: thanks for your feedback and pointers! To play with binary vectors the following project might be useful https://github.com/cohere-ai/BinaryVectorDB Re Lucene, I will try to better understand what you suggest

Re: Does Lucene Vector Search support int8 and / or even binary?

2025-04-14 Thread John Dale (DB2DOM)
unsubscribe On Tue, Mar 19, 2024 at 2:59 PM Shubham Chaudhary wrote: > Hi Michael, > > Lucene already had int8 vector support since 9.5 (#1054 > ) but it was left to the user > to get those quantized vectors and index using KnnByteVectorField > < > htt

Re: Handling Nested Vector Field in the Filter Criteria

2025-04-09 Thread Michael Sokolov
You can combine queries; they are composable. Whether it makes sense or not for your use case is something you will have to decide. To me it's hard to see a case where vector query 1 AND vector query 2 would be preferable to combining the vectors "up front" (ie when creating the vectors), but mayb

Re: Handling Nested Vector Field in the Filter Criteria

2025-04-08 Thread Arun Kumar Kalakanti
Examples of the Nested Query: A AND (B OR C) AND D, A AND (!B OR (C and D)), etc. Any field(s) represented by A, B, C, and D can be a Vector or Regular Field too. On Tue, Apr 8, 2025 at 12:33 PM Arun Kumar Kalakanti < arun.kalaka...@gmail.com> wrote: > Hi all, > > I’m working with vector queries

Re: Synonym graph and multiple values

2025-03-25 Thread Michael Froh
This relates to the "position increment gap" for your analyzer and is configurable. If you check the JavaDoc for Analyzer#getPositionIncrementGap, it says: * Invoked before indexing a IndexableField instance if terms have already been added to that * field. This allows custom analyzers to p

RE: Synonyms and searching

2025-03-10 Thread Trevor Nicholls
phrases. But maybe my expectations are too high, or maybe I'm just doing it wrong. cheers T -Original Message- From: Uwe Schindler Sent: Monday, 10 March 2025 23:38 To: java-user@lucene.apache.org Subject: Re: Synonyms and searching Hi, Another way to do this is using Word Delimite

Re: Synonyms and searching

2025-03-10 Thread Uwe Schindler
Hi, Another way to do this is to use Word Delimiter Filter with the "catenate" options. Be aware that you need special text tokenization (not the standard tokenizer, but WhitespaceTokenizer instead). This approach is common for product numbers. To not break your "normal" analysis, it is often a

Re: Synonyms and searching

2025-03-05 Thread Michael Sokolov
One thing to check is whether the synonyms are configured as bidirectional, or which direction they go (eg is "a b" being expanded to "ab" but "ab" is not being expanded to "a b"??) On Wed, Mar 5, 2025 at 2:20 PM Mikhail Khludnev wrote: > > Hello Trevor. > > Maintaining such a synonym map is too

Re: Synonyms and searching

2025-03-05 Thread Mikhail Khludnev
Hello Trevor. Maintaining such a synonym map is too much of a burden. One idea: stick words together with a "" separator using https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html Another idea, the opposite: break the user's words via a dictionary htt

Re: NRT segment replication in AWS

2025-03-03 Thread Sarthak Nandi
> @Sarthak - I see the term pre-copy all over LuceneServer & nrtSearch but I > haven't been able to distinguish the term from just "copy". Does the "pre" > simply refer to the fact that the transfer of bits is happening before the > replica starts to serve queries from that segment? I feel like I

Re: NRT segment replication in AWS

2025-03-03 Thread Michael Froh
On Sun, Mar 2, 2025 at 7:21 AM Marc Davenport wrote: > > @Michael - That second simpler architecture is very similar to what we are > considering; With the exception of a queue for announcing new > segments rather than a polling process. It is good to know that it's a > reasonable outline. You

Re: NRT segment replication in AWS

2025-03-02 Thread Steven Schlansker
> > > latest > > > > > index as they spin up and subscribe to changes for it. It seems like > > > > having > > > > > the indexer being responsible for also communicating with the > > replicas > > > > > would be double dut

Re: NRT segment replication in AWS

2025-03-02 Thread Marc Davenport
ms. > > > > > > In our case, the primary keeps the latest CopyState that any replica > > > should take in memory. > > > Replicas call an HTTP API in an infinite loop, passing in their current > > > version, and asking if any newer version is available.

Re: How can I know the lucene index version from files

2025-03-02 Thread Mikhail Khludnev
I suppose it depends on the version. On Sun, Mar 2, 2025 at 10:55 AM Ralf Heyde wrote: > Hey, > > You might use ‚luke‘ to figure it out. > > Luke is part of the lucene project and a tool to look into indexes. > > Cheers Ralf > > Sent from my phone; I cannot rule out spelling mistakes

Re: How can I know the lucene index version from files

2025-03-02 Thread Daniel Cerqueira
> Sent from my phone; I cannot rule out > spelling mistakes >> On 02.03.2025 at 08:18, Mikhail Khludnev wrote: >> >> Hi Daniel. >> Given >Lucene41<, my bet is that it was written by a 4.1..4.9 version. >> Presumably you could get 4.9 (a decade old, heh) and invoke >> https://luc

Re: How can I know the lucene index version from files

2025-03-02 Thread Daniel Cerqueira
> On Sun, Mar 2, 2025 at 12:21 AM Daniel Cerqueira > wrote: > >> I have this lucene index files, in a directory: >> >> ``` >> $ ls >> _1p.fdt _1p.fdx _1p.fnm _1p_Lucene41_0.doc _1p_Lucene41_0.pos >> _1p_Lucene41_0.tim _1p_Lucene41_0.tip _1p.nvd _1p.nvm _1p.si >> segments_1 segments.gen w

Re: How can I know the lucene index version from files

2025-03-01 Thread Ralf Heyde
Hey, You might use ‚luke‘ to figure it out. Luke is part of the lucene project and a tool to look into indexes. Cheers Ralf Sent from my phone; I cannot rule out spelling mistakes > On 02.03.2025 at 08:18, Mikhail Khludnev wrote: > > Hi Daniel. > Given >Lucene

Re: How can I know the lucene index version from files

2025-03-01 Thread Mikhail Khludnev
Hi Daniel. Given >Lucene41<, my bet is that it was written by a 4.1..4.9 version. Presumably you could get 4.9 (a decade old, heh) and invoke https://lucene.apache.org/core/4_9_0/demo/overview-summary.html#Searching_Files Or write a snippet of code that opens a Directory/IndexReader and then prints it to conso

Re: apache-lucene blowing up with large file

2025-03-01 Thread Dawid Weiss
The simple answer is - split your large text document into smaller documents, then use the same command but give it the folder where those smaller fragments are. This said, I think you should take a look at using the Java API directly, Daniel. You'll have a lot more control over how you index your
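As a rough illustration of the splitting step suggested here, the following is a minimal sketch using only the Java standard library. The class name `FragmentSplitter` and the line-count limit are hypothetical and not part of Lucene or of the command discussed in the thread; how each fragment is then written to a folder and indexed is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Lucene API: groups the lines of one large text
// into smaller fragments that could each be indexed as a separate document.
final class FragmentSplitter {

    /** Returns fragments of at most linesPerFragment lines each, in order. */
    static List<String> split(List<String> lines, int linesPerFragment) {
        if (linesPerFragment <= 0) {
            throw new IllegalArgumentException("linesPerFragment must be positive");
        }
        List<String> fragments = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += linesPerFragment) {
            int end = Math.min(start + linesPerFragment, lines.size());
            // Join the lines of this fragment back into one block of text.
            fragments.add(String.join("\n", lines.subList(start, end)));
        }
        return fragments;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("a", "b", "c", "d", "e");
        // Five lines with a limit of two lines per fragment give three fragments.
        System.out.println(split(lines, 2).size());
    }
}
```

Splitting on line counts is only one possible choice; splitting on paragraphs or a byte budget would work the same way, as long as each fragment stays well below the size that triggered the failure.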

Re: apache-lucene blowing up with large file

2025-02-28 Thread Daniel Cerqueira
> On Fri, Feb 28, 2025 at 10:30 AM Daniel Cerqueira > wrote: > >> Hi. I have apache-lucene version 10.1.0: >> ``` >> $ pacman -Qs apache-lucene >> local/apache-lucene 10.1.0-1 >> Apache Lucene is a high-performance, full-featured text search engine >> library written entirely in Java. >> ``` >

Re: apache-lucene blowing up with large file

2025-02-28 Thread Hrvoje Lončar
That's a textbook example of integer overflow. Perhaps Lucene is not designed to work with such large single files. On Fri, 28 Feb 2025, 10:50 Dawid Weiss, wrote: > Split your large file into smaller fragments and index each fragment as a > document. > > D. > > On Fri, Feb 28, 2025 at 10:30 AM D
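For readers unfamiliar with the failure mode named here, a minimal plain-Java sketch of integer overflow follows. It does not reproduce the Lucene code path (the thread does not show which computation overflowed); the class and method names are hypothetical. It only contrasts silent wrap-around with the checked `Math.toIntExact`, which throws an `ArithmeticException`.

```java
// Hypothetical illustration, not Lucene code.
final class IntOverflowSketch {

    /** Converts a byte count to int; throws ArithmeticException if it does not fit. */
    static int checkedSize(long byteCount) {
        return Math.toIntExact(byteCount);
    }

    /** Plain int addition: silently wraps around on overflow. */
    static int wrappingAdd(int a, int b) {
        return a + b;
    }

    public static void main(String[] args) {
        // Wraps around to a negative number instead of failing.
        System.out.println(wrappingAdd(Integer.MAX_VALUE, 1));
        try {
            // 3 GiB is larger than Integer.MAX_VALUE, so this throws.
            checkedSize(3L * 1024 * 1024 * 1024);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected: " + e.getMessage());
        }
    }
}
```

A checked conversion like this fails loudly once a size no longer fits in 32 bits, which is consistent with the advice in this thread to keep individual documents small.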

Re: apache-lucene blowing up with large file

2025-02-28 Thread Dawid Weiss
Split your large file into smaller fragments and index each fragment as a document. D. On Fri, Feb 28, 2025 at 10:30 AM Daniel Cerqueira wrote: > Hi. I have apache-lucene version 10.1.0: > ``` > $ pacman -Qs apache-lucene > local/apache-lucene 10.1.0-1 > Apache Lucene is a high-performance,

Re: NRT segment replication in AWS

2025-02-26 Thread Sarthak Nandi
short of our http timeout, waiting for > > a new version to become available, otherwise return "no update for now, > > try again". > > > > Once the replica receives an updated CopyState, it feeds it into the > > ReplicaNode with newNRTPoint which starts the file

Re: NRT segment replication in AWS

2025-02-26 Thread Michael Froh
ode at any given time. > > Your idea of using a queue instead is interesting but not something we > extensively looked at :) > > > I've looked at nrtsearch from yelp and they seem to let the primary node > > have direct knowledge of the replicas. That makes sen

Re: NRT segment replication in AWS

2025-02-26 Thread Steven Schlansker
Candless's LuceneServer. > > I know that Amazon internally uses Lucene and has indexing separated from > query nodes and that they re-index and publish completely new indexes with > every release to prod. I've been watching what I can of the great videos of > Sokolov, McCa

Re: Query on SoftUpdateDocument API

2025-02-21 Thread Adrien Grand
Hi Abhishek, Actually softUpdate is about doing an update where the deletion is performed via a soft delete rather than a hard delete. To perform doc-value updates, you need to use the updateNumericDocValue or updateBinaryDocValue APIs. Note that it doesn't actually update in-place, it needs to

Re: Looking for resources to understand query cost/complexity

2025-02-21 Thread Adrien Grand
This depends on many factors, but in my experience these two are good starting points: - Total number of matching docs of the query. - Number of segments times number of terms being looked up. This is a simplified model, some queries incur their own costs, e.g. phrase queries bottleneck on evalu

RE: Re: Sentence classification with Lucene

2025-02-19 Thread Dmitri Geller
Yes, something like lucene-classification [1]. But there are multiple classifiers in this package. Which one is better suited? (Imagine I collect more samples per class... about... 30-40 samples per class) Any good Java examples using these classifiers? Another question: in case I want my cl

Re: Sentence classification with Lucene

2025-02-19 Thread Tommaso Teofili
Hi, if you have 30 classes with 10 samples per class, I'd say that's not an optimal distribution. Apart from that, you may use one of the text classifiers from lucene-classification [1], is anything like this what you had in mind? Alternatively you can also do things outside of Lucene and use Luce

Vector re-ranking

2025-02-11 Thread Viliam Ďurina
5.18%, which is surprisingly low. I noticed that the raw vector files (*.vec) are opened, but not read at all. So I tried searching for `numCandidates` documents (`n` and `k` were both 50), and then re-ranked manually using the original vectors, and my recall for the quantized index rose to 81.70%,
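The manual re-ranking step described in this message can be sketched roughly as follows. This is a plain-Java illustration under stated assumptions, not the poster's code: the candidates are assumed to come from an over-fetched approximate search, the original vectors are assumed to be available as an in-memory `float[][]` indexed by document ID, and dot product is assumed as the similarity. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch: re-rank approximate candidates with exact similarities.
final class RerankSketch {

    /** Exact dot product between two vectors of equal length. */
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /**
     * Returns the k candidate IDs with the highest exact dot product against
     * the query, best first. fullVectors[id] stands in for the original
     * (unquantized) vector of document id.
     */
    static int[] rerank(int[] candidateIds, float[][] fullVectors, float[] query, int k) {
        Integer[] ids = Arrays.stream(candidateIds).boxed().toArray(Integer[]::new);
        // Sort by exact similarity, highest first.
        Arrays.sort(ids,
            Comparator.comparingDouble((Integer id) -> dot(fullVectors[id], query)).reversed());
        return Arrays.stream(ids).limit(k).mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        float[][] vectors = { {1f, 0f}, {0f, 1f}, {0.9f, 0.1f} };
        int[] top = rerank(new int[] {1, 2, 0}, vectors, new float[] {1f, 0f}, 2);
        System.out.println(Arrays.toString(top));
    }
}
```

The idea is that the quantized index only has to place the true neighbours somewhere among the `numCandidates` results; the exact pass then restores their order before the top `k` are kept.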

Re: How to retrieve vectors from the IndexReader

2025-02-11 Thread Michael Sokolov
Stored fields is a separate format that stores data in a row-wise fashion: all the stored data for a single document is written together. Vectors aren't *also* copied into stored fields storage, so the stored fields API can't be used to retrieve them. If we did allow that it would result in massiv

Re: How to retrieve vectors from the IndexReader

2025-02-11 Thread Viliam Ďurina
Thanks Adrien! The code has one issue: if (iterator.advance(leafDocID) == docID) should have been: if (iterator.advance(leafDocID) == leafDocID) After fixing this, it works (for reference, I'm using Lucene 10.1). But I still wonder why can't we retrieve vectors just as we retrieve any oth
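The fix above comes down to the difference between a top-level doc ID and a leaf-local one. The following plain-Java sketch mimics that arithmetic without depending on Lucene: the `docBases` array is a stand-in for the per-leaf `docBase` values, and the class and method names are hypothetical. In Lucene itself, `ReaderUtil.subIndex` and `LeafReaderContext.docBase` play these roles.

```java
// Hypothetical stand-in for the docBase arithmetic; not Lucene code.
final class LeafDocIdSketch {

    /**
     * Returns the index of the leaf that contains the top-level docID.
     * docBases[i] is the first top-level doc ID of leaf i; the array is
     * ascending and docBases[0] == 0.
     */
    static int subIndex(int docID, int[] docBases) {
        int lo = 0;
        int hi = docBases.length - 1;
        // Binary search for the last leaf whose docBase is <= docID.
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (docBases[mid] <= docID) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo;
    }

    /** Converts a top-level doc ID into the doc ID local to its leaf. */
    static int leafDocId(int docID, int[] docBases) {
        return docID - docBases[subIndex(docID, docBases)];
    }

    public static void main(String[] args) {
        int[] docBases = {0, 100, 250}; // three leaves holding 100, 150 and more docs
        System.out.println(subIndex(260, docBases));
        System.out.println(leafDocId(260, docBases));
    }
}
```

A per-leaf vector iterator only knows leaf-local IDs, so the value returned by `advance(leafDocID)` has to be compared with `leafDocID`, not with the top-level `docID`; the two coincide only in the first leaf.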

Re: How to retrieve vectors from the IndexReader

2025-02-10 Thread Adrien Grand
Hi Viliam, Your logic is mostly correct, here is a version that should be a bit simpler and correct (but beware, untested): IndexReader reader; // your multi-reader int docID; // top-level doc ID int readerID = ReaderUtil.subIndex(docID, reader.leaves()); LeafReaderContext leafContext = reader.le

Re: Suggestions for modeling an Index

2025-01-21 Thread Dawid Weiss
You could flatten the intervals into different documents. This would make retrieval of all of document's sectors a bit more clumsy but searching would be simpler and the number of fields would be constant. So each document would look like this: document_id: xyz sector_num: ... start: ... end: ...

Re: Suggestions for modeling an Index

2025-01-20 Thread Mikhail Khludnev
Hello Have you considered the range field https://lucene.apache.org/core/9_1_0/core/org/apache/lucene/document/IntRange.html ? On Mon, Jan 20, 2025 at 11:34 PM Cleber Muramoto wrote: > Hello. > > My model has the following Root structure, which consists of N > "TimeSpaceIntervals": > > { > id:

Re: Error Doc id doesn't match the query in vector searches

2025-01-17 Thread Varun Thacker
comes a AbstractKnnVectorQuery$DocAndScoreQuery >> >> I'll try looking with some fresh eyes tomorrow >> >> On Thu, Jan 16, 2025 at 6:00 PM Varun Thacker wrote: >> >>> I'll have to recreate my setup again since I tried re-building solr >>> witho

Re: Error Doc id doesn't match the query in vector searches

2025-01-17 Thread Varun Thacker
ith some fresh eyes tomorrow > > On Thu, Jan 16, 2025 at 6:00 PM Varun Thacker wrote: > >> I'll have to recreate my setup again since I tried re-building solr >> without some PRs and it wiped everything out(my mistake!) >> >> I was able to get the query Solr sen

Re: Error Doc id doesn't match the query in vector searches

2025-01-16 Thread Varun Thacker
Query I'll try looking with some fresh eyes tomorrow On Thu, Jan 16, 2025 at 6:00 PM Varun Thacker wrote: > I'll have to recreate my setup again since I tried re-building solr > without some PRs and it wiped everything out(my mistake!) > > I was able to get the

Re: Error Doc id doesn't match the query in vector searches

2025-01-16 Thread Varun Thacker
I'll have to recreate my setup again since I tried re-building solr without some PRs and it wiped everything out(my mistake!) I was able to get the query Solr sends for search KnnFloatVectorQuery vs what it uses for getting the score {AbstractKnnVectorQuery$DocAndScoreQuery. This might give

Re: Error Doc id doesn't match the query in vector searches

2025-01-16 Thread Varun Thacker
I have an index where I can repro it with 100% success. Let me look into what's causing it and create a Solr Jira On Mon, Oct 21, 2024 at 11:11 AM Michael Sokolov wrote: > I think this might be a better question for solr-user@? EG I don't > understand how Solr decides which Query to send to popu

Re: Reg Migration to 10.0.0 lucene core jar

2025-01-03 Thread Uwe Schindler
Hi, Which vulnerability are you talking about?!? We opened a CVE a while ago, but this was not about Lucene Core. Some checkers have false positives due to name mismatch. On 13.12.2024 at 10:41, lavanya ponnapoolu wrote: Hi Team, We are upgrading lucene-core jar from 4.7.0 to 10.0.0 beca

Re: Support for static analysis annotations

2025-01-03 Thread Uwe Schindler
Hi, we have not yet discussed that. At the moment Lucene uses one custom annotation ("@SuppressForbidden"), which is detected by the forbiddenapis plugin based on the pure class name (not the package). Forbiddenapis (https://github.com/policeman-tools/forbidden-apis) is a static analysis tool used ex

Re: Custom Query Implementation

2025-01-03 Thread Viacheslav Dobrynin
Hi, Thank you! On Fri, Jan 3, 2025 at 14:15, Uwe Schindler wrote: > Hi, > > the expressions query should not be slower. Of course, if you also take > the compilation into the query time measurement it may be a little slower > due to compilation and optimizing. In general queries should be warmed > before

Re: Custom Query Implementation

2025-01-03 Thread Uwe Schindler
Hi, the expressions query should not be slower. Of course, if you also take the compilation into the query time measurement it may be a little slower due to compilation and optimizing. In general queries should be warmed before measuring them + expressions should only be compiled once and reuse

Re: IndexFormatTooOldException

2024-12-19 Thread Ian Lea
Thanks Adrien. Looks to be exactly what I need. -- Ian. On Thu, Dec 19, 2024 at 1:27 PM Adrien Grand wrote: > Hi Ian, > > Indeed Lucene has been maintaining read-only support for 8.x indices > lately, see this method which lets you opt in for this: > > https://lucene.apache.org/core/10_0_0/co

Re: IndexFormatTooOldException

2024-12-19 Thread Adrien Grand
Hi Ian, Indeed Lucene has been maintaining read-only support for 8.x indices lately, see this method which lets you opt in for this: https://lucene.apache.org/core/10_0_0/core/org/apache/lucene/index/DirectoryReader.html#open(org.apache.lucene.index.IndexCommit,int,java.util.Comparator) . So if y

Re: Reg Migration to 10.0.0 lucene core jar

2024-12-14 Thread Mikhail Khludnev
Hello, org.apache.lucene.document.Field is there: https://lucene.apache.org/core/10_0_0/core/org/apache/lucene/document/Field.html or I don't understand what you refer to. Please elaborate. I think you need org.apache.lucene.store.FSDirectory#open(java.nio.file.Path) All jars should be the same ve

Re: Lucene Query Metrics

2024-12-04 Thread Mikhail Khludnev
Hello, There's nothing like that. Off the top of my head, there is a profile collector in Elasticsearch. On Wed, Dec 4, 2024 at 11:46 PM ashwini singh wrote: > Does lucene provide extensions (utilities) to extract metrics from Lucene > during the request execution? Or applications can only track execution >

Re: Lucene Query Metrics

2024-12-04 Thread ashwini singh
Does Lucene provide extensions (utilities) to extract metrics from Lucene during the request execution? Or can applications only track execution stats on top of Lucene? On Tue, 3 Dec 2024 at 23:20, Adrien Grand wrote: > Lucene doesn't expose query metrics, it's up to the application that > integr

Re: Lucene Query Metrics

2024-12-03 Thread Adrien Grand
Lucene doesn't expose query metrics, it's up to the application that integrates Lucene to compute and expose metrics that are relevant to them. On Wed, Dec 4, 2024, 00:31, ashwini singh wrote: > Hey everyone, > > Does lucene provide any query metrics (perf)? I am looking for something > very
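
Since the thread concludes that metrics have to be collected by the application, here is a minimal sketch of what that could look like. It is plain Python and does not call any Lucene API: `QueryMetrics` and its methods are hypothetical names, and the `search` argument is assumed to be whatever zero-argument callable the application uses to run a query.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


class QueryMetrics:
    """Application-side latency recorder; the search library itself is not instrumented."""

    def __init__(self) -> None:
        self._latencies: Dict[str, List[float]] = defaultdict(list)

    def timed(self, name: str, search: Callable[[], T]) -> T:
        """Run `search`, record its wall-clock time under `name`, and return its result."""
        start = time.perf_counter()
        try:
            return search()
        finally:
            # Recorded even if the search raises, so failures are still counted.
            self._latencies[name].append(time.perf_counter() - start)

    def count(self, name: str) -> int:
        return len(self._latencies[name])

    def mean_seconds(self, name: str) -> float:
        samples = self._latencies[name]
        return sum(samples) / len(samples) if samples else 0.0
```

An application would wrap each search call, for example `metrics.timed("knn", lambda: run_query(q))` with its own `run_query`, and export the aggregates through whatever monitoring system it already uses.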

Re: Lucene Slack Channel

2024-12-03 Thread ashwini singh
Thanks !! On Wed, 13 Nov 2024 at 13:31, Gus Heck wrote: > The slack channel (named 'lucene-dev') is generally for people building > lucene itself, and not generally for people looking for help providing > solutions using lucene. Typically one gets an apache.org address by > contributing enough

Re: Custom Query Implementation

2024-12-03 Thread Viacheslav Dobrynin
Hi, Thanks for the answers! Yes, my task is to store only non-zero values from a sparse vector of large dimension, where most of the elements are zero. On Tue, Dec 3, 2024 at 19:17, Mikhail Khludnev wrote: > Thanks for clarification Michael! > > On Tue, Dec 3, 2024 at 1:56 PM Michael Sokolov wrote: >

Re: Custom Query Implementation

2024-12-03 Thread Mikhail Khludnev
Thanks for clarification Michael! On Tue, Dec 3, 2024 at 1:56 PM Michael Sokolov wrote: > Sparse is meaning two different things here. In the case you found Mikhail, > it means not every document has a value for some vector field. I think the > question here is about very high dimensional vector

Re: Custom Query Implementation

2024-12-03 Thread Michael Sokolov
"Sparse" means two different things here. In the case you found, Mikhail, it means not every document has a value for some vector field. I think the question here is about very high dimensional vectors where most documents have zeroes in most dimensions of the vector. On Tue, Dec 3, 2024, 2:01 A
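
The second sense of "sparse" described here (a high-dimensional vector that is mostly zeroes) can be sketched as a mapping from dimension index to non-zero value. This is an illustration in plain Python, not how Lucene stores vectors; `SparseVector`, `to_sparse`, and `sparse_dot` are hypothetical names.

```python
from typing import Dict, Sequence

# Dimension index -> non-zero value; zero dimensions are simply absent.
SparseVector = Dict[int, float]


def to_sparse(dense: Sequence[float]) -> SparseVector:
    """Keep only the non-zero entries of a dense vector."""
    return {i: v for i, v in enumerate(dense) if v != 0.0}


def sparse_dot(a: SparseVector, b: SparseVector) -> float:
    """Dot product that only visits dimensions present in the smaller vector."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter map
    return sum(v * b[i] for i, v in a.items() if i in b)
```

Only dimensions that are non-zero in both vectors contribute to the score, which is why a per-dimension term or field lookup, as in the BooleanQuery approach discussed in this thread, is a natural fit for this kind of data.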

Re: Custom Query Implementation

2024-12-02 Thread Mikhail Khludnev
Morning. I noticed a condition that chooses between the sparse and dense formats underneath: https://github.com/apache/lucene/blob/6053e1e31378378f6d310a05ea6d7dcdfc45f48b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/OffHeapByteVectorValues.java#L108 Perhaps it may achieve your performance requirements.

Re: Custom Query Implementation

2024-12-02 Thread Viacheslav Dobrynin
Hi, Thanks for the answer! I think this is similar to my initial implementation, where I built the query as follows (PyLucene): def build_query(query): builder = BooleanQuery.Builder() for term in torch.nonzero(query): field_name = to_field_name(term.item()) value = query[
