Re: Temporary vector file during merging

2025-06-27 Thread Michael Sokolov
is sounds doable, but we never got to it. > > > > On Fri, Jun 27, 2025 at 2:19 PM Michael Sokolov wrote: > > > Without this temp file we would need to load the entire set of vectors > > for the new merged segment into RAM in order to support building an > > HNSW gra

Re: Temporary vector file during merging

2025-06-27 Thread Michael Sokolov
Without this temp file we would need to load the entire set of vectors for the new merged segment into RAM in order to support building an HNSW graph from it. This way we can read the vectors off the disk in the same way we would do during normal searches. I'm not sure, but I think the temp file s

Re: Sub-Graphs in Hnsw

2025-06-05 Thread Michael Sokolov
wrote: > > I'm wondering if this is the same idea that Kaival is proposing in > https://github.com/apache/lucene/issues/14758 (Support multiple HNSW graphs > backed by the same vectors). > > On Thu, Jun 5, 2025 at 11:32 AM Michael Sokolov wrote: > > > I do think there c

Re: Sub-Graphs in Hnsw

2025-06-05 Thread Michael Sokolov
key (customer id?) to the vectors somehow? If this was done > > well it should lead to a natural clustering of the graph. > > > > I can explore further on this. Thanks for the pointers.. > > On Mon, Jun 2, 2025 at 11:14 PM Michael Sokolov wrote: > > > I wonder i

Re: Sub-Graphs in Hnsw

2025-06-02 Thread Michael Sokolov
e docs range could vary in extremes from few 10s to tens-of-thousands > and in very heavy usage cases, 100k and above… in a single segment > > Filtered Hnsw like you said uses a single graph.., which could be better if > designed as sub-graphs > > On Mon, 2 Jun 2025 at 5:42 PM, Mic

Re: Sub-Graphs in Hnsw

2025-06-02 Thread Michael Sokolov
How many documents do you anticipate in a typical sub range? If it's in the hundreds or even low thousands you would be better off without hnsw. Instead you can use a function score query based on the vector distance. For larger numbers where hnsw becomes useful, you could try using filtered hnsw,
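For the small-range case, an exact scan is easy to picture: score every document in the range by vector distance and keep the best k. This is a minimal sketch in plain Python, not the Lucene API; `exact_knn`, the document ids, and the choice of Euclidean distance are all illustrative assumptions.

```python
import heapq
import math

def exact_knn(query, candidates, k):
    """Score every candidate by Euclidean distance to the query; no graph needed.

    candidates maps a hypothetical doc id to its vector.
    """
    scored = ((math.dist(query, vec), doc_id) for doc_id, vec in candidates.items())
    # nsmallest keeps only the k closest (distance, doc_id) pairs.
    return [doc_id for _, doc_id in heapq.nsmallest(k, scored)]

# A small "sub range", e.g. all vectors belonging to one customer.
docs = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [0.1, 0.2], "d": [5.0, 5.0]}
print(exact_knn([0.0, 0.1], docs, k=2))
```

With only hundreds of vectors per range the linear scan is cheap, which is presumably why a graph index buys little there.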

Re: Suggestion needed for a case of Lucene Migration with TokenStream

2025-05-30 Thread Michael Sokolov
The message is telling you that you previously indexed the field boe.search.wild_description with offsets and now you are trying to index it without offsets. This probably indicates you are using a different Analyzer, which is generally *not ok* since indexed fields must be indexed in a consistent

Re: Handling Nested Vector Field in the Filter Criteria

2025-04-09 Thread Michael Sokolov
You can combine queries; they are composable. Whether it makes sense or not for your use case is something you will have to decide. To me it's hard to see a case where vector query 1 AND vector query 2 would be preferable to combining the vectors "up front" (ie when creating the vectors), but mayb

Re: Synonyms and searching

2025-03-05 Thread Michael Sokolov
One thing to check is whether the synonyms are configured as bidirectional, or which direction they go (eg is "a b" being expanded to "ab" but "ab" is not being expanded to "a b"??) On Wed, Mar 5, 2025 at 2:20 PM Mikhail Khludnev wrote: > > Hello Trevor. > > Maintaining such a synonym map is too

Re: How to retrieve vectors from the IndexReader

2025-02-11 Thread Michael Sokolov
Stored fields is a separate format that stores data in a row-wise fashion: all the stored data for a single document is written together. Vectors aren't *also* copied into stored fields storage, so the stored fields API can't be used to retrieve them. If we did allow that it would result in massiv

Re: Custom Query Implementation

2024-12-03 Thread Michael Sokolov
Sparse means two different things here. In the case you found, Mikhail, it means not every document has a value for some vector field. I think the question here is about very high dimensional vectors where most documents have zeroes in most dimensions of the vector. On Tue, Dec 3, 2024, 2:01 A

Re: Custom Query Implementation

2024-12-02 Thread Michael Sokolov
Another way is using postings - you can represent each dimension as a term (`dim0`, `dim1`, etc) and index those that occur in a document. To encode a value for a dimension you can either provide a custom term frequency, or index the term multiple times. Then when searching you can form a BooleanQu
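The postings idea can be sketched without Lucene: one term per non-zero dimension, with the stored weight standing in for a custom term frequency. The snippet above is cut off before it says how the clauses are combined, so the disjunction and dot-product scoring below are assumptions, as are the names `build_postings` and `search`.

```python
from collections import defaultdict

def build_postings(docs):
    """Index only non-zero dimensions: term 'dimN' -> {doc_id: weight}."""
    postings = defaultdict(dict)
    for doc_id, vector in docs.items():
        for dim, weight in vector.items():
            if weight:
                postings[f"dim{dim}"][doc_id] = weight
    return postings

def search(postings, query):
    """OR together one clause per query dimension; the score is a dot product."""
    scores = defaultdict(float)
    for dim, q_weight in query.items():
        for doc_id, d_weight in postings.get(f"dim{dim}", {}).items():
            scores[doc_id] += q_weight * d_weight
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Sparse vectors given as {dimension: integer weight}.
docs = {1: {0: 2, 7: 1}, 2: {7: 3}, 3: {4: 5}}
postings = build_postings(docs)
print(search(postings, {7: 2, 0: 1}))
```

Documents that share no dimension with the query are never touched, which is the point of using an inverted index for very sparse vectors.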

Re: HNSW graph `connectComponents()` method takes a very long on random vectors

2024-12-01 Thread Michael Sokolov
That's interesting! One thing I'd say is we don't want to be optimizing for the random vector use case, so from that perspective this is less concerning. However we also don't want to have poor worst-case performance, so we should address this somehow. If you want to probe for degenerate cases, yo

Re: Any plans to patch Lucene 8.11.x for CVE-2024-45772 ?

2024-10-28 Thread Michael Sokolov
Do you actually use org.apache.lucene.replicator.http ? If not then this wouldn't have any material impact on your application. On Mon, Oct 28, 2024 at 4:25 AM Renaud SAINT-GRATIEN wrote: > > CONFIDENTIAL > > Hello, > > Is there any plan to patch Lucene 8.11 for CVE-2024-45772 ? > I need to stay

Re: Error Doc id doesn't match the query in vector searches

2024-10-21 Thread Michael Sokolov
I think this might be a better question for solr-user@? EG I don't understand how Solr decides which Query to send to populateScores -- is it the same one that was used to generate the matches in topDocs? It seems as if it should be, but then this error shouldn't happen ... I wonder if you can prin

Re: KnnQueries and result discrepancy between indexes with the same data

2024-09-12 Thread Michael Sokolov
> If your two indexes load data sequentially and in the same order, then I believe that you would get the same results. But we consider this an implementation detail rather than a guarantee that Lucene should have. You might even still be surprised by nondeterminism arising from concurrency during

Re: Get knowledge about apache lucene index migrate

2024-08-06 Thread Michael Sokolov
Yes, there is no support for upgrading a pre-8.x index to 9 or later. At some point it was decided that supporting that would lead to grief for users and/or hamper development of Lucene, so now you can only upgrade one major version. If you need to do so, the best supported option is to write a pro

Re: Converting docid to uid

2024-08-06 Thread Michael Sokolov
You could switch to DocValues, and it would probably be more efficient if you are only retrieving a single stored field but you have a lot of other ones in the index since stored fields are stored together and have to be decoded together. As far as visiting every segment on disk I'm not sure what

Re: Lucas - Luke toolbox integration for IntelliJ

2024-06-06 Thread Michael Sokolov
Neat! On Thu, Jun 6, 2024, 2:57 AM Balog Tamás wrote: > Dear Lucene Community, > Since Tuesday, the IntelliJ plugin called [Lucas]( > https://plugins.jetbrains.com/plugin/24567-lucas) is available on the > JetBrains Marketplace. > > It integrates / ports the Luke toolbox to the IntelliJ Platform

Re: ArithmeticException: due to integer overflow during lucene merging

2024-05-07 Thread Michael Sokolov
th a code > > search). > > We also always merge down to one segment (historical but also we index > > once and then there are no changes for a week to a month and then we > > reindex every document from scratch). > > > > Your response is very helpful already and

Re: ArithmeticException: due to integer overflow during lucene merging

2024-05-07 Thread Michael Sokolov
It seems as if the term frequency for some term exceeded the maximum. This can happen if you supplied custom term frequencies eg with https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/analysis/tokenattributes/TermFrequencyAttribute.html?is-external=true . The behavior didn't change since

Re: Help running the demo program

2024-04-22 Thread Michael Sokolov
I also found this helpful documentation by looking in the source code of SearchFiles.java: https://lucene.apache.org/core/9_10_0/demo/ On Mon, Apr 22, 2024 at 4:40 AM Stefan Vodita wrote: > > Hi Siddharth, > > If you happen to be using IntelliJ, you can run a demo class from the IDE. > It probabl

Re: hnsw parameters for vector search

2024-02-01 Thread Michael Sokolov
To get best results it's necessary to tune these parameters for each vector model. My suggestion is to use a subset of your 100M vectors for parameter optimization to save time while iterating through the parameter space, as you will indeed need to reindex in order to measure. Generally speaking, i

Re: DisjunctionMinQuery

2023-11-10 Thread Michael Sokolov
> In Lucene scores should go up for more relevancy. That is the case for combining child scores with min. min() is monotonic -- if its arguments increase, the result does not decrease, it only stays the same or increases, so I think it is a valid scoring operation for Lucene. And it makes some log
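The monotonicity claim can be checked by brute force on a small grid. This is only an illustrative sketch; `is_monotonic` is a hypothetical helper, and the subtraction case is included purely as a counterexample.

```python
import itertools

def is_monotonic(combine, values):
    """Check that raising either argument never lowers the combined score."""
    for a, b in itertools.product(values, repeat=2):
        for bump in values:  # bumps are non-negative
            if combine(a + bump, b) < combine(a, b):
                return False
            if combine(a, b + bump) < combine(a, b):
                return False
    return True

scores = [0.0, 0.5, 1.0, 2.5]
print(is_monotonic(min, scores))
print(is_monotonic(lambda a, b: a - b, scores))
```

`min` passes because increasing one child score either leaves the minimum alone or raises it, while a difference fails as soon as the second argument grows.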

Re: Can the BooleanQuery execution be optimized with same term queries?

2023-09-19 Thread Michael Sokolov
another thing to check beyond whether the correct documents are matched is whether the correct score is returned. I'm not sure actually how it works but I can imagine that a query for "red red wine" would produce a higher score for documents having "red red wine" than it would for documents having

Re: Top docs depend on value of K nearest neighbour

2023-08-03 Thread Michael Sokolov
well, it is "approximate" KNN and can get caught in local minima (maxima?). Increasing K has, indirectly, the effect of expanding the search space because the minimum score in the priority queue (score of the Kth item) is used as a threshold for deciding when to terminate the search. On Wed, Aug 2,

Re: Proposal to Reimplement Disk Usage API - Request for Feedback and Collaboration

2023-05-26 Thread Michael Sokolov
Hi Deepika, that would be a welcome addition - we had an earlier discussion about it; see the thread here: https://markmail.org/message/hq7jvobsnxwp7iat Please be careful not to copy the code from Elastic as it is not shared under an open license that permits copying On Wed, May 24, 2023 at 3:19 

Re: Can I simplify this bit of query boosting?

2023-05-11 Thread Michael Sokolov
You might also want to have a look at FeatureField. This can be used to associate a score with a particular term. On Thu, May 11, 2023 at 1:13 PM Hrvoje Lončar wrote: > > I had a situation when i wanted to sort a list of articles based on the > amount of data entered. For example, article having

Re: Question about index segment search order

2023-05-11 Thread Michael Sokolov
3ac51ece953d762c796f62730e27629966/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L281 > > Thanks, > Wei > > > On Thu, May 4, 2023 at 11:47 AM Michael Sokolov wrote: > > > Yes, sorry I didn't mean to imply you couldn't control this if you &

Re: Question about index segment search order

2023-05-04 Thread Michael Sokolov
with early termination. Do you think this makes sense? Any > > suggestion is appreciated. > > > > Thanks, > > Wei > > > > On Thu, May 4, 2023 at 3:33 AM Michael Sokolov wrote: > > > > > There is no meaning to the sequence. The segments are created > &

Re: Question about index segment search order

2023-05-04 Thread Michael Sokolov
There is no meaning to the sequence. The segments are created concurrently by many threads and the merge process will merge them without regards to any ordering. On Wed, May 3, 2023, 1:09 PM Patrick Zhai wrote: > For that part I'm not entirely sure, if other folks know it please chime in > :)

Re: Info required on licensing of Lucene component

2023-03-21 Thread Michael Sokolov
Lucene is licensed under the Apache license, just as it says in the LICENSE file. junit is used for testing Lucene and is not redistributed with it. Using Lucene in your code does not mean you are using junit, except in some extremely philosophical sense. EG Lucene developers may have developed Luc

Re: How to highlight fields that are not stored?

2023-02-16 Thread Michael Sokolov
Sorry your problem statement makes no sense: you should be able to store field data in the index without loading all your documents into RAM while indexing. Maybe there is some constraint you are not telling us about? Or you may be confused. In any case highlighting requires the document in its uni

Re: Other vector similarity metric than provided by VectorSimilarityFunction

2023-01-15 Thread Michael Sokolov
I would suggest building Lucene from source and adding your own similarity function to VectorSimilarityFunction. That is the proper extension point for similarity functions. If you find there is some substantial benefit, it wouldn't be a big lift to add something like that. However I'm dubious about the li

Re: Question about current situation of good first issues in GitHub

2023-01-13 Thread Michael Sokolov
That label seems to be something GitHub created automatically? You might have better luck browsing the full list of labels. I found these: https://github.com/apache/lucene/labels/legacy-jira-label%3Anewbie https://github.com/apache/lucene/labels/legacy-jira-label%3Anewdev https://github.com/apach

Re: Is there a way to customize segment names?

2022-12-16 Thread Michael Sokolov
+1 trying to coordinate multiple writers running independently will not work. My 2c for availability: you can have a single primary active writer with a backup one waiting, receiving all the segments from the primary. Then if the primary goes down, the secondary one has the most recent commit repli

Re: Lucene 4.10.4 forward slash syntax error

2022-11-28 Thread Michael Sokolov
Have you tried escaping with a backslash? I have a vague memory that might work. As for modifying classes in 4.10.4, you are welcome to do so in a custom fork, but that version is so old that we no longer post fixes for it on the official Apache release branches. The current release series is 9.x -

Re: Best strategy migrate indexes

2022-11-07 Thread Michael Sokolov
The error you got BufferedChecksumIndexInput(MMapIndexInput(path="tests_small_index-7.x-migrator\segments_1"))): 9 (needs to be between 6 and 7) indicates that the index you are reading was written by Lucene 9, so things are not set up the way you described (writing using Lucene 7) > Thanks TX

Re: Latency and recall re HSWN: Lucene versus Vespa

2022-10-01 Thread Michael Sokolov
I'd agree with the main point re: the need to combine vector-based matching with term-based matching. As for the comparison with Lucene, I'd say it's a shallow and biased take. The main argument is that Vespa's mutable in-memory(?) data structures are superior to Lucene's immutable on-disk segment

[ANNOUNCE] Apache Lucene 9.4.0 released

2022-09-30 Thread Michael Sokolov
The Lucene PMC is pleased to announce the release of Apache Lucene 9.4.0. Apache Lucene is a high-performance, full-featured search engine library written entirely in Java. It is a technology suitable for nearly any application that requires structured search, full-text search, faceting, nearest-n

Re: Best practice - preparing search term for Lucene

2022-09-23 Thread Michael Sokolov
I think it depends how precise you want to make the search. If you want to enable diacritic-sensitive search in order to avoid confusions when users actually are able to enter the diacritics, you can index both ways (ascii-folded and not folded) and not normalize the query terms. Or you can just fo
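Indexing both forms can be illustrated with a toy inverted index. This sketch uses Unicode decomposition as a rough stand-in for ASCII folding; it is not Lucene's folding filter, and `ascii_fold` and `index_terms` are hypothetical names.

```python
import unicodedata

def ascii_fold(text):
    """Strip combining marks after NFD decomposition (a rough stand-in for ASCII folding)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def index_terms(token):
    """Index both forms so exact and folded lookups can both match."""
    return {token, ascii_fold(token)}

# Doc 1 contains the accented spelling, doc 2 the plain one.
docs = {1: "r\u00e9sum\u00e9", 2: "resume"}
index = {}
for doc_id, token in docs.items():
    for term in index_terms(token):
        index.setdefault(term, set()).add(doc_id)

print(sorted(index["resume"]))             # unaccented query
print(sorted(index["r\u00e9sum\u00e9"]))   # accented query
```

A query typed without diacritics reaches both documents, while a query typed with them stays precise and matches only the accented document.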

Re: Max Field Length

2022-09-23 Thread Michael Sokolov
ooh On Fri, Sep 23, 2022 at 11:02 AM Adrien Grand wrote: > > We have a TruncateTokenFilter in lucene/analysis/common. :) > > On Fri, Sep 23, 2022 at 4:39 PM Michael Sokolov wrote: > > > I wonder if it would make sense to provide a TruncationFilter in > > addition to

Re: Max Field Length

2022-09-23 Thread Michael Sokolov
I wonder if it would make sense to provide a TruncationFilter in addition to the LengthFilter. That way long tokens in source text could be better supported, albeit with some confusion if they share the same very long prefix... On Fri, Sep 23, 2022 at 9:56 AM Scott Guthery wrote: > > Thanks much,

Re: Can lucene be used in Android ?

2022-09-09 Thread Michael Sokolov
no, and I think it could be challenging to go the route of using Dalvik/ART. Maybe you can run an actual JDK on Android? See https://openjdk.org/projects/mobile/android.html On Fri, Sep 9, 2022 at 9:27 AM Jie Wang wrote: > > Hey, > > Recently, I am trying to compile the Lucene to get a jar that c

Re: [ANNOUNCE] Issue migration Jira to GitHub starts on Monday, August 22

2022-08-24 Thread Michael Sokolov
Thanks! It seems to be working nicely. Question about the fix-version: tagging. I wonder if going forward we want to maintain that for new issues? I happened to notice there is also this "milestone" feature in github -- does that seem like a place to put version information? On Wed, Aug 24, 2022 at 3

Re: Performance Comparison of Benchmarks by using Lucene 9.1.0 vs 8.5.1

2022-07-26 Thread Michael Sokolov
https://home.apache.org/~mikemccand/lucenebench/ shows how various benchmarks have evolved over time *on the main branch*. There is no direct comparison of every version against every other version that I have seen though. On Tue, Jul 26, 2022 at 2:12 PM Baris Kazar wrote: > > Dear Folks,- > Sim

Re: Fuzzy Query Similarity

2022-07-09 Thread Michael Sokolov
Oh good! Thanks for clarifying, Uwe On Sat, Jul 9, 2022, 12:23 PM Uwe Schindler wrote: > Hi > > FuzzyQuery/MultiTermQuery and I don't see any way to "boost" exact > > matches, or even to incorporate the edit distance more generally into > > the per-term score, although it does seem like that wou

Re: Fuzzy Query Similarity

2022-07-09 Thread Michael Sokolov
I am no expert with this, but I got curious and looked at FuzzyQuery/MultiTermQuery and I don't see any way to "boost" exact matches, or even to incorporate the edit distance more generally into the per-term score, although it does seem like that would be something people would generally expect. So

Re: Version of log4j in Lucene 8.11.2

2022-06-23 Thread Michael Sokolov
Lucene core is a no-dependencies library. Some of the other Lucene modules, and the build and tests, have dependencies, but none of them includes log4j. So sorry, but we won't be making Lucene use log4j 2.17.2; probably you should get your compliance standards changed to include *forbidden* version

Re: Question about Benchmark

2022-05-17 Thread Michael Sokolov
OK I replied on the issue. This ann-benchmarks is a separate project, and I think you are asking about how to change it. Probably should take it up with erikbern or whatever community is supporting that actively. I just created a "plugin" so we could use it to test Lucene's KNN implementation, but

Re: New user questions about demo, downloads, and IRC

2022-04-26 Thread Michael Sokolov
thanks, I fixed the doc! On Tue, Apr 26, 2022 at 9:13 AM Bridger Dyson-Smith wrote: > > Hi Michael - > > On Mon, Apr 25, 2022 at 5:38 PM Michael Wechner > wrote: > > > Hi Bridger > > > > Inside > > > > https://dlcdn.apache.org/lucene/java/9.1.0/lucene-9.1.0.tgz > > > > you should find > > > > mo

Re: RangeFacetsCount Question

2022-04-26 Thread Michael Sokolov
Looking at git blame I see the current parameter was added here: https://issues.apache.org/jira/browse/LUCENE-6648. Previous implementations supported a BitSet rather than a Query. I'm not really sure what the use case is for applying additional filtering when faceting. Perhaps it can support somet

Re: Returning large resultset is slow and resource intensive

2022-03-08 Thread Michael Sokolov
Another approach for retrieving large result sets can work if you have a unique sort key and don't mind retrieving your results sorted by this key. Then you can retrieve the results in batches using a cursor-style approach; request the top N sorted by the key. Then request the top N s.t. the key i
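The cursor-style loop can be sketched over an in-memory list. The message is cut off before it states the condition on the key, so "strictly greater than the last key seen" is an assumption here, and `fetch_page` and `fetch_all` are hypothetical names rather than Lucene calls.

```python
def fetch_page(rows, page_size, after_key=None):
    """Return the next page of rows ordered by a unique key, strictly after after_key."""
    eligible = [r for r in rows if after_key is None or r["key"] > after_key]
    return sorted(eligible, key=lambda r: r["key"])[:page_size]

def fetch_all(rows, page_size):
    """Walk the whole result set in batches, carrying the last key as the cursor."""
    results, cursor = [], None
    while True:
        page = fetch_page(rows, page_size, cursor)
        if not page:
            return results
        results.extend(page)
        cursor = page[-1]["key"]

rows = [{"key": k} for k in [5, 1, 4, 2, 3]]
print([r["key"] for r in fetch_all(rows, page_size=2)])
```

Because the key is unique, each request only needs the last key from the previous batch, so no batch has to skip over earlier results.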

Re: Issue with Japanese User Dictionary

2022-01-13 Thread Michael Sokolov
Hi Marc, I wonder if there is a workaround for this issue: eg, could we have entries for both widths? I wonder if there is some interaction with an analysis chain that is doing half-width -> full-width conversion (or vice versa)? I think the UserDictionary has to operate on pre-analyzed tokens ...

Re: Moving from lucene 6.x to 8.x

2022-01-13 Thread Michael Sokolov
I think the "broken offsets" refers to offsets of tokens "going backwards". Offsets are attributes of tokens that refer back to their character position in the original indexed text. Going backwards means -- a token with a greater position (in the sequence of tokens, or token graph) should not have a le
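A check for offsets "going backwards" might look like the sketch below. It assumes tokens arrive in position order as `(term, start_offset, end_offset)` tuples and that the rule is "a start offset must not be lower than the previous token's start offset"; the helper name is hypothetical and this is not Lucene's own validation code.

```python
def find_backward_offsets(tokens):
    """Return indexes of tokens whose start offset is lower than the previous token's.

    tokens is a list of (term, start_offset, end_offset) in position order.
    """
    bad = []
    last_start = 0
    for i, (_, start, _) in enumerate(tokens):
        if start < last_start:
            bad.append(i)
        last_start = start
    return bad

ok_stream = [("quick", 0, 5), ("brown", 6, 11), ("fox", 12, 15)]
broken_stream = [("quick", 0, 5), ("fox", 6, 9), ("brown", 4, 9)]
print(find_backward_offsets(ok_stream))
print(find_backward_offsets(broken_stream))
```

Running a token stream from a custom analysis chain through a check like this is one way to find which filter emits the offending token.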

Re: Lucene 9.0.0 inconsistent index options

2021-12-14 Thread Michael Sokolov
Strictly speaking, we could have opened an older index using Lucene 8 (say one that was created using Lucene 7, or 6) that would no longer be valid in Lucene 9, at least according to the policy? I agree we should try to fix this, just want to clarify the policy On Tue, Dec 14, 2021 at 8:54 AM Adri

Re: A question on PhraseQuery and slop

2021-12-13 Thread Michael Sokolov
I wonder if the Analysis chain could be involved. If those stop words ("is") are removed without leaving a hole somehow, then that could explain? On Mon, Dec 13, 2021 at 9:35 AM Michael McCandless wrote: > > Hello Claude, > > Hmm, that is interesting that you see slop=2 matching query "quick fox"

Re: How to change sorting *after* getting search results

2021-11-30 Thread Michael Sokolov
I think you are asking how to re-sort a result set returned from IndexSearcher.search, ie a TopDocs? You can do this with one of the various Rescorers. Have you looked at those? On Tue, Nov 30, 2021, 9:15 AM Luís Filipe Nassif wrote: > Hi Lucene community, > > Our users could do very heavy searc

Re: Java 17 and Lucene

2021-10-26 Thread Michael Sokolov
'll try to reproduce the hang first and then try to get the JVM logs. > > > I'll > > > > respond back here if I find something useful. > > > > > > > > > Do you get this error in lucene:core:ecjLintMain and not during > > > compile? >

Re: Java 17 and Lucene

2021-10-20 Thread Michael Sokolov
ure as well? > > Thanks again! > Kevin > > On Tue, Oct 19, 2021 at 5:07 AM Michael Sokolov wrote: > > > > I would be a bit careful: On our Jenkins server running with AMD Ryzen CPU > > it happens quite often that JDK 16, JDK 17 and JDK 18 hang during tests and > >

Re: Java 17 and Lucene

2021-10-19 Thread Michael Sokolov
> I would be a bit careful: On our Jenkins server running with AMD Ryzen CPU it > happens quite often that JDK 16, JDK 17 and JDK 18 hang during tests and stay > unkillable (only a hard kill with "kill -9"). Previous Java versions don't > hang. It happens not all the time (about 1/4th of all builds

Re: Using setIndexSort on a binary field

2021-10-17 Thread Michael Sokolov
Yeah, index sorting doesn't do that -- it sorts *within* each segment so that when documents are iterated (within that segment) by any of the many DocIdSetIterators that underlie the Lucene search API, they are retrieved in the order specified (which is then also docid order). To achieve what you
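The difference between per-segment order and global order can be seen with plain lists. This sketch only illustrates that point; the message is cut off before it says what to do instead, so the merge step is an assumption and not the approach the author goes on to recommend.

```python
import heapq

def merged_order(segments):
    """Merge per-segment sorted runs into one globally sorted stream."""
    return list(heapq.merge(*segments))

# Each inner list is one segment, already sorted by the index sort.
segments = [[1, 4, 9], [2, 3, 10], [0, 7]]

# Visiting segments one after another is sorted only within each segment.
concatenated = [value for segment in segments for value in segment]
print(concatenated)
print(merged_order(segments))
```

Each segment is internally ordered, but reading them back to back interleaves the ranges, so a global order needs an explicit merge or a sort at query time.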

Re: Search while typing (incremental search)

2021-10-08 Thread Michael Sokolov
Thank you for offering to add to the FAQ! Indeed it should mention the suggester capability. I think you have permissions to edit that wiki? Please go ahead and I think add a link to the suggest module javadocs On Thu, Oct 7, 2021 at 2:30 AM Michael Wechner wrote: > > Thanks very much for your fe

Re: Querying into a Collector visits documents multiple times

2021-09-24 Thread Michael Sokolov
Ah sorry never mind. Confused collector and collector manager On Fri, Sep 24, 2021, 6:51 AM Michael Sokolov wrote: > Separate issue, but this collector is not going to work with concurrent > search since the sum is not updated in a thread safe manner. Maybe you > don't care, since

Re: Querying into a Collector visits documents multiple times

2021-09-24 Thread Michael Sokolov
Separate issue, but this collector is not going to work with concurrent search since the sum is not updated in a thread safe manner. Maybe you don't care, since you don't use a thread pool to execute your queries, but you probably should! On Wed, Sep 22, 2021, 8:38 AM Adrien Grand wrote: > Hi St

Re: Adding vs multiplicating scores when implementing "recency"

2021-09-17 Thread Michael Sokolov
query, and rely on log(a)+log(b) = log(a * b). > > Le ven. 17 sept. 2021 à 14:47, Michael Sokolov a > écrit : > > > Not advocating any particular approach here, just curious: could BMW > > also function in the presence of a doc-score (like recency) that is > > multi
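The identity log(a) + log(b) = log(a * b) means that adding log-scaled scores ranks documents the same way as multiplying the raw ones. A small numeric sketch, with made-up relevance and recency values and hypothetical function names:

```python
import math

def additive_score(relevance, recency):
    """Sum of logs, as an additive combination of clauses would produce."""
    return math.log(relevance) + math.log(recency)

def multiplicative_score(relevance, recency):
    """Plain product of the two raw scores."""
    return relevance * recency

# (relevance, recency) pairs; both must be positive for the logs to exist.
docs = {"a": (2.0, 0.9), "b": (3.0, 0.5), "c": (1.0, 1.0)}
by_sum = sorted(docs, key=lambda d: additive_score(*docs[d]), reverse=True)
by_product = sorted(docs, key=lambda d: multiplicative_score(*docs[d]), reverse=True)
print(by_sum, by_product)
```

Since log is strictly increasing, the two orderings always agree; only the score values differ.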

Re: Adding vs multiplicating scores when implementing "recency"

2021-09-17 Thread Michael Sokolov
Not advocating any particular approach here, just curious: could BMW also function in the presence of a doc-score (like recency) that is multiplied? My vague understanding is that as long as the scoring formula is monotonic in all of its inputs, and we have block-encoded the inputs, then we could c

Re: currency based search using query time calculated field match with expression

2021-09-03 Thread Michael Sokolov
nt hits(doc hits > matching 2 USD with 150 INR records). Any pointers to know about this in > detail? > > > Kumaran R > Chennai, India > > > > On Fri, Sep 3, 2021 at 12:08 AM Michael Sokolov wrote: > > > Have you looked at the expressions module? It pr

Re: currency based search using query time calculated field match with expression

2021-09-02 Thread Michael Sokolov
Have you looked at the expressions module? It provides support for user-defined computation using values from the index based on a simple expression language. It might prove useful to you if the exchange rate needs to be tracked very dynamically. On Thu, Sep 2, 2021 at 2:15 PM Kumaran Ramasubraman

Re: Lucene cpu utilization & scoring

2021-08-20 Thread Michael Sokolov
I think the usual usage pattern is to *refresh* frequently and commit less frequently. Is there a reason you need to commit often? You may also have overlooked this newish method: MergePolicy.findFullFlushMerges If you implement that, you can tell IndexWriter to (for example) merge multiple small

Re: Index backwards compatibility

2021-05-27 Thread Michael Sokolov
... should *reindex* ( not update ) On Thu, May 27, 2021 at 10:39 AM Michael Sokolov wrote: > > LGTM, but perhaps also should state that if possible you *should* > update because the 8.x index may not be able to be read by the > eventual 10 release. > > On Thu, May 27, 2021

Re: Index backwards compatibility

2021-05-27 Thread Michael Sokolov
work :-) > > > > Thank you very much! > > > > But IIUC it is recommended to reindex when upgrading, right? I guess > > similar to what Solr is recommending > > > > https://solr.apache.org/guide/8_0/reindexing.html > > > > > > Am 26.05.21

Re: Lucene/Solr and BERT

2021-05-26 Thread Michael Sokolov
This java implementation will be slower than the C implementation. I believe the algorithm is essentially the same, however this is new and there may be bugs! I (and I think Julie had similar results IIRC) measured something like 8x slower than hnswlib (using ann-benchmarks). It is also surprising

Re: Index backwards compatibility

2021-05-26 Thread Michael Sokolov
I think you need backward-codecs-9.0.0-SNAPSHOT there. It enables 9.0 to read 8.x indexes. On Wed, May 26, 2021 at 9:27 AM Michael Wechner wrote: > > Hi > > I am using Lucene 8.8.2 in production and I am currently doing some > tests using 9.0.0-SNAPSHOT, whereas I have included > lucene-backward-

Re: Lucene/Solr and BERT

2021-05-23 Thread Michael Sokolov
Hi Michael, that is fully-functional in the sense that Lucene will build an HNSW graph for a vector-valued field and you can then use the VectorReader.search method to do KNN-based search. Next steps may include some integration with lexical, inverted-index type search so that you can retrieve N-cl

Re: Lucene Explanation

2021-04-12 Thread Michael Sokolov
You might want to check out https://issues.apache.org/jira/browse/LUCENE-8019 where I tried to implement some debugging utilities on top of Explain. It never got committed, but it does explore some of the challenges around introducing a more structured explain response. On Fri, Apr 9, 2021 at 6:40

Re: Search results/criteria validation

2021-03-17 Thread Michael Sokolov
See https://issues.apache.org/jira/browse/LUCENE-9640 On Wed, Mar 17, 2021 at 4:02 PM Paul Libbrecht wrote: > > Explain is a heavyweight thing. Maybe it helps you, maybe you need > something high-performance. > > I was asking a similar question ~10 years ago and got a very interesting > answer on

Re: Lucene Migration query

2020-11-20 Thread Michael Sokolov
s a version stamp X-2 or > older. > > Best, > Erick > > > On Nov 20, 2020, at 7:57 AM, Michael Sokolov wrote: > > > > I think running the upgrade tool would also be necessary to set you up for > > the next upgrade, when 9.0 comes along. > > > > O

Re: Lucene Migration query

2020-11-20 Thread Michael Sokolov
I think running the upgrade tool would also be necessary to set you up for the next upgrade, when 9.0 comes along. On Fri, Nov 20, 2020, 4:25 AM Uwe Schindler wrote: > Hi, > > > Currently I am using Lucene 7.3, I want to upgrade to lucene 8.5.1. > Should > > I do reindexing in this case ? > > No

Re: https://issues.apache.org/jira/browse/LUCENE-8448

2020-11-13 Thread Michael Sokolov
You can't directly compare disk usage across two indexes, even with the same data. Try re-indexing one of your datasets, and you will see that the disk size is not the same. Mostly this is due to the way segments are merged varying with some randomness from one run to another, although the size of

Re: [VOTE] Lucene logo contest, third time's a charm

2020-09-04 Thread Michael Sokolov
A1, D, A2 (binding) On Fri, Sep 4, 2020 at 12:46 AM David Smiley wrote: > > (binding) > vote: D, A1 > > > (thanks Ryan for your thorough vote instructions & preparation)

Re: Simultaneous Indexing and searching

2020-09-01 Thread Michael Sokolov
So ... this is a fairly complex topic that I can't really cover in depth here: how to architect a distributed search engine service. Most people opt to use Solr or Elasticsearch since they solve that problem for you. Those systems work best when the indexes are local to the service that is accessing

Re: [VOTE] Lucene logo contest, here we go again

2020-09-01 Thread Michael Sokolov
A1, binding On Mon, Aug 31, 2020 at 8:26 PM Ryan Ernst wrote: > > Dear Lucene and Solr developers! > > In February a contest was started to design a new logo for Lucene > [jira-issue]. The initial attempt [first-vote] to call a vote resulted in > some confusion on the rules, as well the request

Re: Hierarchical facet select a subtree but one child

2020-08-15 Thread Michael Sokolov
If you are trying to show documents that have facet value V1 excluding those with facet value V1.1, then you would need to issue a query like: +f:V1 -f:V1.1 assuming your facet values are indexed in a field called "f". I don't think this really has anything to do with faceting; it's just a fi
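In set terms, `+f:V1 -f:V1.1` is a difference of two postings lists. The sketch below assumes every document under `V1.1` is also indexed with the parent value `V1`; the postings data and the helper name are made up for illustration.

```python
def must_and_must_not(postings, required, excluded):
    """Docs matching the required term minus docs matching the excluded term."""
    return sorted(postings.get(required, set()) - postings.get(excluded, set()))

# Facet value -> ids of documents indexed with that value in field "f".
postings = {"V1": {1, 2, 3, 4}, "V1.1": {2, 4}, "V1.2": {3}}
print(must_and_must_not(postings, "V1", "V1.1"))
```

The result keeps documents tagged only with `V1` or with other children such as `V1.2`, and drops everything under `V1.1`.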

Re: ANN search current state

2020-07-16 Thread Michael Sokolov
We have some prototype implementations in the issues you found. If you want to try out the approaches in those issues, you could build Lucene from source and patch it, but there is no release containing KNN/vector support. We're still working to establish consensus on what the best way forward is.

Re: About custom score using Solr8/Lucene8

2020-07-08 Thread Michael Sokolov
stateful or has to store an state that should be available later. > Or, on the other hand, understand if there is an order in the methods calls > (first getValues then needsScores, first advanceExact then doubleValue). > Don't you agree? > > > On Mon, Jul 6, 2020 at 4:5

Re: About custom score using Solr8/Lucene8

2020-07-06 Thread Michael Sokolov
I found that when there is explicit code many implementations return > directly: false. > > What does this mean? Why and when should I return true or false? > > > On Mon, Jul 6, 2020 at 2:50 PM Michael Sokolov wrote: > > > Did you read the DoubleValuesSourc

Re: About custom score using Solr8/Lucene8

2020-07-06 Thread Michael Sokolov
Did you read the DoubleValuesSource javadocs, and find they weren't enough? On Sun, Jul 5, 2020 at 7:54 AM Vincenzo D'Amore wrote: > > Hi all, > > Finally I have a custom DoubleValuesSource that gives the expected results, > but I'm a little worried about the lack of documentation. > > When you e

Re: Optimizing a boolean query for 100s of term clauses

2020-06-24 Thread Michael Sokolov
s at capacity, I just return 0 for any docs that had a > > boolean query score smaller than the min in the queue. > > > > But you can actually forget entirely that this ScoreFunction exists. It > > only contributes ~6% of the runtime. > > Even if I only use the Boole

Re: Optimizing a boolean query for 100s of term clauses

2020-06-23 Thread Michael Sokolov
You might consider using a TermInSetQuery in place of a BooleanQuery for the hashes (since they are all in the same field). I don't really understand why you are seeing so much cost in the heap - it sounds as if you have a single heap with mixed scores - those generated by the BooleanQuery and t

Re: [VOTE] Lucene logo contest

2020-06-16 Thread Michael Sokolov
A non-PMC On Tue, Jun 16, 2020 at 4:52 PM Bruno Roustant wrote: > > C - current logo > not PMC > > Le mar. 16 juin 2020 à 21:38, Erik Hatcher a écrit : >> >> C - current logo >> >> On Jun 15, 2020, at 6:08 PM, Ryan Ernst wrote: >> >> Dear Lucene and Solr developers! >> >> In February a contest

Re: Lucene Approximation

2020-06-02 Thread Michael Sokolov
e this approximation or at least can get the approximated value, so > that I can use it for my own calculations. > > On 2020-06-02 18:48, Michael Sokolov wrote: > > You could append an EOF token to every indexed text, and then iterate > > over Terms to get the positions o

Re: Lucene Approximation

2020-06-02 Thread Michael Sokolov
You could append an EOF token to every indexed text, and then iterate over Terms to get the positions of those tokens? On Tue, Jun 2, 2020 at 11:50 AM Moritz Staudinger wrote: > > Hello, > > I am not sure if I am at the right place here, but I got a question about > the approximation my Lucene im

Re: issue with Lucene UpdateDocument

2020-03-01 Thread Michael Sokolov
So -- you update a single document and the call to updateDocument takes 3 minutes? Or you update a single document and call commit() and that takes 3 minutes? Or -- you update 10 documents and call commit() and that takes 3 minutes? We can't help you with the level of detail you've provided. As

Re: Searching number of tokens in text field

2019-12-28 Thread Michael Sokolov
I don't know of any pre-existing thing that does exactly this, but how about a token filter that counts tokens (or positions maybe), and then appends some special token encoding the length? On Sat, Dec 28, 2019, 9:36 AM Matt Davis wrote: > Hello, > > I was wondering if it is possible to search f
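The counting idea can be sketched in plain Java, outside of Lucene's analysis chain. Everything here is an assumption for illustration: whitespace tokenization, the class and method names, and the `__len_` marker prefix are all hypothetical. A real implementation would live in a custom token filter, which this sketch does not attempt to reproduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of "count the tokens, then append a special token that
// encodes the length". Not a Lucene TokenFilter; names are hypothetical.
class LengthTokenSketch {

    // Hypothetical marker prefix; it should never collide with real terms.
    static final String PREFIX = "__len_";

    // Splits on whitespace and appends one extra token such as "__len_4".
    static List<String> withLengthToken(String text) {
        String trimmed = text.trim();
        List<String> tokens = new ArrayList<>();
        if (!trimmed.isEmpty()) {
            tokens.addAll(Arrays.asList(trimmed.split("\\s+")));
        }
        // The count is read before the marker itself is added.
        tokens.add(PREFIX + tokens.size());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(withLengthToken("the quick brown fox"));
    }
}
```

With the marker indexed as an ordinary term, finding texts of a given length becomes a plain term lookup on, for example, `__len_4`.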

Re: Using Lucene as a Document Comparison Tool

2019-12-13 Thread Michael Sokolov
Have you tried making a BooleanQuery with a term for every word in the query document as Optional? You will get a lot of matches, ranked according to the similarity. On Thu, Dec 12, 2019 at 10:47 AM John Brown wrote: > > Hi, > > > > I have some questions about how to use Lucene for the specific
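As a rough illustration of why an all-optional query ranks similar documents first, the sketch below scores each candidate by how many distinct words it shares with the query document. This is a deliberately crude stand-in: Lucene's actual similarity weighs term frequency and rarity, which this overlap count ignores. The class name, method names, and tokenization rule are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Crude model of "every query word is an optional clause": a document matches
// if it shares at least one word, and ranks higher the more words it shares.
// This is NOT Lucene scoring; it is a plain overlap count.
class OverlapRankingSketch {

    // Lowercases and splits on non-word characters; returns distinct terms.
    static Set<String> terms(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Returns indices of matching corpus documents, best match first.
    static List<Integer> rank(String queryDoc, List<String> corpus) {
        Set<String> queryTerms = terms(queryDoc);
        Map<Integer, Integer> scores = new HashMap<>();
        for (int i = 0; i < corpus.size(); i++) {
            Set<String> shared = terms(corpus.get(i));
            shared.retainAll(queryTerms);
            if (!shared.isEmpty()) {
                scores.put(i, shared.size());
            }
        }
        List<Integer> ranked = new ArrayList<>(scores.keySet());
        // Higher overlap first; ties broken by lower document index.
        ranked.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return ranked;
    }

    public static void main(String[] args) {
        List<String> corpus = List.of(
                "lucene search engine",
                "cooking recipes",
                "search engine library for lucene users");
        System.out.println(rank("lucene search library", corpus));
    }
}
```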

Re: Get distinct fields values from lucene index

2019-11-22 Thread Michael Sokolov
In Solr and ES this is done with faceting and aggregations, respectively, based on Lucene's low-level APIs. Have you looked at TermsEnum? You can use that to get all distinct terms for a segment, and then it is up to you to coalesce terms across segments ("leaves"). On Thu, Nov 21, 2019 at 1:15 AM
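The coalescing step across segments can be sketched without Lucene at all. Below, each segment is modeled as a list of its terms in sorted order, which is how a per-segment terms enumeration presents them; the merge is a simple sorted-set union. The class and method names are hypothetical, and a real implementation would read the terms from each leaf reader rather than from in-memory lists.

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Plain-Java model of coalescing distinct terms across segments ("leaves").
// Each inner list stands in for the terms enumerated from one segment.
class DistinctTermsSketch {

    // Unions the per-segment term lists into one sorted set of distinct terms.
    static SortedSet<String> distinctTerms(List<List<String>> segments) {
        SortedSet<String> merged = new TreeSet<>();
        for (List<String> segmentTerms : segments) {
            merged.addAll(segmentTerms); // duplicates across segments collapse here
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> segments = List.of(
                List.of("apple", "cherry"),
                List.of("banana", "cherry"));
        System.out.println(distinctTerms(segments));
    }
}
```

For a field with very many distinct values, holding the whole union in memory may not be acceptable; a streaming k-way merge over the already-sorted per-segment lists would avoid that, at the cost of more code.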

Re: Question about the light and minimal French stemmers

2019-07-28 Thread Michael Sokolov
ause of it does not check if the > character is letter or not. > e.g., "123455" is trimmed to "12345" by FrenchMinimalStemmer. > > To me, this behaviour is beyond stemming. > > Tomoko > > 2019-07-28 (Sun) 4:55 Michael Sokolov : > > > > I

Re: Question about the light and minimal French stemmers

2019-07-27 Thread Michael Sokolov
I'm not so sure. I think the whole idea of having both stemmers is that the minimal one does less than the light one. Removing the final character of a double letter suffix is going to sacrifice some precision. For example mes/mess, ne/née, I'm sure there are others. So having both options is hel

Re: [External] Re: How to ignore certain words based on query specifics

2019-07-10 Thread Michael Sokolov
ocument, > and if the term also matches an ignore word, then ignore the match. > > I hadn't considered the stopwords approach, I'll look into that. > If I add all the ignore words as stop words, will that affect highlighting? > Are the stopwords still available for highlight
