Re: Using setIndexSort on a binary field

2021-10-18 Thread Alex K
s > to segments - you could apply this to an existing index. But again, > this is not really intended for use in a production on-line index that > receives updates. > > On Fri, Oct 15, 2021 at 1:27 PM Alex K wrote: > > > > Thanks Adrien. This makes me think I might not be

Re: Using setIndexSort on a binary field

2021-10-15 Thread Alex K
only indexes the data while index > sorting requires doc values. > > On Fri, Oct 15, 2021 at 6:40 PM Alex K wrote: > > > Hi all, > > > > Could someone point me to an example of using the > > IndexWriterConfig.setIndexSort for a field containing binary values? >

Using setIndexSort on a binary field

2021-10-15 Thread Alex K
Hi all, Could someone point me to an example of using the IndexWriterConfig.setIndexSort for a field containing binary values? To be specific, the fields are constructed using the Field(String name, byte[] value, IndexableFieldType type) constructor, and I'd like to try using the java.util.Arrays

Re: Control the number of segments without using forceMerge.

2021-07-05 Thread Alex K
.de/sites/berlinbuzzwords.de/files/2021-06/The%20future%20of%20Lucene%27s%20MMapDirectory.pdf>, and his great post about MMapDirectory from a few years ago <https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html>. Definitely recommended for others. Thanks, Alex On Mon, Jul 5,

Re: Control the number of segments without using forceMerge.

2021-07-05 Thread Alex K
ene to run > a single query over so many indexes. > > Uwe > > - > Uwe Schindler > Achterdiek 19, D-28357 Bremen > https://www.thetaphi.de > eMail: u...@thetaphi.de > > > -Original Message- > > From: Alex K > > Sent: Monday, July 5, 2021 4:04 AM

Re: Does Lucene have anything like a covering index as an alternative to DocValues?

2021-07-05 Thread Alex K
ID is a > typical use > > case for an inverted index. If you still need to store it as DocValues > field, just > > add it with both types. > > > > Uwe > > > > - > > Uwe Schindler > > Achterdiek 19, D-28357 Bremen > > https://www.thetaphi.de > >

Control the number of segments without using forceMerge.

2021-07-04 Thread Alex K
Hi all, I'm trying to figure out if there is a way to control the number of segments in an index without explicitly calling forceMerge. My use-case looks like this: I need to index a static dataset of ~1 billion documents. I know the exact number of docs before indexing starts. I know the VM wher

Does Lucene have anything like a covering index as an alternative to DocValues?

2021-07-04 Thread Alex K
Hi all, I am curious if there is anything in Lucene that resembles a covering index (from the relational database world) as an alternative to DocValues for commonly-accessed values? Consider the following use-case: I'm indexing docs in a Lucene index. Each doc has some terms, which are not stored

Re: Lucene/Solr and BERT

2021-05-26 Thread Alex K
as possible before flushing. > > -Mike > > On Wed, May 26, 2021 at 9:43 AM Michael Wechner > wrote: > > > > Hi Alex > > > > Thank you very much for your feedback and the various insights! > > > > Am 26.05.21 um 04:41 schrieb Alex K: > > >

Re: Lucene/Solr and BERT

2021-05-25 Thread Alex K
NN search algorithms, and we have > >> been working to make sure the VectorFormat API (might still get > >> renamed due to confusion with other kinds of vectors existing in > >> Lucene) can support alternative KNN implementations. > >> > >> On Wed, M

Re: Lucene/Solr and BERT

2021-04-21 Thread Alex K
There were a couple additions recently merged into lucene but not yet released: - A first-class vector codec - An implementation of HNSW for approximate nearest neighbor search They are however available in the snapshot releases. I started on a small project to get the HNSW implementation into the

Re: How to access block-max metadata?

2020-10-12 Thread Alex K
ow > > > and ImpactsSource#getImpacts ( > > > > > > > > > https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/index/ImpactsSource.html > > > ). > > > > > > You can look at ImpactsDISI to see how this metadat

Re: How to access block-max metadata?

2020-10-12 Thread Alex K
> and ImpactsSource#getImpacts ( > > https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/index/ImpactsSource.html > ). > > You can look at ImpactsDISI to see how this metadata is leveraged in > practice to turn this metadata into score upper bounds, which is in-turn > used to skip i

How to access block-max metadata?

2020-10-11 Thread Alex K
Hi all, There was some fairly recent work in Lucene to introduce Block-Max WAND Scoring ( https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf , https://issues.apache.org/jira/browse/LUCENE-8135). I've been working on a use-case where I need very efficient top-k scoring

Re: Optimizing term-occurrence counting (code included)

2020-09-20 Thread Alex K
ching 10s to 100s of terms? It seems the bottleneck is in the PostingsFormat implementation. Perhaps there is a PostingsFormat better suited for this usecase? Thanks, Alex On Fri, Jul 24, 2020 at 7:59 AM Alex K wrote: > Thanks Ali. I don't think that will work in this case, since

Re: Simultaneous Indexing and searching

2020-09-02 Thread Alex K
FWIW, I agree with Michael: this is not a simple problem and there's been a lot of effort in Elasticsearch and Solr to solve it in a robust way. If you can't use ES/solr, I believe there are some posts on the ES blog about how they write/delete/merge shards (Lucene indices). On Tue, Sep 1, 2020 at

Re: TermsEnum.seekExact degraded performance somewhere between Lucene 7.7.0 and 8.5.1.

2020-07-26 Thread Alex K
Hi, Also have a look here: https://issues.apache.org/jira/plugins/servlet/mobile#issue/LUCENE-9378 Seems it might be related. - Alex On Sun, Jul 26, 2020, 23:31 Trejkaz wrote: > Hi all. > > I've been tracking down slow seeking performance in TermsEnum after > updating to Lucene 8.5.1. > > On 8

Re: Optimizing term-occurrence counting (code included)

2020-07-24 Thread Alex K
up in > Lucene is, but I've previously used https://github.com/npgall/cqengine for > similar stuff. It provided really good performance, especially if you're > just counting things. > > On Fri, Jul 24, 2020 at 6:55 AM Alex K wrote: > > > Hi all, > > >

Optimizing term-occurrence counting (code included)

2020-07-23 Thread Alex K
Hi all, I am working on a query that takes a set of terms, finds all documents containing at least one of those terms, computes a subset of candidate docs with the most matching terms, and applies a user-provided scoring function to each of the candidate docs. Simple example of the query: - query

Re: ANN search current state

2020-07-15 Thread Alex K
Hi Mikhail, I'm not sure about the state of ANN in lucene proper. Very interested to see the response from others. I've been doing some work on ANN for an Elasticsearch plugin: http://elastiknn.klibisz.com/ I think it's possible to extract my custom queries and modeling code so that it's elasticse

Re: Optimizing a boolean query for 100s of term clauses

2020-06-25 Thread Alex K
d > [3] : https://arxiv.org/abs/1910.10208 > > > > > > On Wed, 24 Jun 2020 at 19:47, Alex K wrote: > > > Hi Toke. Indeed a nice coincidence. It's an interesting and fun problem > > space! > > > > My implementation isn't specific to any pa

Re: Optimizing a boolean query for 100s of term clauses

2020-06-24 Thread Alex K
On Wed, Jun 24, 2020 at 8:44 AM Toke Eskildsen wrote: > On Tue, 2020-06-23 at 09:50 -0400, Alex K wrote: > > I'm working on an Elasticsearch plugin (using Lucene internally) that > > allows users to index numerical vectors and run exact and approximate > > k-nearest

Re: Optimizing a boolean query for 100s of term clauses

2020-06-24 Thread Alex K
of the speed... > > On Tue, Jun 23, 2020 at 8:52 PM Alex K wrote: > > > > The TermsInSetQuery is definitely faster. Unfortunately it doesn't seem > to > > return the number of terms that matched in a given document. Rather it > just > > returns the boost v

Re: Optimizing a boolean query for 100s of term clauses

2020-06-23 Thread Alex K
n 23, 2020 at 3:17 PM Alex K wrote: > Hi Michael, > Thanks for the quick response! > > I will look into the TermInSetQuery. > > My usage of "heap" might've been confusing. > I'm using a FunctionScoreQuery from Elasticsearch. > This gets instantiated with

Re: Optimizing a boolean query for 100s of term clauses

2020-06-23 Thread Alex K
e there really two heaps? Do you override the standard > collector? > > On Tue, Jun 23, 2020 at 9:51 AM Alex K wrote: > > > > Hello all, > > > > I'm working on an Elasticsearch plugin (using Lucene internally) that > > allows users to index numerical vectors

Optimizing a boolean query for 100s of term clauses

2020-06-23 Thread Alex K
Hello all, I'm working on an Elasticsearch plugin (using Lucene internally) that allows users to index numerical vectors and run exact and approximate k-nearest-neighbors similarity queries. I'd like to get some feedback about my usage of BooleanQueries and TermQueries, and see if there are any op