Lucene Index Cloud Replication
Hi there,

I was talking with Varun at Berlin Buzzwords a couple of weeks ago about storing and retrieving Lucene indexes in S3, and realized that "uploading a Lucene directory to the cloud and downloading it on other machines" is a pretty common problem and one that's surprisingly easy to do poorly. In my current job, I'm on my third team that needed to do this.

In my experience, there are three main pieces that need to be implemented:

1. Uploading/downloading individual files (i.e. the blob store), which can be eventually consistent if you write once.
2. Describing the metadata for a specific commit point (basically what the Replicator module does with the "Revision" class). In particular, we want a downloader to reliably be able to know if they already have specific files (and don't need to download them again).
3. Sharing metadata with some degree of consistency, so that multiple writers don't clobber each other's metadata, and so readers can discover the metadata for the latest commit/revision and trust that they'll (eventually) be able to download the relevant files.

I'd like to share what I've got for 1 and 3, based on S3 and DynamoDB, but I'd like to do it with interfaces that lend themselves to other implementations for blob and metadata storage.

Is it worth opening a Jira issue for this? Is this something that would benefit the Lucene community?

Thanks,
Michael Froh
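To make the shape of those interfaces concrete, here is a minimal sketch; the names (BlobStore, CommitMetadata, MetadataStore) are hypothetical and only illustrate the abstraction, not an actual Lucene or AWS API.

// Hypothetical interfaces sketching pieces 1, 2, and 3; not part of Lucene.
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.Optional;

/** Piece 1: write-once storage for individual index files (e.g. backed by S3). */
interface BlobStore {
  void upload(String key, InputStream data, long length) throws IOException;
  InputStream download(String key) throws IOException;
}

/** Piece 2: describes one commit point as a map from index file name to blob key. */
interface CommitMetadata {
  long generation();
  Map<String, String> files(); // index file name -> blob key
}

/** Piece 3: consistent store (e.g. backed by DynamoDB) for publishing/discovering commits. */
interface MetadataStore {
  /** Publishes a commit; fails if another writer already published this generation. */
  void publish(CommitMetadata commit) throws IOException;
  /** Returns the latest published commit, if any. */
  Optional<CommitMetadata> latest() throws IOException;
}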
Re: PhraseQuery
Did you check the Javadoc for PhraseQuery.Builder? https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/search/PhraseQuery.Builder.html Checking the source code, I see that the add method that takes a position argument will throw an IllegalArgumentException if you try to add a Term in a lower position than the previous Term. (That is, Term positions must be non-decreasing.) Hope that helps, Michael On Fri, 24 Jan 2020 at 09:45, wrote: > Hi,- > > how do i enforce the order of sequence of terms in the PhraseQuery > builder? > Lucene docs are very hard to understand in terms of api descriptions. > > > https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/search/PhraseQuery.html > Best regards > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
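For reference, a small sketch of building an ordered phrase with explicit positions (the field and terms here are made up); adding a term at a lower position than the previous one is what triggers the IllegalArgumentException described above.

PhraseQuery.Builder builder = new PhraseQuery.Builder();
builder.add(new Term("body", "quick"), 0);
builder.add(new Term("body", "fox"), 2);   // positions must be non-decreasing
builder.setSlop(0);                        // require the terms in exactly these relative positions
PhraseQuery query = builder.build();
// builder.add(new Term("body", "brown"), 1); // would throw IllegalArgumentException: 1 < 2

If you just call add(Term) without a position, terms are appended at increasing positions, so the order of your add() calls is the order the phrase enforces.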
Re: Scoring Across Multiple Fields
Hi John,

A TermQuery produces a scorer that can compute similarity for a given term value against a given field, in the context of the index, so as you say, it produces a score for one field.

If you want to match a given term value across multiple fields, indeed you could use a BooleanQuery with the TermQueries in SHOULD clauses. The vanilla BooleanQuery produces a score which is the sum of all matching clauses' scores (or at least that's the interpretation I get from reading the source code of the explain() method in BooleanWeight).

You can also look into DisjunctionMaxQuery, which works like a disjunctive BooleanQuery, but it returns the maximum score across matching clauses. The idea here is that if, say, you're matching across title and body fields, a title match may score higher (perhaps because it's been boosted). If you sum the scores across fields, you're likely just inflating those title matches even more (since a title match is probably highly correlated with a body match). (The DisjunctionMaxQuery also has an optional "tieBreakerMultiplier" property that you can use to weight the scoring somewhere between pure max and pure sum -- like "Use the maximum score, plus 0.001 times the sum of the rest".)

Hope that helps,
Michael

On Mon, 27 Jan 2020 at 13:37, John Brown wrote:
> Hi,
>
> I have a question regarding how Lucene computes document similarities from
> field similarities.
>
> Lucene's scoring documentation mentions that scoring works on fields and
> combines the results to return documents. I'm assuming fields are given
> scores, and those scores are simply averaged to return the document score?
>
> If this is the case, then in order to incorporate multiple fields in my
> scoring, I would use multiple term queries that contain the same term, but
> target different fields, then I would simply put them in a boolean query,
> and search my index using this boolean query.
>
> Am I going about this in the correct way? Any clarification would be
> greatly appreciated.
>
> Thank you,
> John B
>
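As a concrete illustration, here is a minimal sketch of both options (the field names, term, and the 0.001 tie-breaker are just example values):

Term title = new Term("title", "lucene");
Term body = new Term("body", "lucene");

// Sum of matching clauses' scores:
Query summed = new BooleanQuery.Builder()
    .add(new TermQuery(title), BooleanClause.Occur.SHOULD)
    .add(new TermQuery(body), BooleanClause.Occur.SHOULD)
    .build();

// Max score across matching clauses, plus 0.001 times the other matching clauses' scores:
Query disMax = new DisjunctionMaxQuery(
    List.of(new TermQuery(title), new TermQuery(body)), 0.001f);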
Re: SingleTerm vs MultiTerm in PhraseWildCardQuery class in the sandbox Lucene
Hi Baris,

The idea with PhraseWildcardQuery is that you can mix literal "exact" terms with "MultiTerms" (i.e. any subclass of MultiTermQuery). Using addTerm is for exact terms, while addMultiTerm is for things that may match a number of possible terms in the given position.

If you want to search for term1 followed by any term that starts with a given character, I would suggest using:

int maxMultiTermExpansions = ...; // Discussed below
PhraseWildcardQuery.Builder builder = new PhraseWildcardQuery.Builder("field", maxMultiTermExpansions);
builder.addTerm(new BytesRef("term1")); // Add fixed term in position 0
builder.addMultiTerm(new PrefixQuery(new Term("field", "term2FirstChar"))); // Add multiterm in position 1
Query q = builder.build();

The PrefixQuery effectively gets expanded into a bunch of possible terms, based on the term dictionary on each index segment. To avoid expanding to cover too many terms (say, if you added a bunch of WildcardQuery), maxMultiTermExpansions serves as a guard rail, to put a rough bound on memory consumption and query execution time. If you're interested in details of how the maxMultiTermExpansions budget is distributed across MultiTerms, check out PhraseWildcardQuery.createWeight. If you're just running an experiment in your IDE, you could probably set maxMultiTermExpansions to Integer.MAX_VALUE. (If you're running in a production environment, it's likely a good idea to tune it down based on your memory/latency constraints.)

Incidentally, for tracking down the source code for anything in Lucene, it's probably better to go to GitHub for the most up-to-date source: https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java .

Hope that helps,
Michael

On Thu, 13 Feb 2020 at 12:29, wrote:
> Hi,-
>
> i hope everyone is doing great.
>
> if i want to do the following search with PhraseWildCardQuery and
> thanks to this forum for letting me know about this class (Especially to
> David and Bruno)
>
> term1 term2FirstChar*
>
> i need to do two ways: (i found the source code at
>
> https://fossies.org/linux/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java
> )
>
> /*
>
> maxMultiTermExpansions - The maximum number of expansions across all
> multi-terms and across all segments. It counts expansions for each
> segments individually, that allows optimizations per segment and unused
> expansions are credited to next segments. This is different from
> MultiPhraseQuery and SpanMultiTermQueryWrapper which have an expansion
> limit per multi-term.
>
> segmentOptimizationEnabled - Whether to enable the segment optimization
> which consists in ignoring a segment for further analysis as soon as a
> term is not present inside it. This optimizes the query execution
> performance but changes the scoring. The result ranking is preserved.
>
> */
>
>
> 1st way:
>
> PhraseWildCardQuery.Builder builder = PharseWildCardQuery.Builder(field,
> 2 _*/<<< i dont know what number to use here for
> maxMultiTermExpansions>>>/*_, true/*boolean segmentOptimizationEnabled*/)
>
> pwcqBuilder.addTerm(field, new Term(field, "term1"));
>
> pwcqBuilder.addTerm(field,new Term(field, "term2FirstChar"));
>
> PhraseWildCardQuery pwcq = pwcqBuilder.build();
>
> or
>
> 2nd way:
>
> pwcqBuilder.addMultiTerm(MultiTermQuery object here contaning {field,
> "term1"} and {field ,"term2FirstChar"});
>
> PhraseWildCardQuery pwcq = pwcqBuilder.build();
>
>
> Then this pwcq object will be fed into IndexSearcher's as the query
> parameter.
> > > Now, it looks like the first way will not consider expansions or in > other words wildcard? Am i right? > > i also need to understand this maxMultiTermExpansions parameter better. > For instance if first way is used, will maxMultiTermExpansions be > meaningful? > > > Thanks > >
Re: SingleTerm vs MultiTerm in PhraseWildCardQuery class in the sandbox Lucene
In your example, it looks like you wanted the second term to match based on the first character, or prefix, of the term. While you could use a WildcardQuery with a term value of "term2FirstChar*", PrefixQuery seemed like the simpler approach. WildcardQuery can handle more general cases, like if you want to match on something like "a*b*c".

Technically, the PrefixQuery compiles down to a slightly simpler automaton, but I only figured that out by writing a simple unit test:

public void testAutomata() {
  Automaton prefixAutomaton = PrefixQuery.toAutomaton(new BytesRef("a"));
  Automaton wildcardAutomaton = WildcardQuery.toAutomaton(new Term("foo", "a*"));
  System.out.println("PrefixQuery(\"a\")");
  System.out.println(prefixAutomaton.toDot());
  System.out.println("WildcardQuery(\"a*\")");
  System.out.println(wildcardAutomaton.toDot());
}

That produces the following output:

PrefixQuery("a")
digraph Automaton {
  rankdir = LR
  node [width=0.2, height=0.2, fontsize=8]
  initial [shape=plaintext,label=""]
  initial -> 0
  0 [shape=circle,label="0"]
  0 -> 1 [label="a"]
  1 [shape=doublecircle,label="1"]
  1 -> 1 [label="\\U-\\U00ff"]
}

WildcardQuery("a*")
digraph Automaton {
  rankdir = LR
  node [width=0.2, height=0.2, fontsize=8]
  initial [shape=plaintext,label=""]
  initial -> 0
  0 [shape=circle,label="0"]
  0 -> 1 [label="a"]
  1 [shape=doublecircle,label="1"]
  1 -> 2 [label="\\U-\\U0010"]
  2 [shape=doublecircle,label="2"]
  2 -> 2 [label="\\U-\\U0010"]
}

On Tue, 18 Feb 2020 at 13:52, wrote:
> Michael and Forum,-
> Thanks for thegreat explanations.
>
> one question please:
>
> why is PrefixQuery used instead of WildCardQuery in the below snippet?
>
> Best regards
>
> > On Feb 17, 2020, at 3:01 PM, Michael Froh wrote:
> >
> > Hi Baris,
> >
> > The idea with PhraseWildcardQuery is that you can mix literal "exact"
> terms
> > with "MultiTerms" (i.e. any subclass of MultiTermQuery). Using addTerm is
> > for exact terms, while addMultiTerm is for things that may match a number
> > of possible terms in the given position.
> >
> > If you want to search for term1 followed by any term that starts with a
> > given character, I would suggest using:
> >
> > int maxMultiTermExpansions = ...; // Discussed below
> > PhraseWildCardQuery.Builder builder = new PhraseWildcardQuery("field",
> > maxMultiTermExpansions);
> > builder.addTerm(new BytesRef("term1")); // Add fixed term in position 0
> > builder.addMultiTerm(new PrefixQuery(new Term("field",
> "term2FirstChar")));
> > // Add multiterm in position 1
> > Query q = builder.build();
> >
> > The PrefixQuery effectively gets expanded into a bunch of possible terms,
> > based on the term dictionary on each index segment. To avoid expanding to
> > cover too many terms (say, if you added a bunch of WildcardQuery),
> > maxMultiTermExpansions serves as a guard rail, to put a rough bound on
> > memory consumption and query execution time. If you're interested in
> > details of how the maxMultiTermExpansions budget is distributed across
> > MultiTerms, check out PhraseWildcardQuery.createWeight. If you're just
> > running an experiment in your IDE, you could probably set
> > maxMultiTermExpansions to Integer.MAX_VALUE. (If you're running in a
> > production environment, it's likely a good idea to tune it down based on
> > your memory/latency constraints.)
> > > > Incidentally, for tracking down the source code for anything in Lucene, > > it's probably better to go to GitHub for the most up-to-date source: > > > https://urldefense.com/v3/__https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java__;!!GqivPVa7Brio!ONqQgLIltNBUuSo5Cn_Fz7-wuR1LQv68YS_z-6g7X-S86PHQtT9tKl7VbIq9tVLYyw$ > > . > > > > Hope that helps, > > Michael > > > >> On Thu, 13 Feb 2020 at 12:29, wrote: > >> > >> Hi,- > >> > >> i hope everyone is doing great. > >> > >> if i want to do the following search with PhraseWildCardQuery and > >> thanks to this forum for letting me know about this class (Especially to > >> David and Bruno) > >> > >> term1 term2FirstChar* > >> > >> i need to do two ways: (i found the source code at > >
Re: What is the Lucene 8.4.1 equivalent for StandardAnalyzer.STOP_WORDS_SET
Those words ( https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.3.1/lucene/core/src/java/org/apache/lucene/analysis/standard/StandardAnalyzer.java#L44-L49) have been moved to EnglishAnalyzer ( https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.4.1/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishAnalyzer.java#L47-L51 ). On Mon, 24 Feb 2020 at 15:56, wrote: > Hi,- > > I hope everyone is doing great. > > What is the Lucene 8.4.1 equivalent for StandardAnalyzer.STOP_WORDS_SET? > > > https://lucene.apache.org/core/7_3_1/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html#STOP_WORDS_SET > > > https://lucene.apache.org/core/8_4_1/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html > > Best regards > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
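If you just need the set itself, a minimal sketch of the 8.x replacement (using the constant linked above):

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

CharArraySet stopWords = EnglishAnalyzer.ENGLISH_STOP_WORDS_SET;
// StandardAnalyzer no longer removes stop words by default in 8.x;
// pass the set explicitly if you still want the old behavior:
StandardAnalyzer analyzer = new StandardAnalyzer(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);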
Re: How can I boost score of a document if two consecutive terms match
Hi John,

What you're looking for sounds like Solr's pf2 parameter (see https://lucene.apache.org/solr/guide/8_6/the-extended-dismax-query-parser.html#extended-dismax-parameters and https://lucene.apache.org/solr/guide/8_6/the-dismax-query-parser.html#pf-phrase-fields-parameter for details). Basically, behind the scenes, it takes successive pairs of terms, and treats them as boosted phrase query clauses. So, a query like "t1 t2 t3" with a pf2 boost of 5 would become roughly:

t1 OR t2 OR t3 OR "t1 t2"^5 OR "t2 t3"^5

Alternatively, since it sounds like you want to boost matches where two consecutive words are both present in the same document, rather than requiring that they're present in order, you could parse the query to:

t1 OR t2 OR t3 OR (t1 AND t2)^5 OR (t2 AND t3)^5

Are you using a QueryParser implementation or are you just running the query string through an Analyzer and producing your own BooleanQuery? If the latter, you could directly produce the second query (wrapping the nested AND queries in a BoostQuery).

Would that do what you want?

Michael

On Fri, Oct 30, 2020 at 2:15 PM YAN PAN wrote:
> Hi there,
> I recently am developing my own search based on lucene, here is the use
> case I am concerned about.
>
> we have two documents in the index
> a) content:new jersey
> b) content:new year
>
> the query is "he is celebrating the new year in jersey city".
>
>
> If I tokenize the queries and add all terms to a boolean query, the
> document will have the same score for the two queries, but what I want is
> that b scores higher than a, what similarity should I use, or how can I
> tweak the internal of Lucene to achieve the goal?
>
> Please note that I cannot extract the phrase "new year" at compile time, so
> it seems to me that PhraseQuery is not an approach.
>
> Thank you very much for the help!
> John
>
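If you're building the BooleanQuery yourself, here is a rough sketch of the second form (the field name, terms, and boost of 5 are placeholders):

// Sketch: boost documents where consecutive query terms co-occur.
String field = "content";
String[] terms = {"t1", "t2", "t3"};
float pairBoost = 5f;

BooleanQuery.Builder top = new BooleanQuery.Builder();
for (String t : terms) {
  top.add(new TermQuery(new Term(field, t)), BooleanClause.Occur.SHOULD);
}
for (int i = 0; i + 1 < terms.length; i++) {
  BooleanQuery.Builder pair = new BooleanQuery.Builder();
  pair.add(new TermQuery(new Term(field, terms[i])), BooleanClause.Occur.MUST);
  pair.add(new TermQuery(new Term(field, terms[i + 1])), BooleanClause.Occur.MUST);
  top.add(new BoostQuery(pair.build(), pairBoost), BooleanClause.Occur.SHOULD);
}
Query query = top.build();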
Re: DisjunctionMinQuery
Hi Marc, Can you clarify what the semantics of a DisjunctionMinQuery would be? Would you keep the score for the *lowest* scoring disjunct (plus some tiebreaker applied to the other matching disjuncts)? I'm trying to imagine how that would work compared to the classic DisMax use-case. Say I'm searching for "dalmatian" using a DisMax query over term queries against title and body. A match on title is probably going to score higher than a match against the body, just because the title has a shorter length (and the doc frequency of individual terms in the title is likely to be lower, since there are fewer terms overall). With DisMax, a match on title alone will score higher than a match on body, and the tie-break will tend to score a match on title and body higher than a match on title alone. With a DisMin (assuming you keep the lowest score), then a match on title and body would probably score lower than a match on title alone. That feels weird to me, but I might be missing the use-case. How would you use a DisMinQuery? Thanks, Froh On Wed, Nov 8, 2023 at 10:50 AM Marc D'Mello wrote: > Hi all, > > I noticed we have a DisjunctionMaxQuery > < > https://github.com/apache/lucene/blob/branch_9_7/lucene/core/src/java/org/apache/lucene/search/DisjunctionMaxQuery.java > > > but > not a corresponding DisjunctionMinQuery. I was just wondering if there was > a specific reason for that? Or is it just that it is not a common query to > use? > > Thanks! > Marc >
Re: get distinct values from indexreader for given field
Hello!

Instead of MultiFields.getFields(), you can use MultiTerms.getTerms(reader, fieldname) to get the Terms instance.

To decode your long / int values, you should be able to use LongPoint/IntPoint.unpack to write the values into an array:

long[] val = new long[1]; // Assuming 1-D values
LongPoint.unpack(value, 0, val);
values.add(val[0]);

Hope that helps,
Froh

On Wed, Nov 22, 2023 at 11:09 AM wrote:
> Hello,
>
> In Lucene 6 I was doing this to get all values for a given field
> knowing its type:
>
> public List getDistinctValues(IndexReader reader, String fieldname,
> Class type) throws IOException {
>
> List values = new ArrayList();
> Fields fields = MultiFields.getFields(reader);
> if (fields == null) return values;
>
> Terms terms = fields.terms(fieldname);
> if (terms == null) return values;
>
> TermsEnum iterator = terms.iterator();
>
> BytesRef value = iterator.next();
>
> while (value != null) {
> if (type == Long.class) {
> values.add(LegacyNumericUtils.prefixCodedToLong(value));
> } else if (type == Integer.class) {
> values.add(LegacyNumericUtils.prefixCodedToInt(value));
> } else if (type == Boolean.class) {
> values.add(LegacyNumericUtils.prefixCodedToInt(value) == 1 ?
> TRUE : FALSE);
> } else if (type == Date.class) {
> values.add(new
> Date(LegacyNumericUtils.prefixCodedToLong(value)));
> } else if (type == String.class) {
> values.add(value.utf8ToString());
> } else {
> // ...
> }
>
> value = iterator.next();
> }
>
> return values;
> }
>
> I am trying to upgrade to lucene 9.
> there were 2 changes over time:
> - LegacyNumericUtils has been removed in favor of PointBase
> - MultiFields.getFields() has been dropped, and I read we were encouraged
> to avoid fields in general
>
> what is proper way to implement getting distinct values for a specific
> field in a reader?
>
> thanks for your help,
>
> vs
>
Re: get distinct values from indexreader for given field
Oh -- of course if you're using IntPoint / LongPoint for your numeric fields, they won't be indexed as terms, so loading terms for them won't work.

It's not the prettiest solution, but I think the following should let you collect the set of distinct point values for an IntPoint field:

final Set<Integer> collectedValues = new TreeSet<>();
for (LeafReaderContext lrc : reader.leaves()) {
  LeafReader lr = lrc.reader();
  PointValues.IntersectVisitor collectingVisitor = new PointValues.IntersectVisitor() {
    @Override
    public void visit(int docID) throws IOException {
    }

    @Override
    public void visit(int docID, byte[] packedValue) {
      collectedValues.add(IntPoint.decodeDimension(packedValue, 0));
    }

    @Override
    public PointValues.Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
      return PointValues.Relation.CELL_CROSSES_QUERY;
    }
  };
  lr.getPointValues(fieldname).intersect(collectingVisitor);
}

On Tue, Nov 28, 2023 at 1:42 PM Michael Froh wrote:
> Hello!
>
> Instead of MultiFields.getFields(), you can use
> MultiTerms.getTerms(reader, fieldname) to get the Terms instance.
>
> To decode your long / int values, you should be able to use
> LongPoint/IntPoint.unpack to write the values into an array:
>
> long[] val = new long[1]; // Assuming 1-D values
> LongPoint.unpack(value, 0, val);
> values.add(val[0]);
>
> Hope that helps,
> Froh
>
>
> On Wed, Nov 22, 2023 at 11:09 AM wrote:
>
>> Hello,
>>
>> In Lucene 6 I was doing this to get all values for a given field
>> knowing its type:
>>
>> public List getDistinctValues(IndexReader reader, String
>> fieldname,
>> Class type) throws IOException {
>>
>> List values = new ArrayList();
>> Fields fields = MultiFields.getFields(reader);
>> if (fields == null) return values;
>>
>> Terms terms = fields.terms(fieldname);
>> if (terms == null) return values;
>>
>> TermsEnum iterator = terms.iterator();
>>
>> BytesRef value = iterator.next();
>>
>> while (value != null) {
>> if (type == Long.class) {
>> values.add(LegacyNumericUtils.prefixCodedToLong(value));
>> } else if (type == Integer.class) {
>> values.add(LegacyNumericUtils.prefixCodedToInt(value));
>> } else if (type == Boolean.class) {
>> values.add(LegacyNumericUtils.prefixCodedToInt(value) == 1 ?
>> TRUE : FALSE);
>> } else if (type == Date.class) {
>> values.add(new
>> Date(LegacyNumericUtils.prefixCodedToLong(value)));
>> } else if (type == String.class) {
>> values.add(value.utf8ToString());
>> } else {
>> // ...
>> }
>>
>> value = iterator.next();
>> }
>>
>> return values;
>> }
>>
>> I am trying to upgrade to lucene 9.
>> there were 2 changes over time:
>> - LegacyNumericUtils has been removed in favor of PointBase
>> - MultiFields.getFields() has been dropped, and I read we were encouraged
>> to avoid fields in general
>>
>> what is proper way to implement getting distinct values for a specific
>> field in a reader?
>>
>> thanks for your help,
>>
>> vs
>>
Re: Updating document with IndexWriter#updateDocument doesn't seem to take effect
Hi Wojtek, Thank you for linking to your test code! When you open an IndexReader, it is locked to the view of the Lucene directory at the time that it's opened. If you make changes, you'll need to open a new IndexReader before those changes are visible. I see that you tried creating a new IndexSearcher, but unfortunately that's not sufficient. Hope that helps! Froh On Fri, Aug 9, 2024 at 3:25 PM Wojtek wrote: > Hi all! > > There is an effort in Apache James to update to a more modern version of > Lucene (ref: > https://github.com/apache/james-project/pull/2342). I'm digging into the > issue as other have done > but I'm stumped - it seems that > `org.apache.lucene.index.IndexWriter#updateDocument` doesn't update > the document. > > Documentation > ( > https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/index/IndexWriter.html#updateDocument(org.apache.lucene.index.Term,java.lang.Iterable)) > > states: > > > Updates a document by first deleting the document(s) containing term > and then adding the new > document. The delete and then add are atomic as seen by a reader on the > same index (flush may happen > only after the add). > > Here is a simple test with it: > > https://github.com/woj-tek/lucene-update-test/blob/master/src/test/java/se/unir/AppTest.java > > but it fails. > > Any guidance would be appreciated because I (and others) have been hitting > wall with it :) > > -- > Wojtek > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
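As a rough illustration of the refresh step (a sketch only; 'directory' and 'writer' are assumed to exist in the test):

DirectoryReader reader = DirectoryReader.open(directory);
// ... IndexWriter#updateDocument + commit happens here ...
DirectoryReader newReader = DirectoryReader.openIfChanged(reader); // null if nothing changed
if (newReader != null) {
  reader.close();
  reader = newReader;
}
IndexSearcher searcher = new IndexSearcher(reader); // search over the refreshed view

// Or, for near-real-time visibility without waiting for a commit:
DirectoryReader nrtReader = DirectoryReader.open(writer);

In production code, SearcherManager wraps this refresh-and-swap logic (and the reference counting) for you.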
Re: Learning resources for Lucene Development
Hi Marc, In some shameless self-promotion, I've written up some worked Lucene examples (maybe a little more focused on Lucene internals than best practices) over at https://github.com/msfroh/lucene-university. If you have anything you'd like to understand better, feel free to open issues there and I'm happy to write up examples. (I've started getting a few other contributors, which is pretty cool.) I also used to run a weekly Lucene study group on Zoom (with a bit of a focus on OpenSearch, but still mostly aimed at Lucene). The last occurrence was in August (https://www.meetup.com/opensearch/events/302874684/). I'd love to start that back up and switch to more of a question and answer / tutorial format. We could record them and stick them up on YouTube, if that would be helpful. Thanks, Froh On Tue, Oct 8, 2024 at 7:46 PM Navneet Verma wrote: > +1 on the question. > > On Tue, Oct 8, 2024 at 6:35 PM Marc Davenport > wrote: > > > Hello, > > I had this question buried in a previous email. I feel like I have a > very > > loose grasp on the Lucene API and how to properly implement with it. I'm > > working on code that I didn't write myself from the ground up. Since I'm > > learning as I'm reading it, I can only assume things were done right. As > I > > look to improve and change our code, I don't know how to distinguish > what's > > good and bad practice. We leverage some features that don't seem to be > > rudamentary. I'm looking for any in person learning opportunities, > > training, workshops, etc that really dives into how lucene works to give > me > > a better handle on what we have vs what we could. I'm open to personal > > direct education if that's an option. Any professional development > > opportunities, consultant groups with a good education reputation, etc > > would be welcome. > > Thank you, > > Marc > > >
Re: Understanding Document ID (Lucene 10.0.0)
Hi Prashant, For your particular use-case, you probably don't need to join across multiple indices. Lucene is able to maintain multiple data structures per field, with the selection of data structures coming from attributes of the field's type. If you have a field that you want to return, but doesn't need to be searchable (like your HTML report), you can add it as an unindexed string field that's stored. That will write it to the stored fields data structure (which is used to populate search results), but won't build a full-text index for it. The slight downside of that approach is that all stored fields for a document are compressed and written together. If users mostly just want the name, age, and city fields (and only rarely care about the report field), then maybe storing it in a separate index might make sense. In that case, adding an ID keyword field to both indices is a viable option. Doing a term query on the secondary index to find the appropriate docs should generally be quite fast -- while Lucene is not primarily a key-value store, it works surprisingly well as one. Hope that helps, Froh On Fri, Oct 25, 2024 at 8:28 AM Prashant Saxena wrote: > I'm new to Lucene and trying to understand the concept of unique document > id, something like a primary key in databases like sql or sqlite etc. > While searching, I came across this article: > https://blog.mikemccandless.com/2014/05/choosing-which actually > fast-unique-identifier-uuid.html > < > https://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html > > > which actually is quite old but it's been said elsewhere, it's still > applicable on the latest version as internally things > are not changed much in Lucene. > > What have I done so far? > > I have created a simple index where few text files are written as > documents, when I open this index in Luke (GUI), on the *Document* tab, > I see an option along with a spinner control: > > *Browse document by Doc # | 0| in 100 docs* > > If you change the value, it shows the document at the bottom of the GUI. > The id seems to be a number in which documents are stored. > > *Question:* > How can one access this id? > > Why do I need a unique id? > > Let's assume I have created a simple index with three fields: name, age & > city. There is another field, the associated long html text, > which I am writing in another index. In a GUI environment where users can > search by typing the search term in four of the fields. > > name | | > age| | > city| | > report || > Usually, people are interested in the first three fields. Report field is > not used as much but still available if somebody is interested. > > *Option 1* > When the user is searching only using name, age, city, I'll open the first > index, do the search, get the documents and their ids, get the report field > directly from the second index using the id. This way > no searching is required in the second index. > > *Option 2* > I have recently started learning Lucene and right now I haven't touch the > joining part but still here is the question > If a user has given a search term in all of the four fields then logically > you have to search in both the indexes and find the common doc ids in both > searched results. > > *Question* > How will this joining happen, to get the correct results from both the > indexes? If possible please refer to some online code example links. > > Prashant >
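For the single-index option, here is a minimal sketch of the field setup described above (field names mirror your example; 'writer' and 'reportHtml' are assumed to exist, and these field types are one reasonable choice, not the only one):

Document doc = new Document();
doc.add(new StringField("id", "user-42", Field.Store.YES));   // or KeywordField on recent Lucene versions
doc.add(new TextField("name", "Jane Doe", Field.Store.YES));
doc.add(new IntPoint("age", 34));                             // indexed for range/exact queries
doc.add(new StoredField("age", 34));                          // stored copy so it comes back in results
doc.add(new TextField("city", "Pune", Field.Store.YES));
doc.add(new StoredField("report", reportHtml));               // stored only: returned with hits, not searchable
writer.addDocument(doc);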
Re: NRT segment replication in AWS
On Sun, Mar 2, 2025 at 7:21 AM Marc Davenport wrote:
>
> @Michael - That second simpler architecture is very similar to what we are
> considering; With the exception of a queue for announcing new
> segments rather than a polling process. It is good to know that it's a
> reasonable outline. You were very latency sensitive. Is there anything
> you can share around the most important specs of your containers, pods, or
> even nodes? While we run on these EC2 instances, there is a push to get us
> into k8s. Did you have any issues as you migrated from one version of
> lucene to another. I'm concerned that our current deployments only allow
> one version of the software in production at any one time.
>

From what I remember, we did see a bit of noisy-neighbor behavior running two searchers in separate containers on the same instance. I don't think we ever found a concrete reason (at least while I was there). We gave up and ended up using smaller instances instead of putting multiple searchers on the same instance.

For migration between different Lucene versions (or even versions of our software, which might change indexing behavior), we used the process described in https://careersatdoordash.com/blog/introducing-doordashs-in-house-search-engine/ in the section "Tenant Isolation and Search Stacks". Essentially, bring up a new indexer fleet, build a new index, then start bringing up new searchers that replicate from the new indexers.
Re: NRT segment replication in AWS
Hi there,

I'm happy to share some details about how Amazon Product Search does its segment replication. I haven't worked on Product Search in over three years, so anything that I remember is not particularly novel. Also, it's not really secret sauce -- I would have happily talked about it more in the 2021 re:Invent talk that Mike Sokolov and I did, but we were trying to keep within our time limit. :)

That model doesn't exactly have direct communication between primary and replica (which is generally a good practice in a cloud-based solution -- the fewer node-to-node dependencies, the better). The flow (if I recall correctly) is driven by a couple of side-car components for the writers and searchers and is roughly like this:

1. At a specified (pretty coarse) interval, the writer side-car calls a "create checkpoint" API on the writer to ask it to write a checkpoint.
2. The writer uploads new segment files to S3, and a metadata object describing the checkpoint contents (which probably includes segments from an earlier checkpoint, since they can be reused).
3. The writer returns the S3 URL for the metadata object to its side-car.
4. The writer side-car publishes the metadata URL to "something" -- see below for details.
5. The searcher side-cars all read the metadata URL from "something" -- see below for details.
6. The searcher side-cars each call a "use checkpoint" API on their local searchers.
7. The searchers each download the new segment files from S3 and open new IndexSearchers.

For the details of steps 4 and 5, I don't actually remember how it worked, but I have two pretty good guesses from what I remember of the overall architecture:

1. DynamoDB: This is the more likely mechanism. Each index shard has a unique ID which serves as a partition key in DynamoDB and there's a sequence number as a sort key. The writer side-car inserts a DynamoDB record with the next sequence number and the metadata URL. The searcher side-car periodically fetches 1 record with the partition key by descending sequence number (i.e. get latest sequence entry for the partition key). If the sequence number has increased, then call the searcher's use-checkpoint API.
2. Kinesis: This feels like the less likely mechanism, but I guess it could work. The writer side-car writes the metadata URL to a Kinesis stream. Each searcher side-car reads from the Kinesis stream and passes the metadata URL to the searcher. I'm pretty sure we didn't have one Kinesis stream per index shard, because managing (and paying for) that many Kinesis streams would be a pain. Even with a sharded Kinesis stream, you'd "leak" some checkpoints across index shards, leading to data that the searcher side-cars would throw away. Also, each Kinesis stream shard has a limited number of concurrent consumers, which would mean that the number of search replicas would be limited.

I'm *pretty sure* we used the DynamoDB approach.

Another Lucene-based search system that I worked on many years ago at Amazon had a much simpler architecture:

1. Writer periodically writes new segments to S3.
2. After writing the new segments, the writer writes a metadata object to S3 with a path like "<writer prefix>/<sequence number>/metadata.json". Because the writer was guaranteed to be the *only* thing writing with prefixes of <writer prefix>, it could manage its own dense sequence numbers.
3. A searcher is "sticky" to a writer, and periodically issues an S3 GetObject for the next metadata object's full URL (i.e. the URL using the next dense sequence number). Until the next checkpoint is written, it gets a 404 response.
4. Searcher fetches the files referenced by the metadata file.

A nice thing about that approach was that it only depended on S3 and only used PutObject and GetObject APIs, which tend to be more consistent. The downside was that we needed a separate mechanism for writer discovery and failover, to let searchers know the correct writer prefix.

Hope that helps! Let me know if you need any other suggestions.

Thanks,
Froh

On Wed, Feb 26, 2025 at 3:31 PM Steven Schlansker < stevenschlans...@gmail.com> wrote:
>
>
> > On Feb 26, 2025, at 2:53 PM, Marc Davenport
> > wrote:
> >
> > Hello,
> > Our current search solution is a pretty big monolith running on pretty
> > beefy EC2 instances. Every node is responsible for indexing and serving
> > queries. We want to start decomposing our service and are starting with
> > separating the indexing and query handling responsibilities.
>
> We run a probably comparatively small but otherwise similar installation,
> using
> Google Kubernetes instances. We just use a persistent disk instead of an
> elastic store, but
> also would consider using something like S3 in the future.
>
> > I'm in the research phases now trying to collect any prior art I can. The
> > rough sketch is to implement the NRT two replication node classes on
> their
> > respective services and use S3 as a distribution point for the segment
> > files. I'm still debating if there should be some direct knowledge o
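To make the simpler S3-only design a bit more concrete, here is a rough sketch of the searcher-side polling loop; fetchMetadata, downloadIfMissing, and useCheckpoint are made-up helper names (not real S3 or Lucene APIs), and the 10-second poll interval is arbitrary.

// Hypothetical searcher loop for the S3-only architecture sketched above.
static void pollCheckpoints(String bucket, String writerPrefix, long startSequence)
    throws Exception {
  long nextSequence = startSequence;
  while (true) {
    String key = writerPrefix + "/" + nextSequence + "/metadata.json";
    List<String> segmentFiles = fetchMetadata(bucket, key); // S3 GetObject; null on 404
    if (segmentFiles == null) {
      Thread.sleep(10_000); // next checkpoint not published yet; poll again later
      continue;
    }
    for (String file : segmentFiles) {
      downloadIfMissing(bucket, file); // skip files already present from earlier checkpoints
    }
    useCheckpoint(nextSequence, segmentFiles); // the searcher's "use checkpoint" step
    nextSequence++;
  }
}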
Re: Synonym graph and multiple values
This relates to the "position increment gap" for your analyzer and is configurable.

If you check the JavaDoc for Analyzer#getPositionIncrementGap, it says:

* Invoked before indexing a IndexableField instance if terms have already been added to that
* field. This allows custom analyzers to place an automatic position increment gap between
* IndexbleField instances using the same field name. The default value position increment gap is
* 0. With a 0 position increment gap and the typical default token position increment of 1, all
* terms in a field, including across IndexableField instances, are in successive positions,
* allowing exact PhraseQuery matches, for instance, across IndexableField instance boundaries.

So, if you want "a b", "c d" to behave the same as "a b c d", you would use the default gap of 0. If you want them to behave differently, you can add a gap between successive values to prevent matching across them.

Essentially, the position increment gap adds some number of "holes" (empty positions) between values. So, if you add a gap of 10, then the terms for "a b", "c d" would be in the following positions, I believe:

0  a
1  b
12 c
13 d

Phrase matching works by checking if the term positions differ by the appropriate amount. If you have stop word removal, the above example might match the phrase "b the the the the the the the the the the c", because the "thes" (I write, as I'm currently wearing a t-shirt from the band "The The") would also map to empty positions.

Hope that helps,
Froh

On Tue, Mar 25, 2025 at 9:47 AM Kai Grossjohann wrote:
> Hi,
>
> I'd like to understand more about how multiple values of a field are
> handled. Consider a Lucene document with a field foo that has a single
> value “a b c d” versus another Lucene document where the field foo has
> two values, namely “a b” and “c d”.
>
> When using Synonym Graph (so that synonym phrases are supported), and
> supposing I have a synonym phrase “b c”...
>
> * I suppose the Lucene document with the single value “a b c d”
> matches this synonym phrase, but
> * does the other document match this phrase, as well?
>
> In a similar vein, how to phrase queries behave? If I query for the
> phrase “b c” will the two-value document match?
>
> Thanks,
> Kai
>
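For example, a custom Analyzer can set the gap like this (a minimal sketch; the tokenizer chain and the gap of 10 are just for illustration):

Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new StandardTokenizer();
    return new TokenStreamComponents(tokenizer, new LowerCaseFilter(tokenizer));
  }

  @Override
  public int getPositionIncrementGap(String fieldName) {
    return 10; // leave 10 empty positions between successive values of the same field
  }
};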
Re: Sub-Graphs in Hnsw
I'm wondering if this is the same idea that Kaival is proposing in https://github.com/apache/lucene/issues/14758 (Support multiple HNSW graphs backed by the same vectors). On Thu, Jun 5, 2025 at 11:32 AM Michael Sokolov wrote: > I do think there could be many interesting use cases for building > multiple graphs from a single set of vectors. For example, one might > want to sometimes search all the docs, sometimes search the one subset > and other times another subset; baking the constraint into the graph > construction would be lead to more efficient searches than the other > graph search filtering we can do today (pre- and post-filtering) and > there could be use cases where the constraints are so very often > present that we would want to pay the up-front cost of computing > multiple graphs without paying the cost of storing the same vectors > multiple times in the index. This isn't supported today but I think > would be a welcome contribution. > > On Wed, Jun 4, 2025 at 3:51 AM Ravikumar Govindarajan > wrote: > > > > > > > > I wonder if you could influence the graph search by incorporating the > > > partition key (customer id?) to the vectors somehow? If this was done > > > well it should lead to a natural clustering of the graph. > > > > > > > I can explore further on this. Thanks for the pointers.. > > > > On Mon, Jun 2, 2025 at 11:14 PM Michael Sokolov > wrote: > > > > > I wonder if you could influence the graph search by incorporating the > > > partition key (customer id?) to the vectors somehow? If this was done > > > well it should lead to a natural clustering of the graph. > > > > > > On Mon, Jun 2, 2025 at 11:32 AM Ravikumar Govindarajan > > > wrote: > > > > > > > > Hi Michael, > > > > > > > > The docs range could vary in extremes from few 10s to > tens-of-thousands > > > > and in very heavy usage cases, 100k and above… in a single segment > > > > > > > > Filtered Hnsw like you said uses a single graph.., which could be > better > > > if > > > > designed as sub-graphs > > > > > > > > On Mon, 2 Jun 2025 at 5:42 PM, Michael Sokolov > > > wrote: > > > > > > > > > How many documents do you anticipate in a typical sub range? If > it's > > > in the > > > > > hundreds or even low thousands you would be better off without > hnsw. > > > > > Instead you can use a function score query based on the vector > > > distance. > > > > > For larger numbers where hnsw becomes useful, you could try using > > > filtered > > > > > hnsw, but this will be using a single graph constructed from all > of the > > > > > documents. > > > > > > > > > > On Mon, Jun 2, 2025, 5:25 AM Ravikumar Govindarajan < > > > > > ravikumar.govindara...@gmail.com> wrote: > > > > > > > > > > > We use index-sorting to arrange segment data. The ord-ranges for > any > > > > > given > > > > > > KnnVectorField is mutually exclusive > > > > > > > > > > > > Ex: > > > > > > field: content > > > > > > > > > > > > OrdRange -> 0-100 (User1) > > > > > > OrdRange -> 101-300 (User2) > > > > > > and so on.. > > > > > > > > > > > > Each OrdRange has to be a self-contained Hnsw graph with all > > > neighbours > > > > > > strictly inside the given OrdRange. A sub-graph, to be precise.. > The > > > > > > generated segment will contain a lot of these sub-graphs but > without > > > any > > > > > > neighbour links to each other at Level-0. Level-1 and above can > have > > > > > > cross-links, which should be fine.. 
> > > > > > > > > > > > Searches will be based on OrdRange and should stop once the > > > sub-graph is > > > > > > fully explored and not cross over to other sub-graphs.. > > > > > > > > > > > > I can index them as different fields but it could run into a few > > > hundreds > > > > > > (if not thousands). > > > > > > > > > > > > Are there any strategies I can adopt to accomplish this? Can a > custom > > > > > > VectorScoringFunction solve this? (Like -> assign actual score, > if > > > ords > > > > > are > > > > > > in range. Assign 0, if out-of-range etc..) > > > > > > > > > > > > Is this the correct way of looking at the problem? > > > > > > > > > > > > Any help is much appreciated > > > > > > > > > > > > Regards, > > > > > > Ravi > > > > > > > > > > > > > > > > > - > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
Re: need help with JoinUitl.createJoinQuery() method
It looks like your pk_p and pk_c fields aren't indexed -- they just have doc values. If you try making them KeywordFields instead (so they're indexed and have doc values), does it work? Also, the join module may be overkill for what you're trying to do, since it looks like you're indexing parent/child blocks anyway. You could use ToParentBlockJoinQuery instead (though you'd need to index the child docs together with the parent doc using IndexWriter#addDocuments). Of course, maybe your real use-case is more complicated, so the join module makes sense. On Wed, Jul 23, 2025 at 12:02 PM Markos Zaharioudakis wrote: > Hi, > > I am trying to create and run a "join" query using JoinUitl.createJoinQuery() > in Lucene 10.2.0. However, the query returns 0 results. I am attaching my > little test program. Can you please tell me what I am doing wrong? > > Thanks a lot, > Markos. > > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org
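For reference, a rough sketch of the KeywordField fix plus the join call (field names follow your pk_p/pk_c convention; parentDoc, childDoc, and searcher are placeholders, and MatchAllDocsQuery stands in for whatever child-side query you actually run):

// Make the join keys both indexed and doc-valued by using KeywordField.
parentDoc.add(new KeywordField("pk_p", "parent-1", Field.Store.NO));
childDoc.add(new KeywordField("pk_c", "parent-1", Field.Store.NO));

// Join from child docs to parent docs (ScoreMode is org.apache.lucene.search.join.ScoreMode):
Query joinQuery = JoinUtil.createJoinQuery(
    "pk_c",                   // fromField, on the child docs
    false,                    // multipleValuesPerDocument
    "pk_p",                   // toField, on the parent docs
    new MatchAllDocsQuery(),  // fromQuery: which child docs to join from
    searcher,                 // IndexSearcher over the same index
    ScoreMode.None);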