Lucene Index Cloud Replication
Hi there,

I was talking with Varun at Berlin Buzzwords a couple of weeks ago about storing and retrieving Lucene indexes in S3, and realized that "uploading a Lucene directory to the cloud and downloading it on other machines" is a pretty common problem and one that's surprisingly easy to do poorly. In my current job, I'm on my third team that needed to do this.

In my experience, there are three main pieces that need to be implemented:

1. Uploading/downloading individual files (i.e. the blob store), which can be eventually consistent if you write once.
2. Describing the metadata for a specific commit point (basically what the Replicator module does with the "Revision" class). In particular, we want a downloader to reliably be able to know if they already have specific files (and don't need to download them again).
3. Sharing metadata with some degree of consistency, so that multiple writers don't clobber each other's metadata, and so readers can discover the metadata for the latest commit/revision and trust that they'll (eventually) be able to download the relevant files.

I'd like to share what I've got for 1 and 3, based on S3 and DynamoDB, but I'd like to do it with interfaces that lend themselves to other implementations for blob and metadata storage.

Is it worth opening a Jira issue for this? Is this something that would benefit the Lucene community?

Thanks,
Michael Froh
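To make the shape of those interfaces concrete, here is a minimal sketch; the names (BlobStore, CommitMetadata, MetadataStore) are hypothetical and only illustrate the abstraction, not an actual Lucene or AWS API.

// Hypothetical interfaces sketching pieces 1, 2, and 3; not part of Lucene.
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.Optional;

/** Piece 1: write-once storage for individual index files (e.g. backed by S3). */
interface BlobStore {
  void upload(String key, InputStream data, long length) throws IOException;
  InputStream download(String key) throws IOException;
}

/** Piece 2: describes one commit point as a map from index file name to blob key. */
interface CommitMetadata {
  long generation();
  Map<String, String> files(); // index file name -> blob key
}

/** Piece 3: consistent store (e.g. backed by DynamoDB) for publishing/discovering commits. */
interface MetadataStore {
  /** Publishes a commit; fails if another writer already published this generation. */
  void publish(CommitMetadata commit) throws IOException;
  /** Returns the latest published commit, if any. */
  Optional<CommitMetadata> latest() throws IOException;
}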
Re: PhraseQuery
Did you check the Javadoc for PhraseQuery.Builder? https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/search/PhraseQuery.Builder.html Checking the source code, I see that the add method that takes a position argument will throw an IllegalArgumentException if you try to add a Term in a lower position than the previous Term. (That is, Term positions must be non-decreasing.) Hope that helps, Michael On Fri, 24 Jan 2020 at 09:45, wrote: > Hi,- > > how do i enforce the order of sequence of terms in the PhraseQuery > builder? > Lucene docs are very hard to understand in terms of api descriptions. > > > https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/search/PhraseQuery.html > Best regards > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
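For reference, a small sketch of building an ordered phrase with explicit positions (the field and terms here are made up); adding a term at a lower position than the previous one is what triggers the IllegalArgumentException described above.

PhraseQuery.Builder builder = new PhraseQuery.Builder();
builder.add(new Term("body", "quick"), 0);
builder.add(new Term("body", "fox"), 2);   // positions must be non-decreasing
builder.setSlop(0);                        // require the terms in exactly these relative positions
PhraseQuery query = builder.build();
// builder.add(new Term("body", "brown"), 1); // would throw IllegalArgumentException: 1 < 2

If you just call add(Term) without a position, terms are appended at increasing positions, so the order of your add() calls is the order the phrase enforces.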
Re: Scoring Across Multiple Fields
Hi John,

A TermQuery produces a scorer that can compute similarity for a given term value against a given field, in the context of the index, so as you say, it produces a score for one field.

If you want to match a given term value across multiple fields, indeed you could use a BooleanQuery with the TermQueries in SHOULD clauses. The vanilla BooleanQuery produces a score which is the sum of all matching clauses' scores (or at least that's the interpretation I get from reading the source code of the explain() method in BooleanWeight).

You can also look into DisjunctionMaxQuery, which works like a disjunctive BooleanQuery, but it returns the maximum score across matching clauses. The idea here is that if, say, you're matching across title and body fields, a title match may score higher (perhaps because it's been boosted). If you sum the scores across fields, you're likely just inflating those title matches even more (since a title match is probably highly correlated with a body match). (The DisjunctionMaxQuery also has an optional "tieBreakerMultiplier" property that you can use to weight the scoring somewhere between pure max and pure sum -- like "Use the maximum score, plus 0.001 times the sum of the rest".)

Hope that helps,
Michael

On Mon, 27 Jan 2020 at 13:37, John Brown wrote:
> Hi,
>
> I have a question regarding how Lucene computes document similarities from
> field similarities.
>
> Lucene's scoring documentation mentions that scoring works on fields and
> combines the results to return documents. I'm assuming fields are given
> scores, and those scores are simply averaged to return the document score?
>
> If this is the case, then in order to incorporate multiple fields in my
> scoring, I would use multiple term queries that contain the same term, but
> target different fields, then I would simply put them in a boolean query,
> and search my index using this boolean query.
>
> Am I going about this in the correct way? Any clarification would be
> greatly appreciated.
>
> Thank you,
> John B
>
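As a concrete illustration, here is a minimal sketch of both options (the field names, term, and the 0.001 tie-breaker are just example values):

Term title = new Term("title", "lucene");
Term body = new Term("body", "lucene");

// Sum of matching clauses' scores:
Query summed = new BooleanQuery.Builder()
    .add(new TermQuery(title), BooleanClause.Occur.SHOULD)
    .add(new TermQuery(body), BooleanClause.Occur.SHOULD)
    .build();

// Max score across matching clauses, plus 0.001 times the other matching clauses' scores:
Query disMax = new DisjunctionMaxQuery(
    List.of(new TermQuery(title), new TermQuery(body)), 0.001f);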
Re: SingleTerm vs MultiTerm in PhraseWildCardQuery class in the sandbox Lucene
Hi Baris,

The idea with PhraseWildcardQuery is that you can mix literal "exact" terms with "MultiTerms" (i.e. any subclass of MultiTermQuery). Using addTerm is for exact terms, while addMultiTerm is for things that may match a number of possible terms in the given position.

If you want to search for term1 followed by any term that starts with a given character, I would suggest using:

int maxMultiTermExpansions = ...; // Discussed below
PhraseWildcardQuery.Builder builder = new PhraseWildcardQuery.Builder("field", maxMultiTermExpansions);
builder.addTerm(new BytesRef("term1")); // Add fixed term in position 0
builder.addMultiTerm(new PrefixQuery(new Term("field", "term2FirstChar"))); // Add multiterm in position 1
Query q = builder.build();

The PrefixQuery effectively gets expanded into a bunch of possible terms, based on the term dictionary on each index segment. To avoid expanding to cover too many terms (say, if you added a bunch of WildcardQuery), maxMultiTermExpansions serves as a guard rail, to put a rough bound on memory consumption and query execution time. If you're interested in details of how the maxMultiTermExpansions budget is distributed across MultiTerms, check out PhraseWildcardQuery.createWeight. If you're just running an experiment in your IDE, you could probably set maxMultiTermExpansions to Integer.MAX_VALUE. (If you're running in a production environment, it's likely a good idea to tune it down based on your memory/latency constraints.)

Incidentally, for tracking down the source code for anything in Lucene, it's probably better to go to GitHub for the most up-to-date source: https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java .

Hope that helps,
Michael

On Thu, 13 Feb 2020 at 12:29, wrote:
> Hi,-
>
> i hope everyone is doing great.
>
> if i want to do the following search with PhraseWildCardQuery and
> thanks to this forum for letting me know about this class (Especially to
> David and Bruno)
>
> term1 term2FirstChar*
>
> i need to do two ways: (i found the source code at
>
> https://fossies.org/linux/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java
> )
>
> /*
>
> maxMultiTermExpansions - The maximum number of expansions across all
> multi-terms and across all segments. It counts expansions for each
> segments individually, that allows optimizations per segment and unused
> expansions are credited to next segments. This is different from
> MultiPhraseQuery and SpanMultiTermQueryWrapper which have an expansion
> limit per multi-term.
>
> segmentOptimizationEnabled - Whether to enable the segment optimization
> which consists in ignoring a segment for further analysis as soon as a
> term is not present inside it. This optimizes the query execution
> performance but changes the scoring. The result ranking is preserved.
>
> */
>
>
> 1st way:
>
> PhraseWildCardQuery.Builder builder = PharseWildCardQuery.Builder(field,
> 2 _*/<<< i dont know what number to use here for
> maxMultiTermExpansions>>>/*_, true/*boolean segmentOptimizationEnabled*/)
>
> pwcqBuilder.addTerm(field, new Term(field, "term1"));
>
> pwcqBuilder.addTerm(field,new Term(field, "term2FirstChar"));
>
> PhraseWildCardQuery pwcq = pwcqBuilder.build();
>
> or
>
> 2nd way:
>
> pwcqBuilder.addMultiTerm(MultiTermQuery object here contaning {field,
> "term1"} and {field ,"term2FirstChar"});
>
> PhraseWildCardQuery pwcq = pwcqBuilder.build();
>
>
> Then this pwcq object will be fed into IndexSearcher's as the query
> parameter.
> > > Now, it looks like the first way will not consider expansions or in > other words wildcard? Am i right? > > i also need to understand this maxMultiTermExpansions parameter better. > For instance if first way is used, will maxMultiTermExpansions be > meaningful? > > > Thanks > >
Re: SingleTerm vs MultiTerm in PhraseWildCardQuery class in the sandbox Lucene
In your example, it looks like you wanted the second term to match based on the first character, or prefix, of the term. While you could use a WildcardQuery with a term value of "term2FirstChar*", PrefixQuery seemed like the simpler approach. WildcardQuery can handle more general cases, like if you want to match on something like "a*b*c".

Technically, the PrefixQuery compiles down to a slightly simpler automaton, but I only figured that out by writing a simple unit test:

public void testAutomata() {
  Automaton prefixAutomaton = PrefixQuery.toAutomaton(new BytesRef("a"));
  Automaton wildcardAutomaton = WildcardQuery.toAutomaton(new Term("foo", "a*"));
  System.out.println("PrefixQuery(\"a\")");
  System.out.println(prefixAutomaton.toDot());
  System.out.println("WildcardQuery(\"a*\")");
  System.out.println(wildcardAutomaton.toDot());
}

That produces the following output:

PrefixQuery("a")
digraph Automaton {
  rankdir = LR
  node [width=0.2, height=0.2, fontsize=8]
  initial [shape=plaintext,label=""]
  initial -> 0
  0 [shape=circle,label="0"]
  0 -> 1 [label="a"]
  1 [shape=doublecircle,label="1"]
  1 -> 1 [label="\\U-\\U00ff"]
}

WildcardQuery("a*")
digraph Automaton {
  rankdir = LR
  node [width=0.2, height=0.2, fontsize=8]
  initial [shape=plaintext,label=""]
  initial -> 0
  0 [shape=circle,label="0"]
  0 -> 1 [label="a"]
  1 [shape=doublecircle,label="1"]
  1 -> 2 [label="\\U-\\U0010"]
  2 [shape=doublecircle,label="2"]
  2 -> 2 [label="\\U-\\U0010"]
}

On Tue, 18 Feb 2020 at 13:52, wrote:
> Michael and Forum,-
> Thanks for thegreat explanations.
>
> one question please:
>
> why is PrefixQuery used instead of WildCardQuery in the below snippet?
>
> Best regards
>
> > On Feb 17, 2020, at 3:01 PM, Michael Froh wrote:
> >
> > Hi Baris,
> >
> > The idea with PhraseWildcardQuery is that you can mix literal "exact"
> terms
> > with "MultiTerms" (i.e. any subclass of MultiTermQuery). Using addTerm is
> > for exact terms, while addMultiTerm is for things that may match a number
> > of possible terms in the given position.
> >
> > If you want to search for term1 followed by any term that starts with a
> > given character, I would suggest using:
> >
> > int maxMultiTermExpansions = ...; // Discussed below
> > PhraseWildCardQuery.Builder builder = new PhraseWildcardQuery("field",
> > maxMultiTermExpansions);
> > builder.addTerm(new BytesRef("term1")); // Add fixed term in position 0
> > builder.addMultiTerm(new PrefixQuery(new Term("field",
> "term2FirstChar")));
> > // Add multiterm in position 1
> > Query q = builder.build();
> >
> > The PrefixQuery effectively gets expanded into a bunch of possible terms,
> > based on the term dictionary on each index segment. To avoid expanding to
> > cover too many terms (say, if you added a bunch of WildcardQuery),
> > maxMultiTermExpansions serves as a guard rail, to put a rough bound on
> > memory consumption and query execution time. If you're interested in
> > details of how the maxMultiTermExpansions budget is distributed across
> > MultiTerms, check out PhraseWildcardQuery.createWeight. If you're just
> > running an experiment in your IDE, you could probably set
> > maxMultiTermExpansions to Integer.MAX_VALUE. (If you're running in a
> > production environment, it's likely a good idea to tune it down based on
> > your memory/latency constraints.)
> > > > Incidentally, for tracking down the source code for anything in Lucene, > > it's probably better to go to GitHub for the most up-to-date source: > > > https://urldefense.com/v3/__https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java__;!!GqivPVa7Brio!ONqQgLIltNBUuSo5Cn_Fz7-wuR1LQv68YS_z-6g7X-S86PHQtT9tKl7VbIq9tVLYyw$ > > . > > > > Hope that helps, > > Michael > > > >> On Thu, 13 Feb 2020 at 12:29, wrote: > >> > >> Hi,- > >> > >> i hope everyone is doing great. > >> > >> if i want to do the following search with PhraseWildCardQuery and > >> thanks to this forum for letting me know about this class (Especially to > >> David and Bruno) > >> > >> term1 term2FirstChar* > >> > >> i need to do two ways: (i found the source code at > >
Re: What is the Lucene 8.4.1 equivalent for StandardAnalyzer.STOP_WORDS_SET
Those words ( https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.3.1/lucene/core/src/java/org/apache/lucene/analysis/standard/StandardAnalyzer.java#L44-L49) have been moved to EnglishAnalyzer ( https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.4.1/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishAnalyzer.java#L47-L51 ). On Mon, 24 Feb 2020 at 15:56, wrote: > Hi,- > > I hope everyone is doing great. > > What is the Lucene 8.4.1 equivalent for StandardAnalyzer.STOP_WORDS_SET? > > > https://lucene.apache.org/core/7_3_1/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html#STOP_WORDS_SET > > > https://lucene.apache.org/core/8_4_1/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html > > Best regards > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
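If you just need the set itself, a minimal sketch of the 8.x replacement (using the constant linked above):

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

CharArraySet stopWords = EnglishAnalyzer.ENGLISH_STOP_WORDS_SET;
// StandardAnalyzer no longer removes stop words by default in 8.x;
// pass the set explicitly if you still want the old behavior:
StandardAnalyzer analyzer = new StandardAnalyzer(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);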
Re: How can I boost score of a document if two consecutive terms match
Hi John,

What you're looking for sounds like Solr's pf2 parameter (see https://lucene.apache.org/solr/guide/8_6/the-extended-dismax-query-parser.html#extended-dismax-parameters and https://lucene.apache.org/solr/guide/8_6/the-dismax-query-parser.html#pf-phrase-fields-parameter for details). Basically, behind the scenes, it takes successive pairs of terms, and treats them as boosted phrase query clauses. So, a query like "t1 t2 t3" with a pf2 boost of 5 would become roughly:

t1 OR t2 OR t3 OR "t1 t2"^5 OR "t2 t3"^5

Alternatively, since it sounds like you want to boost matches where two consecutive words are both present in the same document, rather than requiring that they're present in order, you could parse the query to:

t1 OR t2 OR t3 OR (t1 AND t2)^5 OR (t2 AND t3)^5

Are you using a QueryParser implementation or are you just running the query string through an Analyzer and producing your own BooleanQuery? If the latter, you could directly produce the second query (wrapping the nested AND queries in a BoostQuery).

Would that do what you want?

Michael

On Fri, Oct 30, 2020 at 2:15 PM YAN PAN wrote:
> Hi there,
> I recently am developing my own search based on lucene, here is the use
> case I am concerned about.
>
> we have two documents in the index
> a) content:new jersey
> b) content:new year
>
> the query is "he is celebrating the new year in jersey city".
>
>
> If I tokenize the queries and add all terms to a boolean query, the
> document will have the same score for the two queries, but what I want is
> that b scores higher than a, what similarity should I use, or how can I
> tweak the internal of Lucene to achieve the goal?
>
> Please note that I cannot extract the phrase "new year" at compile time, so
> it seems to me that PhraseQuery is not an approach.
>
> Thank you very much for the help!
> John
>
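If you're building the BooleanQuery yourself, here is a rough sketch of the second form (the field name, terms, and boost of 5 are placeholders):

// Sketch: boost documents where consecutive query terms co-occur.
String field = "content";
String[] terms = {"t1", "t2", "t3"};
float pairBoost = 5f;

BooleanQuery.Builder top = new BooleanQuery.Builder();
for (String t : terms) {
  top.add(new TermQuery(new Term(field, t)), BooleanClause.Occur.SHOULD);
}
for (int i = 0; i + 1 < terms.length; i++) {
  BooleanQuery.Builder pair = new BooleanQuery.Builder();
  pair.add(new TermQuery(new Term(field, terms[i])), BooleanClause.Occur.MUST);
  pair.add(new TermQuery(new Term(field, terms[i + 1])), BooleanClause.Occur.MUST);
  top.add(new BoostQuery(pair.build(), pairBoost), BooleanClause.Occur.SHOULD);
}
Query query = top.build();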
Re: DisjunctionMinQuery
Hi Marc, Can you clarify what the semantics of a DisjunctionMinQuery would be? Would you keep the score for the *lowest* scoring disjunct (plus some tiebreaker applied to the other matching disjuncts)? I'm trying to imagine how that would work compared to the classic DisMax use-case. Say I'm searching for "dalmatian" using a DisMax query over term queries against title and body. A match on title is probably going to score higher than a match against the body, just because the title has a shorter length (and the doc frequency of individual terms in the title is likely to be lower, since there are fewer terms overall). With DisMax, a match on title alone will score higher than a match on body, and the tie-break will tend to score a match on title and body higher than a match on title alone. With a DisMin (assuming you keep the lowest score), then a match on title and body would probably score lower than a match on title alone. That feels weird to me, but I might be missing the use-case. How would you use a DisMinQuery? Thanks, Froh On Wed, Nov 8, 2023 at 10:50 AM Marc D'Mello wrote: > Hi all, > > I noticed we have a DisjunctionMaxQuery > < > https://github.com/apache/lucene/blob/branch_9_7/lucene/core/src/java/org/apache/lucene/search/DisjunctionMaxQuery.java > > > but > not a corresponding DisjunctionMinQuery. I was just wondering if there was > a specific reason for that? Or is it just that it is not a common query to > use? > > Thanks! > Marc >
Re: get distinct values from indexreader for given field
Hello!

Instead of MultiFields.getFields(), you can use MultiTerms.getTerms(reader, fieldname) to get the Terms instance.

To decode your long / int values, you should be able to use LongPoint/IntPoint.unpack to write the values into an array:

long[] val = new long[1]; // Assuming 1-D values
LongPoint.unpack(value, 0, val);
values.add(val[0]);

Hope that helps,
Froh

On Wed, Nov 22, 2023 at 11:09 AM wrote:
> Hello,
>
> In Lucene 6 I was doing this to get all values for a given field
> knowing its type:
>
> public List getDistinctValues(IndexReader reader, String fieldname,
> Class type) throws IOException {
>
> List values = new ArrayList();
> Fields fields = MultiFields.getFields(reader);
> if (fields == null) return values;
>
> Terms terms = fields.terms(fieldname);
> if (terms == null) return values;
>
> TermsEnum iterator = terms.iterator();
>
> BytesRef value = iterator.next();
>
> while (value != null) {
> if (type == Long.class) {
> values.add(LegacyNumericUtils.prefixCodedToLong(value));
> } else if (type == Integer.class) {
> values.add(LegacyNumericUtils.prefixCodedToInt(value));
> } else if (type == Boolean.class) {
> values.add(LegacyNumericUtils.prefixCodedToInt(value) == 1 ?
> TRUE : FALSE);
> } else if (type == Date.class) {
> values.add(new
> Date(LegacyNumericUtils.prefixCodedToLong(value)));
> } else if (type == String.class) {
> values.add(value.utf8ToString());
> } else {
> // ...
> }
>
> value = iterator.next();
> }
>
> return values;
> }
>
> I am trying to upgrade to lucene 9.
> there were 2 changes over time:
> - LegacyNumericUtils has been removed in favor of PointBase
> - MultiFields.getFields() has been dropped, and I read we were encouraged
> to avoid fields in general
>
> what is proper way to implement getting distinct values for a specific
> field in a reader?
>
> thanks for your help,
>
> vs
>
Re: get distinct values from indexreader for given field
Oh -- of course if you're using IntPoint / LongPoint for your numeric fields, they won't be indexed as terms, so loading terms for them won't work.

It's not the prettiest solution, but I think the following should let you collect the set of distinct point values for an IntPoint field:

final Set<Integer> collectedValues = new TreeSet<>();
for (LeafReaderContext lrc : reader.leaves()) {
  LeafReader lr = lrc.reader();
  PointValues.IntersectVisitor collectingVisitor = new PointValues.IntersectVisitor() {
    @Override
    public void visit(int docID) throws IOException {
    }

    @Override
    public void visit(int docID, byte[] packedValue) {
      collectedValues.add(IntPoint.decodeDimension(packedValue, 0));
    }

    @Override
    public PointValues.Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
      return PointValues.Relation.CELL_CROSSES_QUERY;
    }
  };
  lr.getPointValues(fieldname).intersect(collectingVisitor);
}

On Tue, Nov 28, 2023 at 1:42 PM Michael Froh wrote:
> Hello!
>
> Instead of MultiFields.getFields(), you can use
> MultiTerms.getTerms(reader, fieldname) to get the Terms instance.
>
> To decode your long / int values, you should be able to use
> LongPoint/IntPoint.unpack to write the values into an array:
>
> long[] val = new long[1]; // Assuming 1-D values
> LongPoint.unpack(value, 0, val);
> values.add(val[0]);
>
> Hope that helps,
> Froh
>
>
> On Wed, Nov 22, 2023 at 11:09 AM wrote:
>
>> Hello,
>>
>> In Lucene 6 I was doing this to get all values for a given field
>> knowing its type:
>>
>> public List getDistinctValues(IndexReader reader, String
>> fieldname,
>> Class type) throws IOException {
>>
>> List values = new ArrayList();
>> Fields fields = MultiFields.getFields(reader);
>> if (fields == null) return values;
>>
>> Terms terms = fields.terms(fieldname);
>> if (terms == null) return values;
>>
>> TermsEnum iterator = terms.iterator();
>>
>> BytesRef value = iterator.next();
>>
>> while (value != null) {
>> if (type == Long.class) {
>> values.add(LegacyNumericUtils.prefixCodedToLong(value));
>> } else if (type == Integer.class) {
>> values.add(LegacyNumericUtils.prefixCodedToInt(value));
>> } else if (type == Boolean.class) {
>> values.add(LegacyNumericUtils.prefixCodedToInt(value) == 1 ?
>> TRUE : FALSE);
>> } else if (type == Date.class) {
>> values.add(new
>> Date(LegacyNumericUtils.prefixCodedToLong(value)));
>> } else if (type == String.class) {
>> values.add(value.utf8ToString());
>> } else {
>> // ...
>> }
>>
>> value = iterator.next();
>> }
>>
>> return values;
>> }
>>
>> I am trying to upgrade to lucene 9.
>> there were 2 changes over time:
>> - LegacyNumericUtils has been removed in favor of PointBase
>> - MultiFields.getFields() has been dropped, and I read we were encouraged
>> to avoid fields in general
>>
>> what is proper way to implement getting distinct values for a specific
>> field in a reader?
>>
>> thanks for your help,
>>
>> vs
>>
Re: Updating document with IndexWriter#updateDocument doesn't seem to take effect
Hi Wojtek, Thank you for linking to your test code! When you open an IndexReader, it is locked to the view of the Lucene directory at the time that it's opened. If you make changes, you'll need to open a new IndexReader before those changes are visible. I see that you tried creating a new IndexSearcher, but unfortunately that's not sufficient. Hope that helps! Froh On Fri, Aug 9, 2024 at 3:25 PM Wojtek wrote: > Hi all! > > There is an effort in Apache James to update to a more modern version of > Lucene (ref: > https://github.com/apache/james-project/pull/2342). I'm digging into the > issue as other have done > but I'm stumped - it seems that > `org.apache.lucene.index.IndexWriter#updateDocument` doesn't update > the document. > > Documentation > ( > https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/index/IndexWriter.html#updateDocument(org.apache.lucene.index.Term,java.lang.Iterable)) > > states: > > > Updates a document by first deleting the document(s) containing term > and then adding the new > document. The delete and then add are atomic as seen by a reader on the > same index (flush may happen > only after the add). > > Here is a simple test with it: > > https://github.com/woj-tek/lucene-update-test/blob/master/src/test/java/se/unir/AppTest.java > > but it fails. > > Any guidance would be appreciated because I (and others) have been hitting > wall with it :) > > -- > Wojtek > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
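As a rough illustration of the refresh step (a sketch only; 'directory' and 'writer' are assumed to exist in the test):

DirectoryReader reader = DirectoryReader.open(directory);
// ... IndexWriter#updateDocument + commit happens here ...
DirectoryReader newReader = DirectoryReader.openIfChanged(reader); // null if nothing changed
if (newReader != null) {
  reader.close();
  reader = newReader;
}
IndexSearcher searcher = new IndexSearcher(reader); // search over the refreshed view

// Or, for near-real-time visibility without waiting for a commit:
DirectoryReader nrtReader = DirectoryReader.open(writer);

In production code, SearcherManager wraps this refresh-and-swap logic (and the reference counting) for you.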
Re: Learning resources for Lucene Development
Hi Marc, In some shameless self-promotion, I've written up some worked Lucene examples (maybe a little more focused on Lucene internals than best practices) over at https://github.com/msfroh/lucene-university. If you have anything you'd like to understand better, feel free to open issues there and I'm happy to write up examples. (I've started getting a few other contributors, which is pretty cool.) I also used to run a weekly Lucene study group on Zoom (with a bit of a focus on OpenSearch, but still mostly aimed at Lucene). The last occurrence was in August (https://www.meetup.com/opensearch/events/302874684/). I'd love to start that back up and switch to more of a question and answer / tutorial format. We could record them and stick them up on YouTube, if that would be helpful. Thanks, Froh On Tue, Oct 8, 2024 at 7:46 PM Navneet Verma wrote: > +1 on the question. > > On Tue, Oct 8, 2024 at 6:35 PM Marc Davenport > wrote: > > > Hello, > > I had this question buried in a previous email. I feel like I have a > very > > loose grasp on the Lucene API and how to properly implement with it. I'm > > working on code that I didn't write myself from the ground up. Since I'm > > learning as I'm reading it, I can only assume things were done right. As > I > > look to improve and change our code, I don't know how to distinguish > what's > > good and bad practice. We leverage some features that don't seem to be > > rudamentary. I'm looking for any in person learning opportunities, > > training, workshops, etc that really dives into how lucene works to give > me > > a better handle on what we have vs what we could. I'm open to personal > > direct education if that's an option. Any professional development > > opportunities, consultant groups with a good education reputation, etc > > would be welcome. > > Thank you, > > Marc > > >
Re: Understanding Document ID (Lucene 10.0.0)
Hi Prashant, For your particular use-case, you probably don't need to join across multiple indices. Lucene is able to maintain multiple data structures per field, with the selection of data structures coming from attributes of the field's type. If you have a field that you want to return, but doesn't need to be searchable (like your HTML report), you can add it as an unindexed string field that's stored. That will write it to the stored fields data structure (which is used to populate search results), but won't build a full-text index for it. The slight downside of that approach is that all stored fields for a document are compressed and written together. If users mostly just want the name, age, and city fields (and only rarely care about the report field), then maybe storing it in a separate index might make sense. In that case, adding an ID keyword field to both indices is a viable option. Doing a term query on the secondary index to find the appropriate docs should generally be quite fast -- while Lucene is not primarily a key-value store, it works surprisingly well as one. Hope that helps, Froh On Fri, Oct 25, 2024 at 8:28 AM Prashant Saxena wrote: > I'm new to Lucene and trying to understand the concept of unique document > id, something like a primary key in databases like sql or sqlite etc. > While searching, I came across this article: > https://blog.mikemccandless.com/2014/05/choosing-which actually > fast-unique-identifier-uuid.html > < > https://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html > > > which actually is quite old but it's been said elsewhere, it's still > applicable on the latest version as internally things > are not changed much in Lucene. > > What have I done so far? > > I have created a simple index where few text files are written as > documents, when I open this index in Luke (GUI), on the *Document* tab, > I see an option along with a spinner control: > > *Browse document by Doc # | 0| in 100 docs* > > If you change the value, it shows the document at the bottom of the GUI. > The id seems to be a number in which documents are stored. > > *Question:* > How can one access this id? > > Why do I need a unique id? > > Let's assume I have created a simple index with three fields: name, age & > city. There is another field, the associated long html text, > which I am writing in another index. In a GUI environment where users can > search by typing the search term in four of the fields. > > name | | > age| | > city| | > report || > Usually, people are interested in the first three fields. Report field is > not used as much but still available if somebody is interested. > > *Option 1* > When the user is searching only using name, age, city, I'll open the first > index, do the search, get the documents and their ids, get the report field > directly from the second index using the id. This way > no searching is required in the second index. > > *Option 2* > I have recently started learning Lucene and right now I haven't touch the > joining part but still here is the question > If a user has given a search term in all of the four fields then logically > you have to search in both the indexes and find the common doc ids in both > searched results. > > *Question* > How will this joining happen, to get the correct results from both the > indexes? If possible please refer to some online code example links. > > Prashant >
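For the single-index option, here is a minimal sketch of the field setup described above (field names mirror your example; 'writer' and 'reportHtml' are assumed to exist, and these field types are one reasonable choice, not the only one):

Document doc = new Document();
doc.add(new StringField("id", "user-42", Field.Store.YES));   // or KeywordField on recent Lucene versions
doc.add(new TextField("name", "Jane Doe", Field.Store.YES));
doc.add(new IntPoint("age", 34));                             // indexed for range/exact queries
doc.add(new StoredField("age", 34));                          // stored copy so it comes back in results
doc.add(new TextField("city", "Pune", Field.Store.YES));
doc.add(new StoredField("report", reportHtml));               // stored only: returned with hits, not searchable
writer.addDocument(doc);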
Re: NRT segment replication in AWS
On Sun, Mar 2, 2025 at 7:21 AM Marc Davenport wrote:
>
> @Michael - That second simpler architecture is very similar to what we are
> considering; With the exception of a queue for announcing new
> segments rather than a polling process. It is good to know that it's a
> reasonable outline. You were very latency sensitive. Is there anything
> you can share around the most important specs of your containers, pods, or
> even nodes? While we run on these EC2 instances, there is a push to get us
> into k8s. Did you have any issues as you migrated from one version of
> lucene to another. I'm concerned that our current deployments only allow
> one version of the software in production at any one time.
>

From what I remember, we did see a bit of noisy-neighbor behavior running two searchers in separate containers on the same instance. I don't think we ever found a concrete reason (at least while I was there). We gave up and ended up using smaller instances instead of putting multiple searchers on the same instance.

For migration between different Lucene versions (or even versions of our software, which might change indexing behavior), we used the process described in https://careersatdoordash.com/blog/introducing-doordashs-in-house-search-engine/ in the section "Tenant Isolation and Search Stacks". Essentially, bring up a new indexer fleet, build a new index, then start bringing up new searchers that replicate from the new indexers.
Re: NRT segment replication in AWS
Hi there,

I'm happy to share some details about how Amazon Product Search does its segment replication. I haven't worked on Product Search in over three years, so anything that I remember is not particularly novel. Also, it's not really secret sauce -- I would have happily talked about it more in the 2021 re:Invent talk that Mike Sokolov and I did, but we were trying to keep within our time limit. :)

That model doesn't exactly have direct communication between primary and replica (which is generally a good practice in a cloud-based solution -- the fewer node-to-node dependencies, the better). The flow (if I recall correctly) is driven by a couple of side-car components for the writers and searchers and is roughly like this:

1. At a specified (pretty coarse) interval, the writer side-car calls a "create checkpoint" API on the writer to ask it to write a checkpoint.
2. The writer uploads new segment files to S3, and a metadata object describing the checkpoint contents (which probably includes segments from an earlier checkpoint, since they can be reused).
3. The writer returns the S3 URL for the metadata object to its side-car.
4. The writer side-car publishes the metadata URL to "something" -- see below for details.
5. The searcher side-cars all read the metadata URL from "something" -- see below for details.
6. The searcher side-cars each call a "use checkpoint" API on their local searchers.
7. The searchers each download the new segment files from S3 and open new IndexSearchers.

For the details of steps 4 and 5, I don't actually remember how it worked, but I have two pretty good guesses from what I remember of the overall architecture:

1. DynamoDB: This is the more likely mechanism. Each index shard has a unique ID which serves as a partition key in DynamoDB and there's a sequence number as a sort key. The writer side-car inserts a DynamoDB record with the next sequence number and the metadata URL. The searcher side-car periodically fetches 1 record with the partition key by descending sequence number (i.e. get latest sequence entry for the partition key). If the sequence number has increased, then call the searcher's use-checkpoint API.
2. Kinesis: This feels like the less likely mechanism, but I guess it could work. The writer side-car writes the metadata URL to a Kinesis stream. Each searcher side-car reads from the Kinesis stream and passes the metadata URL to the searcher. I'm pretty sure we didn't have one Kinesis stream per index shard, because managing (and paying for) that many Kinesis streams would be a pain. Even with a sharded Kinesis stream, you'd "leak" some checkpoints across index shards, leading to data that the searcher side-cars would throw away. Also, each Kinesis stream shard has a limited number of concurrent consumers, which would mean that the number of search replicas would be limited.

I'm *pretty sure* we used the DynamoDB approach.

Another Lucene-based search system that I worked on many years ago at Amazon had a much simpler architecture:

1. Writer periodically writes new segments to S3.
2. After writing the new segments, the writer writes a metadata object to S3 with a path like "<writer prefix>/<sequence number>/metadata.json". Because the writer was guaranteed to be the *only* thing writing with prefixes of <writer prefix>, it could manage its own dense sequence numbers.
3. A searcher is "sticky" to a writer, and periodically issues an S3 GetObject for the next metadata object's full URL (i.e. the URL using the next dense sequence number). Until the next checkpoint is written, it gets a 404 response.
4. Searcher fetches the files referenced by the metadata file.

A nice thing about that approach was that it only depended on S3 and only used PutObject and GetObject APIs, which tend to be more consistent. The downside was that we needed a separate mechanism for writer discovery and failover, to let searchers know the correct writer prefix.

Hope that helps! Let me know if you need any other suggestions.

Thanks,
Froh

On Wed, Feb 26, 2025 at 3:31 PM Steven Schlansker < stevenschlans...@gmail.com> wrote:
>
>
> > On Feb 26, 2025, at 2:53 PM, Marc Davenport
> > wrote:
> >
> > Hello,
> > Our current search solution is a pretty big monolith running on pretty
> > beefy EC2 instances. Every node is responsible for indexing and serving
> > queries. We want to start decomposing our service and are starting with
> > separating the indexing and query handling responsibilities.
>
> We run a probably comparatively small but otherwise similar installation,
> using
> Google Kubernetes instances. We just use a persistent disk instead of an
> elastic store, but
> also would consider using something like S3 in the future.
>
> > I'm in the research phases now trying to collect any prior art I can. The
> > rough sketch is to implement the NRT two replication node classes on
> their
> > respective services and use S3 as a distribution point for the segment
> > files. I'm still debating if there should be some direct knowledge o
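To make the simpler S3-only design a bit more concrete, here is a rough sketch of the searcher-side polling loop; fetchMetadata, downloadIfMissing, and useCheckpoint are made-up helper names (not real S3 or Lucene APIs), and the 10-second poll interval is arbitrary.

// Hypothetical searcher loop for the S3-only architecture sketched above.
static void pollCheckpoints(String bucket, String writerPrefix, long startSequence)
    throws Exception {
  long nextSequence = startSequence;
  while (true) {
    String key = writerPrefix + "/" + nextSequence + "/metadata.json";
    List<String> segmentFiles = fetchMetadata(bucket, key); // S3 GetObject; null on 404
    if (segmentFiles == null) {
      Thread.sleep(10_000); // next checkpoint not published yet; poll again later
      continue;
    }
    for (String file : segmentFiles) {
      downloadIfMissing(bucket, file); // skip files already present from earlier checkpoints
    }
    useCheckpoint(nextSequence, segmentFiles); // the searcher's "use checkpoint" step
    nextSequence++;
  }
}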
Re: Synonym graph and multiple values
This relates to the "position increment gap" for your analyzer and is configurable.

If you check the JavaDoc for Analyzer#getPositionIncrementGap, it says:

* Invoked before indexing a IndexableField instance if terms have already been added to that
* field. This allows custom analyzers to place an automatic position increment gap between
* IndexbleField instances using the same field name. The default value position increment gap is
* 0. With a 0 position increment gap and the typical default token position increment of 1, all
* terms in a field, including across IndexableField instances, are in successive positions,
* allowing exact PhraseQuery matches, for instance, across IndexableField instance boundaries.

So, if you want "a b", "c d" to behave the same as "a b c d", you would use the default gap of 0. If you want them to behave differently, you can add a gap between successive values to prevent matching across them.

Essentially, the position increment gap adds some number of "holes" (empty positions) between values. So, if you add a gap of 10, then the terms for "a b", "c d" would be in the following positions, I believe:

0  a
1  b
12 c
13 d

Phrase matching works by checking if the term positions differ by the appropriate amount. If you have stop word removal, the above example might match the phrase "b the the the the the the the the the the c", because the "thes" (I write, as I'm currently wearing a t-shirt from the band "The The") would also map to empty positions.

Hope that helps,
Froh

On Tue, Mar 25, 2025 at 9:47 AM Kai Grossjohann wrote:
> Hi,
>
> I'd like to understand more about how multiple values of a field are
> handled. Consider a Lucene document with a field foo that has a single
> value “a b c d” versus another Lucene document where the field foo has
> two values, namely “a b” and “c d”.
>
> When using Synonym Graph (so that synonym phrases are supported), and
> supposing I have a synonym phrase “b c”...
>
> * I suppose the Lucene document with the single value “a b c d”
> matches this synonym phrase, but
> * does the other document match this phrase, as well?
>
> In a similar vein, how to phrase queries behave? If I query for the
> phrase “b c” will the two-value document match?
>
> Thanks,
> Kai
>
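For example, a custom Analyzer can set the gap like this (a minimal sketch; the tokenizer chain and the gap of 10 are just for illustration):

Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new StandardTokenizer();
    return new TokenStreamComponents(tokenizer, new LowerCaseFilter(tokenizer));
  }

  @Override
  public int getPositionIncrementGap(String fieldName) {
    return 10; // leave 10 empty positions between successive values of the same field
  }
};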
Re: Sub-Graphs in Hnsw
I'm wondering if this is the same idea that Kaival is proposing in https://github.com/apache/lucene/issues/14758 (Support multiple HNSW graphs backed by the same vectors). On Thu, Jun 5, 2025 at 11:32 AM Michael Sokolov wrote: > I do think there could be many interesting use cases for building > multiple graphs from a single set of vectors. For example, one might > want to sometimes search all the docs, sometimes search the one subset > and other times another subset; baking the constraint into the graph > construction would be lead to more efficient searches than the other > graph search filtering we can do today (pre- and post-filtering) and > there could be use cases where the constraints are so very often > present that we would want to pay the up-front cost of computing > multiple graphs without paying the cost of storing the same vectors > multiple times in the index. This isn't supported today but I think > would be a welcome contribution. > > On Wed, Jun 4, 2025 at 3:51 AM Ravikumar Govindarajan > wrote: > > > > > > > > I wonder if you could influence the graph search by incorporating the > > > partition key (customer id?) to the vectors somehow? If this was done > > > well it should lead to a natural clustering of the graph. > > > > > > > I can explore further on this. Thanks for the pointers.. > > > > On Mon, Jun 2, 2025 at 11:14 PM Michael Sokolov > wrote: > > > > > I wonder if you could influence the graph search by incorporating the > > > partition key (customer id?) to the vectors somehow? If this was done > > > well it should lead to a natural clustering of the graph. > > > > > > On Mon, Jun 2, 2025 at 11:32 AM Ravikumar Govindarajan > > > wrote: > > > > > > > > Hi Michael, > > > > > > > > The docs range could vary in extremes from few 10s to > tens-of-thousands > > > > and in very heavy usage cases, 100k and above… in a single segment > > > > > > > > Filtered Hnsw like you said uses a single graph.., which could be > better > > > if > > > > designed as sub-graphs > > > > > > > > On Mon, 2 Jun 2025 at 5:42 PM, Michael Sokolov > > > wrote: > > > > > > > > > How many documents do you anticipate in a typical sub range? If > it's > > > in the > > > > > hundreds or even low thousands you would be better off without > hnsw. > > > > > Instead you can use a function score query based on the vector > > > distance. > > > > > For larger numbers where hnsw becomes useful, you could try using > > > filtered > > > > > hnsw, but this will be using a single graph constructed from all > of the > > > > > documents. > > > > > > > > > > On Mon, Jun 2, 2025, 5:25 AM Ravikumar Govindarajan < > > > > > ravikumar.govindara...@gmail.com> wrote: > > > > > > > > > > > We use index-sorting to arrange segment data. The ord-ranges for > any > > > > > given > > > > > > KnnVectorField is mutually exclusive > > > > > > > > > > > > Ex: > > > > > > field: content > > > > > > > > > > > > OrdRange -> 0-100 (User1) > > > > > > OrdRange -> 101-300 (User2) > > > > > > and so on.. > > > > > > > > > > > > Each OrdRange has to be a self-contained Hnsw graph with all > > > neighbours > > > > > > strictly inside the given OrdRange. A sub-graph, to be precise.. > The > > > > > > generated segment will contain a lot of these sub-graphs but > without > > > any > > > > > > neighbour links to each other at Level-0. Level-1 and above can > have > > > > > > cross-links, which should be fine.. 
> > > > > > > > > > > > Searches will be based on OrdRange and should stop once the > > > sub-graph is > > > > > > fully explored and not cross over to other sub-graphs.. > > > > > > > > > > > > I can index them as different fields but it could run into a few > > > hundreds > > > > > > (if not thousands). > > > > > > > > > > > > Are there any strategies I can adopt to accomplish this? Can a > custom > > > > > > VectorScoringFunction solve this? (Like -> assign actual score, > if > > > ords > > > > > are > > > > > > in range. Assign 0, if out-of-range etc..) > > > > > > > > > > > > Is this the correct way of looking at the problem? > > > > > > > > > > > > Any help is much appreciated > > > > > > > > > > > > Regards, > > > > > > Ravi > > > > > > > > > > > > > > > > > - > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
Re: need help with JoinUitl.createJoinQuery() method
It looks like your pk_p and pk_c fields aren't indexed -- they just have doc values. If you try making them KeywordFields instead (so they're indexed and have doc values), does it work? Also, the join module may be overkill for what you're trying to do, since it looks like you're indexing parent/child blocks anyway. You could use ToParentBlockJoinQuery instead (though you'd need to index the child docs together with the parent doc using IndexWriter#addDocuments). Of course, maybe your real use-case is more complicated, so the join module makes sense. On Wed, Jul 23, 2025 at 12:02 PM Markos Zaharioudakis wrote: > Hi, > > I am trying to create and run a "join" query using JoinUitl.createJoinQuery() > in Lucene 10.2.0. However, the query returns 0 results. I am attaching my > little test program. Can you please tell me what I am doing wrong? > > Thanks a lot, > Markos. > > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org
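For reference, a rough sketch of the KeywordField fix plus the join call (field names follow your pk_p/pk_c convention; parentDoc, childDoc, and searcher are placeholders, and MatchAllDocsQuery stands in for whatever child-side query you actually run):

// Make the join keys both indexed and doc-valued by using KeywordField.
parentDoc.add(new KeywordField("pk_p", "parent-1", Field.Store.NO));
childDoc.add(new KeywordField("pk_c", "parent-1", Field.Store.NO));

// Join from child docs to parent docs (ScoreMode is org.apache.lucene.search.join.ScoreMode):
Query joinQuery = JoinUtil.createJoinQuery(
    "pk_c",                   // fromField, on the child docs
    false,                    // multipleValuesPerDocument
    "pk_p",                   // toField, on the parent docs
    new MatchAllDocsQuery(),  // fromQuery: which child docs to join from
    searcher,                 // IndexSearcher over the same index
    ScoreMode.None);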