This sounds doable, but we never got to it.
>
>
>
> On Fri, Jun 27, 2025 at 2:19 PM Michael Sokolov wrote:
>
> > Without this temp file we would need to load the entire set of vectors
> > for the new merged segment into RAM in order to support building an
> > HNSW gra
Without this temp file we would need to load the entire set of vectors
for the new merged segment into RAM in order to support building an
HNSW graph from it. This way we can read the vectors off the disk in
the same way we would do during normal searches. I'm not sure, but I
think the temp file s
wrote:
>
> I'm wondering if this is the same idea that Kaival is proposing in
> https://github.com/apache/lucene/issues/14758 (Support multiple HNSW graphs
> backed by the same vectors).
>
> On Thu, Jun 5, 2025 at 11:32 AM Michael Sokolov wrote:
>
> > I do think there c
key (customer id?) to the vectors somehow? If this was done
> > well it should lead to a natural clustering of the graph.
> >
>
> I can explore further on this. Thanks for the pointers..
>
> On Mon, Jun 2, 2025 at 11:14 PM Michael Sokolov wrote:
>
> > I wonder i
The docs range could vary in extremes from a few 10s to tens-of-thousands
> and in very heavy usage cases, 100k and above… in a single segment
>
Filtered HNSW, like you said, uses a single graph, which could be better if
> designed as sub-graphs
>
> On Mon, 2 Jun 2025 at 5:42 PM, Mic
How many documents do you anticipate in a typical sub range? If it's in the
hundreds or even low thousands you would be better off without HNSW.
Instead you can use a function score query based on the vector distance.
For larger numbers, where HNSW becomes useful, you could try using filtered
HNSW.
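For the small-range case, here is a minimal exact-scoring sketch, assuming
a recent Lucene 9.x; the field name "vec" and the reader/queryVector
variables are placeholders for your own setup:

    import java.io.IOException;
    import org.apache.lucene.index.FloatVectorValues;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.VectorSimilarityFunction;
    import org.apache.lucene.search.DocIdSetIterator;

    // Score every vector exactly, no graph involved; cheap when the
    // candidate set is small.
    static void scoreAll(IndexReader reader, float[] queryVector) throws IOException {
      for (LeafReaderContext ctx : reader.leaves()) {
        FloatVectorValues vectors = ctx.reader().getFloatVectorValues("vec");
        if (vectors == null) continue; // segment has no vectors for this field
        for (int doc = vectors.nextDoc();
             doc != DocIdSetIterator.NO_MORE_DOCS;
             doc = vectors.nextDoc()) {
          float score = VectorSimilarityFunction.COSINE.compare(queryVector, vectors.vectorValue());
          // collect (ctx.docBase + doc, score) in a bounded priority queue of size k
        }
      }
    }

In practice you would intersect with your sub-range filter first and only
score the matching docs.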
The message is telling you that you previously indexed the field
boe.search.wild_description with offsets and now you are trying to
index it without offsets. This probably indicates you are using a
different Analyzer, which is generally *not ok* since indexed fields
must be indexed in a consistent way.
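Concretely, the usual fix is to build one frozen FieldType and use it for
that field in every document; a sketch (the field name comes from the
report above, text is a placeholder):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexOptions;

    FieldType withOffsets = new FieldType(TextField.TYPE_NOT_STORED);
    withOffsets.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    withOffsets.freeze(); // prevents accidental per-document changes

    Document doc = new Document();
    // every document must index this field with the same options
    doc.add(new Field("boe.search.wild_description", text, withOffsets));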
You can combine queries; they are composable. Whether it makes sense
or not for your use case is something you will have to decide. To me
it's hard to see a case where vector query 1 AND vector query 2 would
be preferable to combining the vectors "up front" (ie when creating
the vectors), but mayb
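If you do want to try it, the queries compose like any others; a sketch
assuming Lucene 9.x KnnFloatVectorQuery (field name and vectors are
placeholders):

    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.KnnFloatVectorQuery;
    import org.apache.lucene.search.Query;

    // each clause runs its own approximate top-100 search; a document
    // matching both gets the sum of the two scores
    Query combined = new BooleanQuery.Builder()
        .add(new KnnFloatVectorQuery("vec", vectorA, 100), BooleanClause.Occur.SHOULD)
        .add(new KnnFloatVectorQuery("vec", vectorB, 100), BooleanClause.Occur.SHOULD)
        .build();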
One thing to check is whether the synonyms are configured as
bidirectional, or which direction they go (eg is "a b" being expanded
to "ab" but "ab" is not being expanded to "a b"??)
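With SynonymMap each rule is one-directional unless you add the reverse
rule yourself; a sketch (the "a b"/"ab" pair mirrors the example above):

    import java.io.IOException;
    import org.apache.lucene.analysis.synonym.SynonymMap;
    import org.apache.lucene.util.CharsRef;
    import org.apache.lucene.util.CharsRefBuilder;

    static SynonymMap buildMap() throws IOException {
      SynonymMap.Builder builder = new SynonymMap.Builder(true);
      // "a b" is expanded to "ab" ...
      builder.add(SynonymMap.Builder.join(new String[] {"a", "b"}, new CharsRefBuilder()),
                  new CharsRef("ab"), true);
      // ... but "ab" expands to "a b" only with this explicit reverse rule
      builder.add(new CharsRef("ab"),
                  SynonymMap.Builder.join(new String[] {"a", "b"}, new CharsRefBuilder()),
                  true);
      return builder.build();
    }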
On Wed, Mar 5, 2025 at 2:20 PM Mikhail Khludnev wrote:
>
> Hello Trevor.
>
> Maintaining such a synonym map is too
Stored fields is a separate format that stores data in a row-wise
fashion: all the stored data for a single document is written
together. Vectors aren't *also* copied into stored fields storage, so
the stored fields API can't be used to retrieve them. If we did allow
that it would result in massive duplication.
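The vector-specific read API is the way to get them back; a sketch,
assuming a recent Lucene 9.x and a field named "vec":

    import java.io.IOException;
    import org.apache.lucene.index.FloatVectorValues;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.ReaderUtil;

    static float[] getVector(IndexReader reader, int docId) throws IOException {
      LeafReaderContext ctx = reader.leaves().get(ReaderUtil.subIndex(docId, reader.leaves()));
      FloatVectorValues values = ctx.reader().getFloatVectorValues("vec");
      int target = docId - ctx.docBase;
      if (values != null && values.advance(target) == target) {
        return values.vectorValue().clone(); // the returned array may be reused
      }
      return null; // no vector indexed for this document
    }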
Sparse is meaning two different things here. In the case you found Mikhail,
it means not every document has a value for some vector field. I think the
question here is about very high dimensional vectors where most documents
have zeroes in most dimensions of the vector.
On Tue, Dec 3, 2024, 2:01 A
Another way is using postings - you can represent each dimension as a
term (`dim0`, `dim1`, etc) and index those that occur in a document.
To encode a value for a dimension you can either provide a custom term
frequency, or index the term multiple times. Then when searching you
can form a BooleanQuery.
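The query side of that idea might look like this sketch (the field name
"sparse" and the dim prefix are made up here):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // one SHOULD clause per non-zero query dimension; watch the
    // BooleanQuery clause limit for very dense queries
    BooleanQuery.Builder b = new BooleanQuery.Builder();
    for (int dim : new int[] {0, 7, 42}) {
      b.add(new TermQuery(new Term("sparse", "dim" + dim)), BooleanClause.Occur.SHOULD);
    }
    Query sparseVectorQuery = b.build();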
That's interesting! One thing I'd say is we don't want to be
optimizing for the random vector use case, so from that perspective
this is less concerning. However we also don't want to have poor
worst-case performance, so we should address this somehow. If you want
to probe for degenerate cases, yo
Do you actually use org.apache.lucene.replicator.http ? If not then
this wouldn't have any material impact on your application.
On Mon, Oct 28, 2024 at 4:25 AM Renaud SAINT-GRATIEN
wrote:
>
> CONFIDENTIAL
>
> Hello,
>
> Is there any plan to patch Lucene 8.11 for CVE-2024-45772 ?
> I need to stay
I think this might be a better question for solr-user@? EG I don't
understand how Solr decides which Query to send to populateScores --
is it the same one that was used to generate the matches in topDocs?
It seems as if it should be, but then this error shouldn't happen ...
I wonder if you can prin
> If your two indexes load data sequentially and in the same order, then I
believe that you would get the same results. But we consider this an
implementation detail rather than a guarantee that Lucene should have.
You might even still be surprised by nondeterminism arising from
concurrency during
Yes, there is no support for upgrading a pre-8.x index to 9 or later.
At some point it was decided that supporting that would lead to grief
for users and/or hamper development of Lucene, so now you can only
upgrade one major version. If you need to do so, the best supported
option is to write a pro
You could switch to DocValues, and it would probably be more efficient
if you are only retrieving a single stored field but you have a lot of
other ones in the index since stored fields are stored together and
have to be decoded together. As far as visiting every segment on disk
I'm not sure what
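For reference, reading a numeric doc value per segment looks like this
sketch (the field name "price" is hypothetical; ctx and docInSegment are
assumed from a per-leaf loop, visiting docs in increasing order):

    import org.apache.lucene.index.NumericDocValues;

    // ctx is the LeafReaderContext of the segment being visited
    NumericDocValues prices = ctx.reader().getNumericDocValues("price");
    if (prices != null && prices.advanceExact(docInSegment)) {
      long price = prices.longValue(); // decoded column-wise, independent of other fields
    }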
Neat!
On Thu, Jun 6, 2024, 2:57 AM Balog Tamás
wrote:
> Dear Lucene Community,
> Since Tuesday, the IntelliJ plugin called [Lucas](
> https://plugins.jetbrains.com/plugin/24567-lucas) is available on the
> JetBrains Marketplace.
>
> It integrates / ports the Luke toolbox to the IntelliJ Platform
th a code
> > search).
> > We also always merge down to one segment (historical but also we index
> > once and then there are no changes for a week to a month and then we
> > reindex every document from scratch).
> >
> > Your response is very helpful already and
It seems as if the term frequency for some term exceeded the maximum.
This can happen if you supplied custom term frequencies eg with
https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/analysis/tokenattributes/TermFrequencyAttribute.html?is-external=true
. The behavior didn't change since
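For completeness, supplying custom frequencies is done from the analysis
chain; a sketch (the constant 5 is arbitrary, and the field must be indexed
with IndexOptions.DOCS_AND_FREQS, i.e. no positions):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute;

    public final class CustomTermFreqFilter extends TokenFilter {
      private final TermFrequencyAttribute freqAtt = addAttribute(TermFrequencyAttribute.class);

      public CustomTermFreqFilter(TokenStream in) {
        super(in);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
          return false;
        }
        freqAtt.setTermFrequency(5); // must be >= 1; per-document totals are capped
        return true;
      }
    }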
I also found this helpful documentation by looking in the source code
of SearchFiles.java: https://lucene.apache.org/core/9_10_0/demo/
On Mon, Apr 22, 2024 at 4:40 AM Stefan Vodita wrote:
>
> Hi Siddharth,
>
> If you happen to be using IntelliJ, you can run a demo class from the IDE.
> It probabl
To get best results it's necessary to tune these parameters for each vector
model. My suggestion is to use a subset of your 100M vectors for parameter
optimization to save time while iterating through the parameters space as
you will indeed need to reindex in order to measure
Generally speaking, i
> In Lucene scores should go up for more relevancy.
That is the case for combining child scores with min. min() is monotonic --
if its arguments increase, the result does not decrease, it only stays the
same or increases, so I think it is a valid scoring operation for Lucene.
And it makes some logical sense.
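The monotonicity claim, spelled out:

    a \le a' \;\wedge\; b \le b' \;\Longrightarrow\; \min(a, b) \le \min(a', b')

so if each child score only grows with relevancy, the min-combined parent
score never ranks a strictly better document lower.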
another thing to check beyond whether the correct documents are
matched is whether the correct score is returned. I'm not sure
actually how it works but I can imagine that a query for "red red
wine" would produce a higher score for documents having "red red wine"
than it would for documents having
well, it is "approximate" KNN and can get caught in local minima
(maxima?). Increasing K has, indirectly, the effect of expanding the
search space because the minimum score in the priority queue (the score of
the Kth item) is used as a threshold for deciding when to terminate
the search
On Wed, Aug 2,
Hi Deepika, that would be a welcome addition - we had an earlier
discussion about it; see the thread here:
https://markmail.org/message/hq7jvobsnxwp7iat
Please be careful not to copy the code from Elastic as it is not
shared under an open license that permits copying
On Wed, May 24, 2023 at 3:19
You might also want to have a look at FeatureField. This can be used
to associate a score with a particular term.
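A sketch of FeatureField usage ("features"/"data_richness" are made-up
names; the per-document weight must be positive):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.FeatureField;
    import org.apache.lucene.search.Query;

    // index time: attach a per-document weight
    Document doc = new Document();
    doc.add(new FeatureField("features", "data_richness", 0.8f));

    // query time: turn the weight into a (saturating) score contribution,
    // typically added as a SHOULD clause next to the main query
    Query boost = FeatureField.newSaturationQuery("features", "data_richness");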
On Thu, May 11, 2023 at 1:13 PM Hrvoje Lončar wrote:
>
> I had a situation when i wanted to sort a list of articles based on the
> amount of data entered. For example, article having
3ac51ece953d762c796f62730e27629966/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L281
>
> Thanks,
> Wei
>
>
> On Thu, May 4, 2023 at 11:47 AM Michael Sokolov wrote:
>
> > Yes, sorry I didn't mean to imply you couldn't control this if you
with early termination. Do you think this makes sense? Any
> > suggestion is appreciated.
> >
> > Thanks,
> > Wei
> >
> > On Thu, May 4, 2023 at 3:33 AM Michael Sokolov wrote:
> >
> > > There is no meaning to the sequence. The segments are created
There is no meaning to the sequence. The segments are created concurrently
by many threads and the merge process will merge them without regards to
any ordering.
On Wed, May 3, 2023, 1:09 PM Patrick Zhai wrote:
> For that part I'm not entirely sure, if other folks know it please chime in
> :)
Lucene is licensed under the Apache license, just as it says in the
LICENSE file. junit is used for testing Lucene and is not
redistributed with it. Using Lucene in your code does not mean you are
using junit, except in some extremely philosophical sense. EG Lucene
developers may have developed Luc
Sorry your problem statement makes no sense: you should be able to
store field data in the index without loading all your documents into
RAM while indexing. Maybe there is some constraint you are not telling
us about? Or you may be confused. In any case highlighting requires
the document in its uni
I would suggest building Lucene from source and adding your own
similarity function to VectorSimilarity. That is the proper extension
point for similarity functions. If you find there is some substantial
benefit, it wouldn't be a big lift to add something like that. However
I'm dubious about the li
That label seems to be something GitHub created automatically?
You might have better luck browsing the full list of labels. I found these:
https://github.com/apache/lucene/labels/legacy-jira-label%3Anewbie
https://github.com/apache/lucene/labels/legacy-jira-label%3Anewdev
https://github.com/apach
+1 trying to coordinate multiple writers running independently will
not work. My 2c for availability: you can have a single primary active
writer with a backup one waiting, receiving all the segments from the
primary. Then if the primary goes down, the secondary one has the most
recent commit replicated.
Have you tried escaping with a backslash? I have a vague memory that
might work. As for modifying classes in 4.10.4, you are welcome to do
so in a custom fork, but that version is so old that we no longer post
fixes for it on the official Apache release branches. The current
release series is 9.x -
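The classic QueryParser has a helper for that; a sketch (userInput,
analyzer, and the field name are placeholders):

    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;

    // backslash-escapes the parser's special characters (+ - ! ( ) etc.)
    String escaped = QueryParser.escape(userInput);
    Query q = new QueryParser("body", analyzer).parse(escaped); // throws ParseException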
The error you got
BufferedChecksumIndexInput(MMapIndexInput(path="tests_small_index-7.x-migrator\segments_1"))):
9 (needs to be between 6 and 7)
indicates that the index you are reading was written by Lucene 9, so
things are not set up the way you described (writing using Lucene 7)
> Thanks TX
I'd agree with the main point re: the need to combine vector-based
matching with term-based matching.
As for the comparison with Lucene, I'd say it's a shallow and biased
take. The main argument is that Vespa's mutable in-memory(?) data
structures are superior to Lucene's immutable on-disk segment
The Lucene PMC is pleased to announce the release of Apache Lucene 9.4.0.
Apache Lucene is a high-performance, full-featured search engine
library written entirely in Java. It is a technology suitable for
nearly any application that requires structured search, full-text
search, faceting, nearest-neighbor search across high-dimensionality
vectors, spell correction or query suggestions.
I think it depends how precise you want to make the search. If you
want to enable diacritic-sensitive search in order to avoid confusions
when users actually are able to enter the diacritics, you can index
both ways (ascii-folded and not folded) and not normalize the query
terms. Or you can just fo
ooh
On Fri, Sep 23, 2022 at 11:02 AM Adrien Grand wrote:
>
> We have a TruncateTokenFilter in lucene/analysis/common. :)
>
> On Fri, Sep 23, 2022 at 4:39 PM Michael Sokolov wrote:
>
> > I wonder if it would make sense to provide a TruncationFilter in
> > addition to
I wonder if it would make sense to provide a TruncationFilter in
addition to the LengthFilter. That way long tokens in source text
could be better supported, albeit with some confusion if they share
the same very long prefix...
On Fri, Sep 23, 2022 at 9:56 AM Scott Guthery wrote:
>
> Thanks much,
no, and I think it could be challenging to go the route of using
Dalvik/ART. Maybe you can run an actual JDK on Android? See
https://openjdk.org/projects/mobile/android.html
On Fri, Sep 9, 2022 at 9:27 AM Jie Wang wrote:
>
> Hey,
>
> Recently, I am trying to compile the Lucene to get a jar that c
Thanks! It seems to be working nicely.
Question about the fix-version: tagging. I wonder if going forward we
want to maintain that for new issues? I happened to notice there is also
this "milestone" feature in github -- does that seem like a place to
put version information?
On Wed, Aug 24, 2022 at 3
https://home.apache.org/~mikemccand/lucenebench/ shows how various
benchmarks have evolved over time *on the main branch*. There is no
direct comparison of every version against every other version that I
have seen though.
On Tue, Jul 26, 2022 at 2:12 PM Baris Kazar wrote:
>
> Dear Folks,-
> Sim
Oh good! Thanks for clarifying, Uwe
On Sat, Jul 9, 2022, 12:23 PM Uwe Schindler wrote:
> Hi
> > FuzzyQuery/MultiTermQuery and I don't see any way to "boost" exact
> > matches, or even to incorporate the edit distance more generally into
> > the per-term score, although it does seem like that wou
I am no expert with this, but I got curious and looked at
FuzzyQuery/MultiTermQuery and I don't see any way to "boost" exact
matches, or even to incorporate the edit distance more generally into
the per-term score, although it does seem like that would be something
people would generally expect. So
Lucene core is a no-dependencies library. Some of the other Lucene
modules, and the build and tests, have dependencies, but none of them
includes log4j. So sorry, but we won't be making Lucene use log4j
2.17.2; probably you should get your compliance standards changed to
include *forbidden* version
OK I replied on the issue. This ann-benchmarks is a separate project,
and I think you are asking about how to change it. Probably should
take it up with erikbern or whatever community is supporting that
actively. I just created a "plugin" so we could use it to test
Lucene's KNN implementation, but
thanks, I fixed the doc!
On Tue, Apr 26, 2022 at 9:13 AM Bridger Dyson-Smith
wrote:
>
> Hi Michael -
>
> On Mon, Apr 25, 2022 at 5:38 PM Michael Wechner
> wrote:
>
> > Hi Bridger
> >
> > Inside
> >
> > https://dlcdn.apache.org/lucene/java/9.1.0/lucene-9.1.0.tgz
> >
> > you should find
> >
> > mo
Looking at git blame I see the current parameter was added here:
https://issues.apache.org/jira/browse/LUCENE-6648. Previous
implementations supported a BitSet rather than a Query. I'm not really
sure what the use case is for applying additional filtering when
faceting. Perhaps it can support somet
Another approach for retrieving large result sets can work if you have
a unique sort key, and don't mind retrieving your results sorted by
this key. Then you can retrieve the results in batches using a
cursor-style approach; request the top N sorted by the key. Then
request the top N s.t. the key is greater than the last key of the
previous batch.
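IndexSearcher.searchAfter packages exactly that cursor; a sketch, assuming
the unique key is indexed as a sortable (doc values) field and searcher/query
already exist:

    import org.apache.lucene.search.FieldDoc;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TopDocs;

    Sort byKey = new Sort(new SortField("uniqueKey", SortField.Type.STRING));
    TopDocs page = searcher.search(query, 1000, byKey);
    while (page.scoreDocs.length > 0) {
      // ... process this batch ...
      FieldDoc last = (FieldDoc) page.scoreDocs[page.scoreDocs.length - 1];
      page = searcher.searchAfter(last, query, 1000, byKey); // "key > last" under the hood
    }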
HI Marc, I wonder if there is a workaround for this issue: eg, could
we have entries for both widths? I wonder if there is some interaction
with an analysis chain that is doing half-width -> full-width
conversion (or vice versa)? I think the UserDictionary has to operate
on pre-analyzed tokens ...
I think the "broken offsets" refers to offsets of tokens "going
backwards". Offsets are attributes of tokens that refer back to their
byte position in the original indexed text. Going backwards means -- a
token with a greater position (in the sequence of tokens, or token
graph) should not have a lesser start offset.
Strictly speaking, we could have opened an older index using Lucene 8
(say one that was created using Lucene 7, or 6) that would no longer
be valid in Lucene 9, at least according to the policy? I agree we
should try to fix this, just want to clarify the policy
On Tue, Dec 14, 2021 at 8:54 AM Adri
I wonder if the Analysis chain could be involved. If those stop words
("is") are removed without leaving a hole somehow, then that could
explain it?
On Mon, Dec 13, 2021 at 9:35 AM Michael McCandless
wrote:
>
> Hello Claude,
>
> Hmm, that is interesting that you see slop=2 matching query "quick fox"
I think you are asking how to re-sort a result set returned from
IndexSearcher.search, ie a TopDocs? You can do this with one of the various
Rescorers. Have you looked at those?
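For example, with QueryRescorer (the weight and topN here are arbitrary;
firstPassTopDocs/expensiveQuery are placeholders):

    import org.apache.lucene.search.QueryRescorer;
    import org.apache.lucene.search.TopDocs;

    // re-rank only the first-pass top docs with a second, heavier query;
    // 2.0 is the weight given to the rescoring query's scores
    TopDocs rescored = QueryRescorer.rescore(searcher, firstPassTopDocs, expensiveQuery, 2.0, 100);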
On Tue, Nov 30, 2021, 9:15 AM Luís Filipe Nassif
wrote:
> Hi Lucene community,
>
> Our users could do very heavy searc
I'll try to reproduce the hang first and then try to get the JVM logs.
> > > I'll
> > > > respond back here if I find something useful.
> > > >
> > > > > Do you get this error in lucene:core:ecjLintMain and not during
> > > compile?
>
ure as well?
>
> Thanks again!
> Kevin
>
> On Tue, Oct 19, 2021 at 5:07 AM Michael Sokolov wrote:
>
> > > I would be a bit careful: On our Jenkins server running with AMD Ryzen CPU
> > it happens quite often that JDK 16, JDK 17 and JDK 18 hang during tests and
> >
> I would be a bit careful: On our Jenkins server running with AMD Ryzen CPU it
> happens quite often that JDK 16, JDK 17 and JDK 18 hang during tests and stay
> unkillable (only a hard kill with "kill -9"). Previous Java versions don't
> hang. It happens not all the time (about 1/4th of all builds
Yeah, index sorting doesn't do that -- it sorts *within* each segment
so that when documents are iterated (within that segment) by any of
the many DocIdSetIterators that underlie the Lucene search API, they
are retrieved in the order specified (which is then also docid order).
To achieve what you
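For reference, this is the within-segment sort being described; the
timestamp field is a made-up example and must also be indexed as doc
values, and analyzer is a placeholder:

    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;

    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    // newest-first *within* each segment; says nothing about global order
    iwc.setIndexSort(new Sort(new SortField("timestamp", SortField.Type.LONG, true)));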
Thank you for offering to add to the FAQ! Indeed it should mention the
suggester capability. I think you have permissions to edit that wiki?
Please go ahead and I think add a link to the suggest module javadocs
On Thu, Oct 7, 2021 at 2:30 AM Michael Wechner
wrote:
>
> Thanks very much for your fe
Ah sorry never mind. Confused collector and collector manager
On Fri, Sep 24, 2021, 6:51 AM Michael Sokolov wrote:
> Separate issue, but this collector is not going to work with concurrent
> search since the sum is not updated in a thread safe manner. Maybe you
> don't care, since
Separate issue, but this collector is not going to work with concurrent
search since the sum is not updated in a thread safe manner. Maybe you
don't care, since you don't use a thread pool to execute your queries, but
you probably should!
On Wed, Sep 22, 2021, 8:38 AM Adrien Grand wrote:
> Hi St
query, and rely on log(a)+log(b) = log(a * b).
>
> On Fri, Sep 17, 2021 at 14:47, Michael Sokolov
> wrote:
>
> > Not advocating any particular approach here, just curious: could BMW
> > also function in the presence of a doc-score (like recency) that is
> > multi
Not advocating any particular approach here, just curious: could BMW
also function in the presence of a doc-score (like recency) that is
multiplied? My vague understanding is that as long as the scoring
formula is monotonic in all of its inputs, and we have block-encoded
the inputs, then we could c
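The identity being relied on, for concreteness:

    \log(s \cdot r) = \log s + \log r

and since log is monotone increasing, ranking by the additive log-space
score preserves the order induced by the multiplicative score.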
nt hits(doc hits
> matching 2 USD with 150 INR records). Any pointers to know about this in
> detail?
>
>
> Kumaran R
> Chennai, India
>
>
>
> On Fri, Sep 3, 2021 at 12:08 AM Michael Sokolov wrote:
>
> > Have you looked at the expressions module? It pr
Have you looked at the expressions module? It provides support for
user-defined computation using values from the index based on a simple
expression language. It might prove useful to you if the exchange rate
needs to be tracked very dynamically.
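A sketch of the expressions route, assuming a recent Lucene 9.x (the field
and binding names are invented; compile() throws ParseException):

    import org.apache.lucene.expressions.Expression;
    import org.apache.lucene.expressions.SimpleBindings;
    import org.apache.lucene.expressions.js.JavascriptCompiler;
    import org.apache.lucene.search.DoubleValuesSource;
    import org.apache.lucene.search.Sort;

    Expression expr = JavascriptCompiler.compile("price_usd * fx_rate");
    SimpleBindings bindings = new SimpleBindings();
    bindings.add("price_usd", DoubleValuesSource.fromLongField("price_usd"));
    bindings.add("fx_rate", DoubleValuesSource.constant(83.2)); // swap in today's rate per query
    Sort sort = new Sort(expr.getSortField(bindings, true)); // highest value first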
On Thu, Sep 2, 2021 at 2:15 PM Kumaran Ramasubraman
I think the usual usage pattern is to *refresh* frequently and commit
less frequently. Is there a reason you need to commit often?
You may also have overlooked this newish method: MergePolicy.findFullFlushMerges
If you implement that, you can tell IndexWriter to (for example) merge
multiple small
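The shape of that pattern, as a sketch (writer is your IndexWriter):

    import org.apache.lucene.search.SearcherFactory;
    import org.apache.lucene.search.SearcherManager;

    SearcherManager manager = new SearcherManager(writer, new SearcherFactory());

    // after each batch of updates: cheap, makes changes visible to new searchers
    manager.maybeRefresh();

    // much less often: durable, fsyncs the new segments to disk
    writer.commit();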
... should *reindex* (not update)
On Thu, May 27, 2021 at 10:39 AM Michael Sokolov wrote:
>
> LGTM, but perhaps also should state that if possible you *should*
> update because the 8.x index may not be able to be read by the
> eventual 10 release.
>
> On Thu, May 27, 2021
work :-)
> >
> > Thank you very much!
> >
> > But IIUC it is recommended to reindex when upgrading, right? I guess
> > similar to what Solr is recommending
> >
> > https://solr.apache.org/guide/8_0/reindexing.html
> >
> >
> > > On 26.05.21
This java implementation will be slower than the C implementation. I
believe the algorithm is essentially the same, however this is new and
there may be bugs! I (and I think Julie had similar results IIRC)
measured something like 8x slower than hnswlib (using ann-benchmarks).
It is also surprising
I think you need backward-codecs-9.0.0-SNAPSHOT there. It enables 9.0
to read 8.x indexes.
On Wed, May 26, 2021 at 9:27 AM Michael Wechner
wrote:
>
> Hi
>
> I am using Lucene 8.8.2 in production and I am currently doing some
> tests using 9.0.0-SNAPSHOT, whereas I have included
> lucene-backward-
Hi Michael, that is fully-functional in the sense that Lucene will
build an HNSW graph for a vector-valued field and you can then use the
VectorReader.search method to do KNN-based search. Next steps may
include some integration with lexical, inverted-index type search so
that you can retrieve N-cl
You might want to check out
https://issues.apache.org/jira/browse/LUCENE-8019 where I tried to
implement some debugging utilities on top of Explain. It never got
committed, but it does explore some of the challenges around
introducing a more structured explain response.
On Fri, Apr 9, 2021 at 6:40
See https://issues.apache.org/jira/browse/LUCENE-9640
On Wed, Mar 17, 2021 at 4:02 PM Paul Libbrecht
wrote:
>
> Explain is a heavyweight thing. Maybe it helps you, maybe you need
> something high-performance.
>
> I was asking a similar question ~10 years ago and got a very interesting
> answer on
s a version stamp X-2 or
> older.
>
> Best,
> Erick
>
> > On Nov 20, 2020, at 7:57 AM, Michael Sokolov wrote:
> >
> > I think running the upgrade tool would also be necessary to set you up for
> > the next upgrade, when 9.0 comes along.
> >
> > O
I think running the upgrade tool would also be necessary to set you up for
the next upgrade, when 9.0 comes along.
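The tool can also be run programmatically; a sketch (the path is a
placeholder):

    import java.nio.file.Paths;
    import org.apache.lucene.index.IndexUpgrader;
    import org.apache.lucene.store.FSDirectory;

    // rewrites all segments in the newest format; keeps only the last commit
    try (FSDirectory dir = FSDirectory.open(Paths.get("/path/to/index"))) {
      new IndexUpgrader(dir).upgrade();
    }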
On Fri, Nov 20, 2020, 4:25 AM Uwe Schindler wrote:
> Hi,
>
> > Currently I am using Lucene 7.3, I want to upgrade to lucene 8.5.1.
> Should
> > I do reindexing in this case ?
>
> No
You can't directly compare disk usage across two indexes, even with
the same data. Try re-indexing one of your datasets, and you will see
that the disk size is not the same. Mostly this is due to the way
segments are merged varying with some randomness from one run to
another, although the size of
A1, D, A2 (binding)
On Fri, Sep 4, 2020 at 12:46 AM David Smiley wrote:
>
> (binding)
> vote: D, A1
>
>
> (thanks Ryan for your thorough vote instructions & preparation)
So ... this is a fairly complex topic I can't really cover it in depth
here; how to architect a distributed search engine service. Most
people opt to use Solr or Elasticsearch since they solve that problem
for you. Those systems work best when the indexes are local to the
service that is accessing
A1, binding
On Mon, Aug 31, 2020 at 8:26 PM Ryan Ernst wrote:
>
> Dear Lucene and Solr developers!
>
> In February a contest was started to design a new logo for Lucene
> [jira-issue]. The initial attempt [first-vote] to call a vote resulted in
> some confusion on the rules, as well the request
If you are trying to show documents that have facet value V1 excluding
those with facet value V1.1, then you would need to issue a query
like:
+f:V1 -f:V1.1
assuming your facet values are indexed in a field called "f". I don't
think this really has anything to do with faceting; it's just a
filter.
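As a query object that would be (assuming plain string terms in field "f"):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    Query v1NotV11 = new BooleanQuery.Builder()
        .add(new TermQuery(new Term("f", "V1")), BooleanClause.Occur.MUST)
        .add(new TermQuery(new Term("f", "V1.1")), BooleanClause.Occur.MUST_NOT)
        .build();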
We have some prototype implementations in the issues you found. If
you want to try out the approaches in those issues, you could build
Lucene from source and patch it, but there is no release containing
KNN/vector support. We're still working to establish consensus on what
the best way forward is.
; stateful or has to store a state that should be available later.
> Or, on the other hand, understand if there is an order in the methods calls
> (first getValues then needsScores, first advanceExact then doubleValue).
> Don't you agree?
>
>
> On Mon, Jul 6, 2020 at 4:5
I found that when there is explicit code, many implementations return
> directly: false.
>
> What does this mean? why and when should I return true or false?
>
>
> On Mon, Jul 6, 2020 at 2:50 PM Michael Sokolov wrote:
>
> > Did you read the DoubleValuesSourc
Did you read the DoubleValuesSource javadocs, and find they weren't enough?
On Sun, Jul 5, 2020 at 7:54 AM Vincenzo D'Amore wrote:
>
> Hi all,
>
> Finally I have a custom DoubleValuesSource that gives the expected results,
> but I'm a little worried about the lack of documentation.
>
> When you e
s at capacity, I just return 0 for any docs that had a
> > boolean query score smaller than the min in the queue.
> >
> > But you can actually forget entirely that this ScoreFunction exists. It
> > only contributes ~6% of the runtime.
> > Even if I only use the Boole
You might consider using a TermInSetQuery in place of a BooleanQuery
for the hashes (since they are all in the same field).
I don't really understand why you are seeing so much cost in the heap
- it sounds as if you have a single heap with mixed scores - those
generated by the BooleanQuery and t
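A sketch of the substitution (the hash values are invented). Note
TermInSetQuery is constant-scoring, so this only fits if you don't need
per-term score contributions from the hash clauses:

    import java.util.List;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermInSetQuery;
    import org.apache.lucene.util.BytesRef;

    // one query over all hashes in one field, instead of N SHOULD clauses
    Query hashes = new TermInSetQuery("hash",
        List.of(new BytesRef("a1b2c3"), new BytesRef("d4e5f6")));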
A
non-PMC
On Tue, Jun 16, 2020 at 4:52 PM Bruno Roustant wrote:
>
> C - current logo
> not PMC
>
> On Tue, Jun 16, 2020 at 21:38, Erik Hatcher wrote:
>>
>> C - current logo
>>
>> On Jun 15, 2020, at 6:08 PM, Ryan Ernst wrote:
>>
>> Dear Lucene and Solr developers!
>>
>> In February a contest
e this approximation or at least can get the approximated value, so
> that I can use it for my own calculations.
>
> On 2020-06-02 18:48, Michael Sokolov wrote:
> > You could append an EOF token to every indexed text, and then iterate
> > over Terms to get the positions o
You could append an EOF token to every indexed text, and then iterate
over Terms to get the positions of those tokens?
On Tue, Jun 2, 2020 at 11:50 AM Moritz Staudinger
wrote:
>
> Hello,
>
> I am not sure if I am at the right place here, but I got a question about
> the approximation my Lucene im
So -- you update a single document and the call to updateDocument
takes 3 minutes? Or you update a single document and call commit() and
that takes 3 minutes? Or -- you update 10 documents and call
commit() and that takes 3 minutes? We can't help you with the level of
detail you've provided. As
I don't know of any pre-existing thing that does exactly this, but how
about a token filter that counts tokens (or positions maybe), and then
appends some special token encoding the length?
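A sketch of such a filter ("__len_" is an invented marker prefix):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class TokenCountFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private int count;
      private boolean emitted;

      public TokenCountFilter(TokenStream in) {
        super(in);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
          count++;
          return true;
        }
        if (!emitted) { // after the last real token, emit e.g. "__len_42"
          emitted = true;
          clearAttributes();
          termAtt.setEmpty().append("__len_").append(Integer.toString(count));
          return true;
        }
        return false;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        count = 0;
        emitted = false;
      }
    }

A "field length" query is then just a TermQuery for "__len_42".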
On Sat, Dec 28, 2019, 9:36 AM Matt Davis wrote:
> Hello,
>
> I was wondering if it is possible to search f
Have you tried making a BooleanQuery with a term for every word in the
query document as Optional? You will get a lot of matches, ranked
according to the similarity.
On Thu, Dec 12, 2019 at 10:47 AM John Brown wrote:
>
> Hi,
>
>
>
> I have some questions about how to use Lucene for the specific
In Solr and ES this is done with faceting and aggregations,
respectively, based on Lucene's low-level APIs. Have you looked at
TermsEnum? You can use that to get all distinct terms for a segment,
and then it is up to you to coalesce terms across segments ("leaves").
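A sketch of the per-segment iteration with the coalescing step (the field
name is a placeholder):

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    static Set<String> distinctTerms(IndexReader reader, String field) throws IOException {
      Set<String> distinct = new HashSet<>();
      for (LeafReaderContext ctx : reader.leaves()) { // one pass per segment ("leaf")
        Terms terms = ctx.reader().terms(field);
        if (terms == null) continue;
        TermsEnum te = terms.iterator();
        for (BytesRef term = te.next(); term != null; term = te.next()) {
          distinct.add(term.utf8ToString()); // the set coalesces across segments
        }
      }
      return distinct;
    }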
On Thu, Nov 21, 2019 at 1:15 AM
ause it does not check if the
> character is a letter or not.
> e.g., "123455" is trimmed to "12345" by FrenchMinimalStemmer.
>
> To me, this behaviour is beyond stemming.
>
> Tomoko
>
> On Sun, Jul 28, 2019 at 4:55 Michael Sokolov :
> >
> > I'
I'm not so sure. I think the whole idea of having both stemmers is that the
minimal one does less than the light one.
Removing the final character of a double letter suffix is going to
sacrifice some precision. For example mes/mess, ne/née, I'm sure there are
others.
So having both options is hel
ocument,
> and if the term also matches an ignore word, then ignore the match.
>
> I hadn't considered the stopwords approach, I'll look into that.
> If I add all the ignore words as stop words, will that affect highlighting?
> Are the stopwords still available for highlight