Hi Hank
Very cool, thank you! I will try to do this ASAP.
All the best
Michael
On 19.05.24 at 01:42, Chang Hank wrote:
Hey Michael,
I wrote a first version of my idea for implementing RRF in Lucene;
here is the link to the code:
https://gist.github.com/hack4chang/ee2b37eab80bd82e574ff4f94ed204e9
Right now I have two questions: one is about which shardIndex to
return, the other is about the TotalHits value. Please take a look at
the code and kindly leave some comments below.
Thanks,
Hank
On May 18, 2024, at 2:01 PM, Chang Hank <hackchang0...@gmail.com> wrote:
Or maybe we can first create an issue and PR based on the issue number?
WDYT?
Best,
Hank
On May 18, 2024, at 11:29 AM, Chang Hank <hackchang0...@gmail.com>
wrote:
Hey Michael,
Sorry, I was a bit busy this week, but I’ve looked into the resources
you provided and also the useful advice from Alessandro and Adrien.
I have a basic understanding of how RRF works, but I’m not quite
sure how we should implement it. Based on the advice from Alessandro
and Adrien, it seems we need to consider that the search results may
be located on different shards. According to Alessandro, we should
aggregate the ranked lists from all distributed nodes and then apply
RRF.
Are we going to implement this aggregation logic inside our RRF method?
Also, could you please create a PR so we can discuss the details
further?
All the best,
Hank
On May 13, 2024, at 10:09 AM, Michael Wechner
<michael.wech...@wyona.com> wrote:
Great, sounds like we have a plan :-)
Hank and I can get started trying to understand the internals
better ...
Thanks
Michael
On 13.05.24 at 18:21, Alessandro Benedetti wrote:
Sure, we can make it work, but in a distributed environment you
first have to run each query distributed (aggregating across all nodes)
and then apply RRF on top of the aggregated ranked lists.
Doing RRF per node first and then aggregating per shard won't return
the same results, I suspect.
When I go back to working on the task I'll be able to elaborate more!
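A minimal sketch of the ordering concern above, under stated assumptions: `Scored`, `mergeShards` and `rrf` are hypothetical stand-ins written for this illustration (not Lucene or Solr API), scores of one query are assumed comparable across shards, and k = 60 is just a commonly used constant. It compares "aggregate each query across shards, then fuse" with "fuse on each shard, then aggregate". Records require Java 16 or newer.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DistributedRrfSketch {
    /** Hypothetical stand-in for a hit: a document id plus a score. */
    record Scored(String id, double score) {}

    /** Best score first; ties fall back to the id so the order is deterministic. */
    static final Comparator<Scored> BEST_FIRST =
        Comparator.comparingDouble(Scored::score).reversed().thenComparing(Scored::id);

    /** Merges the per-shard lists of ONE query into a single list, best score first. */
    static List<Scored> mergeShards(List<List<Scored>> perShard) {
        List<Scored> all = new ArrayList<>();
        perShard.forEach(all::addAll);
        all.sort(BEST_FIRST);
        return all;
    }

    /** RRF: every list adds 1 / (k + rank) to a document, rank starting at 1. */
    static List<Scored> rrf(int k, List<List<Scored>> rankedLists) {
        Map<String, Double> fused = new HashMap<>();
        for (List<Scored> list : rankedLists) {
            for (int i = 0; i < list.size(); i++) {
                fused.merge(list.get(i).id(), 1.0 / (k + i + 1), Double::sum);
            }
        }
        List<Scored> out = new ArrayList<>();
        fused.forEach((id, score) -> out.add(new Scored(id, score)));
        out.sort(BEST_FIRST);
        return out;
    }

    public static void main(String[] args) {
        // Made-up data: shard 0 holds A; shard 1 holds B and C.
        List<Scored> kw0 = List.of(new Scored("A", 1.0));
        List<Scored> kw1 = List.of(new Scored("B", 5.0), new Scored("C", 4.0));
        List<Scored> vec0 = List.of(new Scored("A", 0.9));
        List<Scored> vec1 = List.of(new Scored("C", 0.95), new Scored("B", 0.5));

        // Variant 1: aggregate each query across shards first, then fuse.
        List<Scored> global = rrf(60, List.of(
            mergeShards(List.of(kw0, kw1)), mergeShards(List.of(vec0, vec1))));
        // Variant 2: fuse on each shard first, then merge by fused score.
        List<Scored> perShard = mergeShards(List.of(
            rrf(60, List.of(kw0, vec0)), rrf(60, List.of(kw1, vec1))));

        System.out.println("global:    " + global);
        System.out.println("per shard: " + perShard);
    }
}
```

With this made-up data the first variant ranks C on top and A last, while the second ranks A on top, because A is rank 1 in both lists of its own shard even though it is never rank 1 globally.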
Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/
e-mail: a.benede...@sease.io/
/
*Sease* - Information Retrieval Applied
Consulting | Training | Open Source
Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
Github <https://github.com/seaseltd>
On Mon, 13 May 2024 at 14:12, Adrien Grand <jpou...@gmail.com> wrote:
> Maybe Adrien Grand and others might also have some feedback :-)
I'd suggest that the signature look something like `TopDocs
TopDocs#rrf(int topN, int k, TopDocs[] hits)` to be consistent
with `TopDocs#merge`. Internally, it should look at
`ScoreDoc#shardIndex` and `ScoreDoc#doc` to figure out which hits
map to the same document.
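To make the suggestion concrete, here is a minimal, self-contained sketch of such a method. It is not Lucene code: `Hit` and `FusedHit` are hypothetical records standing in for `ScoreDoc`, and the tie-breaking rule is an assumption of this sketch. Each list contributes 1 / (k + rank) to a hit, and hits count as the same document when their shard and doc values match. Records require Java 16 or newer.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RrfSketch {
    /** Hypothetical stand-in for ScoreDoc: a hit is identified by (shardIndex, doc). */
    record Hit(int shardIndex, int doc) {}

    /** A fused hit together with its RRF score. */
    record FusedHit(Hit hit, double score) {}

    /**
     * Fuses ranked lists: each list adds 1 / (k + rank) to a hit's score,
     * where rank starts at 1 for the best hit of that list.
     */
    static List<FusedHit> rrf(int topN, int k, List<List<Hit>> rankedLists) {
        Map<Hit, Double> scores = new LinkedHashMap<>();
        for (List<Hit> list : rankedLists) {
            for (int i = 0; i < list.size(); i++) {
                // Records compare by value, so equal (shardIndex, doc) pairs collapse.
                scores.merge(list.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        List<FusedHit> fused = new ArrayList<>();
        scores.forEach((hit, score) -> fused.add(new FusedHit(hit, score)));
        // Highest score first; ties broken by shardIndex, then doc (sketch assumption).
        fused.sort(Comparator.comparingDouble(FusedHit::score).reversed()
            .thenComparingInt(f -> f.hit().shardIndex())
            .thenComparingInt(f -> f.hit().doc()));
        return fused.subList(0, Math.min(topN, fused.size()));
    }

    public static void main(String[] args) {
        // Made-up result lists of a keyword query and a vector query.
        List<Hit> keyword = List.of(new Hit(0, 1), new Hit(0, 2), new Hit(1, 7));
        List<Hit> vector = List.of(new Hit(0, 2), new Hit(1, 7), new Hit(0, 5));
        rrf(3, 60, List.of(keyword, vector)).forEach(System.out::println);
    }
}
```

A real implementation on `TopDocs` would need its own key built from the two `ScoreDoc` fields rather than relying on record equality, and would also have to decide what `TotalHits` to report for the fused result; this sketch leaves both open.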
> Back in the day, I was reasoning on this and I didn't think
Lucene was the right place for an interleaving algorithm,
given that Reciprocal Rank Fusion is affected by distribution
and it's not supposed to work per node.
To me this is like `TopDocs#merge`. There are changes needed
on the application side to hook this call into the logic that
combines hits that come from multiple shards (multiple queries
in the case of RRF), but Lucene can still provide the merging
logic.
On Mon, May 13, 2024 at 1:41 PM Michael Wechner
<michael.wech...@wyona.com> wrote:
Thanks for your feedback Alessandro!
I am using Lucene on its own, independent of Solr, OpenSearch, or
Elasticsearch, but would like to combine different result
sets using RRF, so I think that Lucene itself could
actually be a good place for it.
Looking forward to your additional elaboration!
Thanks
Michael
On 13.05.2024 at 12:34, Alessandro Benedetti
<a.benede...@sease.io> wrote:
This is not strictly related to Lucene, but I'll give a
talk at Berlin Buzzwords on how I am implementing
Reciprocal Rank Fusion in Apache Solr.
I'll resume my work on the contribution next week and
have more to share later.
Back in the day, I was reasoning on this and I didn't
think Lucene was the right place for an interleaving
algorithm, given that Reciprocal Rank Fusion is affected
by distribution and it's not supposed to work per node.
I think I evaluated the possibility of doing it as a
Lucene query or a Lucene component but then ended up with
a different approach.
I'll elaborate more when I go back to the task!
Cheers
On Sat, 11 May 2024 at 09:10, Michael Wechner
<michael.wech...@wyona.com> wrote:
sure, no problem!
Maybe Adrien Grand and others might also have some
feedback :-)
Thanks
Michael
On 10.05.24 at 23:03, Chang Hank wrote:
Thank you for these useful resources. Please allow
me to spend some time looking into them.
I’ll let you know ASAP!
Thanks
Hank
On May 10, 2024, at 12:34 PM, Michael Wechner
<michael.wech...@wyona.com> wrote:
Also, we might want to consider how this relates to
https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/Rescorer.html
In vector search, reranking has become quite
popular, e.g.
https://docs.cohere.com/docs/reranking
IIUC, LangChain (Python), for example, adds the
reranker as an argument to the searcher/retriever:
https://python.langchain.com/v0.1/docs/integrations/retrievers/cohere-reranker/
So maybe the following might make sense as well:
    TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10);
    TopDocs topDocsVector = vectorSearcher.search(vectorQuery, 50, new CohereReranker());
    TopDocs topDocs = TopDocs.merge(new RRFRanker(), topDocsKeyword, topDocsVector);
WDYT?
Thanks
Michael
On 10.05.24 at 21:08, Michael Wechner wrote:
Great, yes, let's get started :-)
What about the following pseudo code, assuming
that there might be alternative ranking algorithms
to RRF?
    // Pseudo code: Ranker, RRFRanker and TopDocs.rank(...) are proposed names
    // and do not exist in Lucene.
    StoredFields storedFieldsKeyword = indexReaderKeyword.storedFields();
    // Presumably the vector index reader was meant here, not indexReaderKeyword.
    StoredFields storedFieldsVector = indexReaderVector.storedFields();

    TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10);
    TopDocs topDocsVector = vectorSearcher.search(vectorQuery, 50);

    Ranker ranker = new RRFRanker();
    TopDocs topDocs = TopDocs.rank(ranker, topDocsKeyword, topDocsVector);

    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        Document docK = storedFieldsKeyword.document(scoreDoc.doc);
        Document docV = storedFieldsVector.document(scoreDoc.doc);
        ....
    }
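One way the "alternative ranking algorithms" idea could look, as a hedged sketch: the `Ranker` interface, its `weight` method, and the `rank` helper below are all hypothetical names invented for this illustration, and plain `Integer` doc ids stand in for `ScoreDoc`. The strategy only decides how much a position in a list is worth; RRF and a simplified Borda-style count are shown as two interchangeable strategies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RankerSketch {
    /** Hypothetical strategy: how much a position in one ranked list is worth. */
    interface Ranker {
        /** Contribution of the hit at 0-based position index in a list of size hits. */
        double weight(int index, int size);
    }

    /** Reciprocal Rank Fusion: 1 / (k + rank), rank starting at 1. */
    static Ranker rrf(int k) {
        return (index, size) -> 1.0 / (k + index + 1);
    }

    /** Simplified Borda-style count: the best of n hits earns n points, the last earns 1. */
    static Ranker borda() {
        return (index, size) -> size - index;
    }

    /** Sums the weights per doc id over all lists and returns the topN doc ids. */
    static List<Integer> rank(Ranker ranker, int topN, List<List<Integer>> rankedLists) {
        Map<Integer, Double> scores = new LinkedHashMap<>();
        for (List<Integer> list : rankedLists) {
            for (int i = 0; i < list.size(); i++) {
                scores.merge(list.get(i), ranker.weight(i, list.size()), Double::sum);
            }
        }
        List<Integer> docs = new ArrayList<>(scores.keySet());
        // Highest fused score first; ties fall back to the smaller doc id.
        docs.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return docs.subList(0, Math.min(topN, docs.size()));
    }

    public static void main(String[] args) {
        // Made-up doc ids returned by a keyword query and a vector query.
        List<List<Integer>> lists = List.of(List.of(2, 1, 5), List.of(3, 2, 4, 1));
        System.out.println("RRF:   " + rank(rrf(60), 3, lists));
        System.out.println("Borda: " + rank(borda(), 3, lists));
    }
}
```

On this made-up input the two strategies agree on the top hit but order the next two differently, which is the kind of difference a pluggable ranker would expose to the caller.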
See also:
https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/TopDocs.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html
WDYT?
Thanks
Michael
On 10.05.24 at 20:01, Chang Hank wrote:
Hi Michael,
Sounds good to me.
Let’s do it!!
Cheers,
Hank
On May 10, 2024, at 10:50 AM, Michael Wechner
<michael.wech...@wyona.com> wrote:
Hi Hank
Very cool!
Adrien Grand suggested implementing it as a
utility method on the TopDocs class, and since
Adrien has worked on Lucene for a decade
https://www.elastic.co/de/blog/author/adrien-grand
I guess it makes sense to follow his advice :-)
We could create a PR and work together on it,
WDYT?
All the best
Michael
On 10.05.24 at 18:51, Chang Hank wrote:
Hi Michael,
Thank you for the reply.
This is a really cool issue to work on, and I’m
happy to work on it with you. I’ll
research RRF first.
Also, are we going to implement this on the
TopDocs class?
Best,
Hank
On May 9, 2024, at 11:08 PM, Michael Wechner
<michael.wech...@wyona.com> wrote:
Hi Hank
Thanks for offering your help!
I recently suggested implementing RRF
(Reciprocal Rank Fusion)
https://lists.apache.org/thread/vvwvjl0gk67okn8z1wg33ogyf9qm07sz
but still have not found the time to really
work on this.
Maybe you would be interested in doing this, or
we could work on it together somehow?
Thanks
Michael
On 10.05.24 at 07:27, Chang Hank wrote:
Hi everyone,
I’m Hank Chang, currently studying
Information Retrieval topics. I’m really
interested in contributing to Apache Lucene
and enhancing my understanding of the field.
I’ve reviewed several issues posted on the
GitHub repository but haven’t found a
straightforward starting point. Could someone
please recommend suitable issues for a
newcomer like me or suggest areas I could
assist with?
Thank you for your time and guidance.
Best regards,
Hank Chang
---------------------------------------------------------------------
To unsubscribe, e-mail:
dev-unsubscr...@lucene.apache.org
For additional commands, e-mail:
dev-h...@lucene.apache.org
--
Adrien