Answering in line:
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benede...@sease.io


*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>


On Sun, 28 Jan 2024 at 16:36, kumar gaurav <kg2...@gmail.com> wrote:

> Hi Charlie and Alessandro
>
> Thank you very much for replying. It is very helpful.
>
> Both of your links are very useful, and I am very grateful to you both.
> You are both suggesting hybrid search.
>
> Alessandro
> I have already read the
> https://sease.io/2023/12/hybrid-search-with-apache-solr.html post and
> experimented with it. I am getting keyword search results + vector search
> results in a single query.
> I am a very big fan of your work and always follow the https://sease.io
> tutorials regarding Solr neural search.
>

*Ale*: thanks! That's much appreciated; I'm proud and happy to see that the
work my team and I do helps people around the world! :D
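
Just to make the hybrid setup concrete for anyone else reading the thread,
a lexical + knn request in that spirit can be sketched roughly as follows
(only an illustration: the title field, the example query terms and the
placeholder vector values are made up here, and the blog post above is the
reference for the exact recipe):

q={!bool should=$lexicalQuery should=$vectorQuery}
&lexicalQuery={!edismax qf=title}running shoes
&vectorQuery={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]

The bool parser simply combines the two clauses, so a document is scored by
the lexical clause, the knn clause, or both, depending on which of them it
matches.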

>
> I have some questions and would like some clarification.
>
> 1. I totally get your point that scores are generated dynamically, so
> I am not sure how you would implement the cutoff feature.
> I used "fq":"{!frange l=0.4}query($q,0)" to remove docs with a score below
> 0.4, which works, but it is not very helpful because scores are generated
> dynamically for each request.
>

*Ale*: there are possible heuristics that can be applied per query.
Unfortunately, they are purely based on the score, so they may not match
your relevance expectations if the embedding model is not that great. But I
have many ideas I would like to explore as soon as we get some sponsors.
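
For reference, the full shape of that per-request cutoff would be something
like the following (just a sketch: the 0.4 lower bound, topK and the
placeholder vector are illustrative, and, as said, a fixed bound on raw
scores will behave differently from one query to the next):

q={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]
&fq={!frange l=0.4}query($q,0)

Here query($q,0) evaluates the main query to get each document's score
(defaulting to 0 when it doesn't match), and frange keeps only the documents
whose score is at least the lower bound l.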

>
> 2. I also want to ask you one question regarding taxonomy vector
> generation.
> In the context of e-commerce data, do you recommend putting all field data
> into one sentence for vector fields, or should we use only the main fields,
> like product name and category?
>
*Ale*: when talking about taxonomies, I am not convinced you should
vectorise the field as simple text at all, unless you want to find
similarities across categories that are not exactly the same. I would
probably use the categorical information to add some elements to the vector
(or use it as a separate feature in a Learning To Rank model, along with
the vector similarity).
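
To make the second option a bit more concrete, one possible shape is a
Learning To Rank feature store where the category signal and the vector
similarity are kept as separate features. This is only a sketch: the feature
names, the category field and the ${userCategory}/${queryVector} external
feature info parameters are illustrative placeholders, and it is worth
verifying that the knn parser behaves well inside an LTR feature on your
Solr version.

[
  {
    "name": "categoryMatch",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": { "q": "{!terms f=category}${userCategory}" }
  },
  {
    "name": "vectorSimilarity",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": { "q": "{!knn f=vector topK=100}${queryVector}" }
  }
]

That way a trained model can weigh the taxonomy signal and the embedding
similarity independently, instead of having the taxonomy baked into the
vector itself.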

>
> 3. Regarding vector generation, which open-source model do you recommend?
> I have used BERT, which is not accurate in some cases.
>

*Ale*: that's not an easy question to answer: it depends on the language,
domain, use case, etc.
My team and I normally spend a few days on these sorts of investigations
for our clients, so I can't really give you a quick answer, but maybe this
blog post will give you some ideas:
https://sease.io/2023/06/how-to-choose-the-right-large-language-model-for-your-domain-open-source-edition.html


>
> Thanking you with all my heart. I will wait for your answers.
>
>
> Thanks & regards
> Kumar Gaurav
>
>
> On Fri, 26 Jan 2024 at 23:46, Alessandro Benedetti <a.benede...@sease.io>
> wrote:
>
> > Hi Kumar,
> > Knn search in Apache Solr doesn't support any min-threshold parameter.
> > To be honest, even if it did, you wouldn't be in a much better position:
> > your perceived relevance won't necessarily match the 0-1 cosine similarity
> > between your query and your vectors, and what you consider highly relevant
> > may have a score of 0.35 for one query and 0.96 for another.
> > Having such a parameter just delegates to the user the pain of setting up
> > a useful threshold, which, trust me, is not an easy (or maybe even doable)
> > job.
> >
> > It's on my roadmap to add a sort of auto-cut functionality based on the
> > document score, and Lucene has also added a threshold-based search (which
> > we may or may not port to Apache Solr).
> > In the meantime, you can play with Hybrid Search (which will also be
> > improved in the future):
> > https://sease.io/2023/12/hybrid-search-with-apache-solr.html
> >
> > Cheers
> >
> > --------------------------
> > *Alessandro Benedetti*
> > Director @ Sease Ltd.
> > *Apache Lucene/Solr Committer*
> > *Apache Solr PMC Member*
> >
> > e-mail: a.benede...@sease.io
> >
> >
> > *Sease* - Information Retrieval Applied
> > Consulting | Training | Open Source
> >
> > Website: Sease.io <http://sease.io/>
> > LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> > <https://twitter.com/seaseltd> | Youtube
> > <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
> > <https://github.com/seaseltd>
> >
> >
> > On Fri, 26 Jan 2024 at 17:01, Charlie Hull <
> > ch...@opensourceconnections.com>
> > wrote:
> >
> > > Hi Kumar,
> > >
> > > kNN will return the k closest vectors, which as you've found out may
> > > not be very close at all. Most of the approaches we're seeing as we work
> > > with e-commerce clients involve combining kNN with a standard, lexical
> > > search in some way - combining the results from both, or using one to
> > > boost certain results. You might find this blog useful as it discusses
> > > some strategies for coping with what you've found:
> > >
> > > https://opensourceconnections.com/blog/2023/03/22/building-vector-search-in-chorus-a-technical-deep-dive/
> > >
> > > best
> > >
> > > Charlie
> > >
> > >
> > > On 26/01/2024 12:18, kumar gaurav wrote:
> > > > Hi Srijan
> > > >
> > > > Thanks for replying.
> > > >
> > > > I am using the open-source BERT model to generate vectors. Are you
> > > > aware of any minimum-similarity threshold parameter in the knn parser?
> > > >
> > > > I am working with an e-commerce dataset, so I am getting the same
> > > > non-relevant results and the same score whenever I use an invalid
> > > > search token that is not present in my index.
> > > >
> > > > I want to apply some kind of minimum similarity threshold so I can
> > > > throw out the outliers and get only the nearest documents.
> > > >
> > > >
> > > >
> > > > On Fri, 26 Jan 2024 at 17:05, Srijan <shree...@gmail.com> wrote:
> > > >
> > > >> I have been testing dense vector search on Solr and it's been working
> > > >> great for me so far. Mine is an image search use case using OpenAI's
> > > >> CLIP model, but the configurations are pretty much the same as yours.
> > > >> What embedding model are you using? And can you share a portion of the
> > > >> actual query?
> > > >>
> > > >> On Fri, Jan 26, 2024 at 6:16 AM kumar gaurav <kg2...@gmail.com>
> > wrote:
> > > >>
> > > >>> Hi Everyone
> > > >>>
> > > >>> I am using vector search in Solr 9.4. I am using cosine similarity
> > > >>> with the knn parser.
> > > >>>
> > > >>> Same as the documentation:
> > > >>>
> > > >>> https://solr.apache.org/guide/solr/latest/query-guide/dense-vector-search.html
> > > >>> Schema
> > > >>> <fieldType name="knn_vector" class="solr.DenseVectorField"
> > > >>> vectorDimension="768" similarityFunction="cosine"/>
> > > >>> <field name="vector" type="knn_vector" indexed="true" stored="true"/>
> > > >>>
> > > >>> Query
> > > >>> q={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]
> > > >>>
> > > >>> The problem is that it always returns docs, even if they are not
> > > >>> relevant. Even if I use the keyword xyz, the knn parser returns
> > > >>> documents, which is useless. I want to control the similarity of the
> > > >>> documents: I need highly similar documents only. Does Solr have any
> > > >>> parameter in the knn parser which controls the similarity threshold?
> > > >>>
> > > >>> *How can I control the minimum similarity threshold with the knn
> > > >>> parser?*
> > > >>>
> > > >>> Please help. Thanks in advance.
> > > >>>
> > > >>>
> > > >>> --
> > > >>> Thanks & Regards
> > > >>> Kumar Gaurav
> > > >>>
> > > --
> > > Charlie Hull - Managing Consultant at OpenSource Connections Limited
> > > Founding member of The Search Network and co-author of Searching the
> > > Enterprise
> > > tel/fax: +44 (0)8700 118334
> > > mobile: +44 (0)7767 825828
> > >
> > > OpenSource Connections Europe GmbH | Pappelallee 78/79 | 10437 Berlin
> > > Amtsgericht Charlottenburg | HRB 230712 B
> > > Geschäftsführer: John M. Woodell | David E. Pugh
> > > Finanzamt: Berlin Finanzamt für Körperschaften II
> > >
> > >
> >
>
