Re: Considering SOLR as our new infra

Alessandro Benedetti Mon, 16 Aug 2021 03:32:36 -0700

Hi Albert,
on top of the very good answers already in the thread, my replies inline:

*1. Can we do text search and vector similarity?*
Lucene can do vector similarity, and you can achieve the same with Solr,
with some caveats.
Direct and full support is still a work in progress; here are some
resources for you:
*London Information Retrieval Meetup*
We discussed the topic a few months ago at the London Information Retrieval
Meetup:
https://www.slideshare.net/SeaseLtd/interactive-questions-and-answers-london-information-retrieval-meetup
https://www.youtube.com/watch?v=BIILaSb4aRY&t=259s
*Blogs*
I started a series of blog posts on the topic; so far only the intro is out:
https://sease.io/2021/07/artificial-intelligence-applied-to-search-introduction.html
By the end of the summer I am planning to write the Lucene, Solr
and Elasticsearch episode.
*Training*
We are also hosting a related training in October; I'll take the chance to
link it in case you find it useful:
https://sease.io/training/artificial-intelligence-in-search-training

*2. Can we filter by metadata?*
Yes, much like Elasticsearch, with a query (scored) and a filter
query (unscored).
It's a big topic though; take a look at the standard query parser to get
an idea:
https://solr.apache.org/guide/8_9/the-standard-query-parser.html
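
As a rough sketch of the query / filter query split, this is how the request parameters could be put together. The collection name and the field names (title, category, year) are made up for illustration; only q, fq and fl are actual Solr parameters.

```python
from urllib.parse import urlencode

# Hypothetical collection and field names, for illustration only.
params = [
    ("q", "title:solr"),          # main query: contributes to the score
    ("fq", "category:search"),    # filter query: restricts results, not scored
    ("fq", "year:[2020 TO *]"),   # multiple fq parameters are intersected
    ("fl", "id,title,score"),     # fields to return
]

query_string = urlencode(params)
print("/solr/mycollection/select?" + query_string)
```

Each fq is cached independently of the main query, so metadata filters that repeat across requests are a natural fit for it.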

*3. How about index/memory consumption? 1st tier needs around
4000M embeddings vector (128 fp32) + metadata stored in memory*
There is no quick silver-bullet answer for this: you need to be much deeper
into the project, and then build a prototype and benchmarking infrastructure
that can give you the answers.
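
Just to size the problem, a back-of-the-envelope calculation of the raw vector payload, using only the numbers from the question. It assumes nothing about how Lucene or Solr would store the vectors, and it ignores index structures, metadata and JVM overhead, so treat it as a lower bound and not as an estimate of real consumption.

```python
# Numbers taken from the question; everything else is plain arithmetic.
num_vectors = 4_000_000_000   # "4000M" embeddings
dimensions = 128              # values per embedding
bytes_per_value = 4           # fp32

raw_bytes = num_vectors * dimensions * bytes_per_value
print(f"{raw_bytes / 1e12:.3f} TB of raw float data")
print(f"{raw_bytes / 2**40:.2f} TiB of raw float data")
```

That is roughly 2 TB before any index or metadata, so this would have to be spread across many nodes, which is why a prototype and a benchmark are needed.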

*4. Can we execute models in the DB itself? (not outside SOLr). We
have per-user models, and we need a way of executing TensorFlow models on
the database to prevent moving data outside of the DB*
The closest you can get is the Learning To Rank integration.
Apache Solr supports linear models, tree-based models, and neural network
models.
You need to train your model, export it in the supported JSON format and
then use it:
https://solr.apache.org/guide/8_9/learning-to-rank.html
We have written many blogs on the topic:
https://sease.io/category/learning-to-rank
https://sease.io/2016/10/apache-solr-learning-to-rank-better-part-4.html
We also have a dedicated training:
https://sease.io/training/learning-to-rank-training
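
To give an idea of the "export it in the supported JSON format and then use it" step, here is a minimal sketch of a linear model definition and the re-rank parameter that refers to it. The model name, feature names and weights are invented for illustration, and the features are assumed to be defined in the feature store already; please check the Learning To Rank guide linked above for the exact upload steps.

```python
import json

# Hypothetical model name, feature names and weights, for illustration only.
model = {
    "class": "org.apache.solr.ltr.model.LinearModel",
    "name": "myLinearModel",
    "features": [{"name": "originalScore"}, {"name": "recencyBoost"}],
    "params": {"weights": {"originalScore": 0.7, "recencyBoost": 0.3}},
}

# JSON body to upload to the collection's model store.
model_json = json.dumps(model, indent=2)

# Re-rank query parameter (rq) that applies the model to the top 100 documents.
rerank_param = "{!ltr model=%s reRankDocs=100}" % model["name"]

print(model_json)
print("rq=" + rerank_param)
```

Note that the model runs inside Solr at re-ranking time, but it is trained outside and Solr is not executing TensorFlow itself.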

*5. Subsecond queries*
You are generally well under a second, even when integrating complex
learning to rank models.
The more complex your matching and ranking algorithm, the slower the query
(but in general Apache Solr is very fast and you shouldn't have problems).

*6. Real-time indexing (or near real-time) of new data*
Since soft commits were introduced (many years ago), Apache Solr has been
quite good at this.
https://solr.apache.org/guide/8_9/updatehandlers-in-solrconfig.html
https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
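
As a small sketch of near real-time indexing from the client side, an update request can carry the commitWithin parameter, which asks Solr to make the documents searchable within the given number of milliseconds. The collection name and the document are made up, and the right interval (or whether to rely on autoSoftCommit in solrconfig.xml instead) depends on your setup.

```python
import json
from urllib.parse import urlencode

# Hypothetical collection name and document, for illustration only.
docs = [{"id": "doc-1", "title": "hello"}]

# commitWithin is expressed in milliseconds.
url = "/solr/mycollection/update?" + urlencode({"commitWithin": 1000})
body = json.dumps(docs)

print("POST", url)
print(body)
```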

*7. Easily scalable*
You have this covered:
https://solr.apache.org/guide/8_9/solrcloud.html

Good Luck!

--------------------------
Alessandro Benedetti
Apache Lucene/Solr Committer
Director, R&D Software Engineer, Search Consultant

www.sease.io

On Fri, 13 Aug 2021 at 17:33, Jan Høydahl <jan....@cominvent.com> wrote:

> I know you are in the Solr forum here, but I'll take the chance of
> mentioning the new kid on the block wrt open source search engines, namely
> Vespa. Since your use case seems to be highly geared towards
> personalization, it may be worth checking it out as they seem to push
> Tensors and personalized results as key differentiator. It is not Lucene
> based and may be quite different from what you already know with ES and
> Solr, and to be honest I have never tested it, nor am I affiliated in any
> way. Here's the link: https://vespa.ai/
>
> Jan
>
> > 13. aug. 2021 kl. 16:26 skrev Albert Dfm <alberich...@gmail.com>:
> >
> > For example, for relevance ranking the usual approach is to execute a
> > machine learned model, e.g. using xgboost, or lightgbm. Tensorflow  and
> > pytorch are other frameworks to build machine learning models.
> > While xgboost and lightgbm are ensembles of decision trees, tensorflow and
> > pytorch are mainly related to neural networks.
> >
> > Elasticsearch allows executing xgboost models, for example, for relevance
> > ranking.
> > The question could be applied similarly to SOLr: can we use pytorch or
> > tensorflow at relevance ranking phase?
> >
> >
> >
> > On Fri, Aug 13, 2021 at 4:18 PM Shawn Heisey <apa...@elyograg.org> wrote:
> >
> >> On 8/13/2021 7:59 AM, Albert Dfm wrote:
> >>> Regarding executing models (question number 4), let me explain this a
> >>> bit better:
> >>> Can SOLr run custom tensorflow/pytorch models? This is not a feature in
> >>> lucene, it is something on top of it.
> >>
> >> With that info, I am even less familiar with what you're doing than I
> >> was before.  I have no idea what either of those things are.  Google
> >> wasn't helpful ... I probably would have to spend a week or two
> >> researching to even have a minimal understanding.  I was able to tell
> >> that it's probably related to machine learning, but that's all.  I have
> >> zero experience in that arena.
> >>
> >> It's unlikely that Solr has any direct support for those software
> >> programs, but if they can build queries that Solr understands, you could
> >> probably get something going.
> >>
> >> Thanks,
> >> Shawn
> >>
> >>
>
>
