Re: Considering SOLR as our new infra

Albert Dfm Fri, 13 Aug 2021 06:59:50 -0700

Thanks a lot Shawn for the very detailed reply, very informative and much
appreciated!!
I will check the link for performance problems.


Regarding executing models (question number 4), let me explain this a bit
better:
Can Solr run custom TensorFlow/PyTorch models? This is not a feature of
Lucene; it would be something built on top of it.

Thanks!!


On Fri, Aug 13, 2021 at 2:44 PM Shawn Heisey <apa...@elyograg.org> wrote:

> On 8/13/2021 2:25 AM, Albert Dfm wrote:
> > We recently learned about Solr, and we are very excited about using it to
> > replace our current Elasticsearch infra. Currently, our main issue is the
> > data and model size running on each machine.
> >
> > *Our setup:*
> > 1. We use the following search arch: 1st tier, the fast search (low
> > response time) with most likely data to be retrieved,
> > 2. 2nd tier with the rest (including on-disk data)
> >
> > We looked at the full feature list on the Solr webpage, and we would like
> > to ask about it. More specifically, we would like to know:
> > 1. Can we do text search and vector similarity?
> > 2. Can we filter by metadata?
> > 3. How about index/memory consumption? 1st tier needs around 4000M
> > embedding vectors (128 fp32) + metadata stored in memory
> > 4. Can we execute models in the DB itself (not outside Solr)? We have
> > per-user models, and we need a way of executing TensorFlow models on the
> > database to prevent moving data outside of the DB
> > 5. Subsecond queries
> > 6. Real-time indexing (or near real-time) of new data
> > 7. Easily scalable
>
>
> As Solr and ES both use Lucene for the vast majority of their
> functionality, they have nearly identical overall capabilities. If ES
> can do it, Solr most likely can too.  If the configs are nearly the
> same, Solr and ES will have similar performance.
>
> Number 3: The bottom line here is that we do not know, and we can't
> know.  Any guess made by us about Solr, or by the ES team about ES, would
> be just that -- a guess.  What works for one user with an index of a
> particular size might be way too low or way too high for another user
> with a similar size index.  When we guess, we're always going to err on
> the side of caution -- recommend significantly more resources than what
> might actually be required, so we know there will be enough.  And we
> generally need a lot of information that you might not have yet in order
> to make a guess.  If it works in ES with X amount of resources, it will
> probably also work in Solr with those resources too -- assuming that the
> configs are substantially similar.  In example configs, Solr tends to
> have a lot more features enabled than ES does, which is one reason that
> ES can claim that they perform better "out of the box".  When the
> configs are actually similar, performance tends to be similar.
>
>
> https://lucidworks.com/post/solr-sizing-guide-estimating-solr-sizing-hardware/
>
> First, items 1 and 2 of your setup (the two tiers): You could set up
> different indexes for this purpose.
> Solr doesn't provide a way to automatically move older data from one
> index to another.  You would have to do that in your indexing software.
> For time-series data (think logs or similar), SolrCloud has the "Time
> Routed Aliases" feature -- it creates a new collection for the most
> recent data, and then later another new collection will be created.  I
> have never used the feature, though I do understand the concept.
>
> 1: Text search, definitely.  Vector similarity, probably ... but because
> I do not know what this is, I do not want to say the answer is
> definitely yes.  Solr provides a way to utilize Lucene TermVectors.
> 2: Generally, yes.  How you set up the schema and the nature of the data
> will determine exactly what you can do with filters. This would be the
> case for ES too.
> 3: See above.
> 4: I have no idea what you mean by this.  But as I have said before, if
> ES can do it, Solr probably can too.
> 5: If you have enough resources, particularly memory, Solr performs
> great.  If the index is REALLY big, it might be difficult to arrange to
> have enough unallocated memory for the OS to reliably cache the index.
> Neither Solr nor ES does that caching itself; both rely on the OS to
> handle it.
> 6: Faster indexing generally means taking a hit on query performance
> whenever you update the index and commit changes. This would be the case
> for ES too.
> 7: This is such a vague question that I cannot answer it without knowing
> EXACTLY what you mean.
>
> Additional reading (disclaimer: I wrote this wiki page):
>
> https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems
>
> Thanks,
> Shawn
>
>
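
For questions 1 and 2 in the quoted list (text search plus filtering by metadata), a request against Solr's standard select handler combines a main query (`q`) with one or more filter queries (`fq`). The following is only a minimal sketch of how such a URL could be built: the host, the collection name `docs`, and the field names `title`, `lang`, and `year` are hypothetical and would depend on the actual schema.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, query, filters, rows=10):
    """Build a Solr /select URL with a main query and filter queries.

    base_url, collection and the field names used by callers are
    assumptions for illustration; they are not taken from the thread.
    """
    params = [("q", query), ("rows", str(rows))]
    # Each metadata restriction is sent as its own fq parameter.
    params += [("fq", f) for f in filters]
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


url = build_select_url(
    "http://localhost:8983",   # hypothetical host
    "docs",                    # hypothetical collection
    "title:solr",              # text search on a hypothetical field
    ["lang:en", "year:[2020 TO *]"],  # hypothetical metadata filters
)
print(url)
```

Keeping each restriction in a separate `fq` parameter leaves the main query free for the text search itself; which filters are actually possible depends on the schema, as noted in the reply above.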

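For question 3 in the quoted list, the raw size of the vectors alone can be estimated with simple arithmetic. This sketch assumes that "4000M" means 4,000 million vectors and that "128 fp32" means 128 float32 values (4 bytes each) per vector. It deliberately ignores metadata, index structures, JVM heap, and OS cache needs, so it is only a lower bound and not a sizing recommendation.

```python
def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold the raw vector values, with no overhead."""
    return num_vectors * dims * bytes_per_value


num_vectors = 4_000_000_000  # assumption: "4000M" = 4,000 million vectors
dims = 128                   # assumption: 128 fp32 values per vector

total = raw_vector_bytes(num_vectors, dims)
print(f"{total} bytes")
print(f"{total / 1e12:.3f} TB (decimal)")
print(f"{total / 2**40:.2f} TiB (binary)")
```

Under those assumptions the raw vectors come to roughly 2 TB before any metadata or index overhead, which is why the first tier would almost certainly have to be spread across several machines.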