On 8/13/2021 2:25 AM, Albert Dfm wrote:
We recently learned about Solr, and we are very excited about it as a
replacement for our current Elasticsearch infrastructure. Currently, our main
issue is the size of the data and models running on each machine.
*Our setup:*
1. We use the following search architecture: a 1st tier for fast search (low
response time), holding the data most likely to be retrieved,
2. and a 2nd tier with the rest (including on-disk data).
We saw all the features listed on the Solr webpage and would like to ask
about them. More specifically, we would like to know:
1. Can we do text search and vector similarity?
2. Can we filter by metadata?
3. How about index/memory consumption? The 1st tier needs around 4000M
embedding vectors (128-dimensional fp32) plus metadata stored in memory
4. Can we execute models in the DB itself (not outside Solr)? We have
per-user models, and we need a way of executing TensorFlow models inside the
database so we do not have to move data out of the DB
5. Subsecond queries
6. Real-time indexing (or near real-time) of new data
7. Easily scalable
As Solr and ES both use Lucene for the vast majority of their
functionality, they have nearly identical overall capabilities. If ES
can do it, Solr most likely can too. If the configs are nearly the
same, Solr and ES will have similar performance.
Number 3: The bottom line here is that we do not know, and we can't
know. Any guess made by us about Solr or the ES team about ES would be
just that -- a guess. The resources that work for one user with an index
of a particular size might be far too little or far too much for another
user with a similarly sized index. When we guess, we're always going to err on
the side of caution -- recommend significantly more resources than what
might actually be required, so we know there will be enough. And we
generally need a lot of information that you might not have yet in order
to make a guess. If it works in ES with X amount of resources, it will
probably also work in Solr with those resources too -- assuming that the
configs are substantially similar. In example configs, Solr tends to
have a lot more features enabled than ES does, which is one reason that
ES can claim that they perform better "out of the box". When the
configs are actually similar, performance tends to be similar.
https://lucidworks.com/post/solr-sizing-guide-estimating-solr-sizing-hardware/
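That said, your own numbers for the 1st tier give a useful lower bound that
is independent of the search engine. A rough back-of-envelope calculation,
assuming 4 bytes per fp32 value and ignoring index structures, metadata, and
JVM overhead entirely:

    4,000,000,000 vectors x 128 dimensions x 4 bytes = 2,048,000,000,000 bytes
                                                       (roughly 2 TB, about 1.9 TiB)

Raw vector data of that size will not fit in the memory of a typical single
machine, so whichever engine you choose, you would likely need to spread it
across multiple machines.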
Your setup items 1 and 2 (the two tiers): You could set up separate indexes
for this purpose.
Solr doesn't provide a way to automatically move older data from one
index to another. You would have to do that in your indexing software.
For time-series data (think logs or similar), SolrCloud has the "Time
Routed Aliases" feature -- it creates a new collection for the most
recent data, and then later another new collection will be created. I
have never used the feature, though I do understand the concept.
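For reference, a time routed alias is set up through the Collections API
rather than in solrconfig.xml. A minimal sketch only -- the alias name,
router field, interval, and shard count below are placeholders you would
replace with your own (note the '+' in the interval must be URL-encoded as
%2B):

    curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=events&router.name=time&router.field=timestamp_dt&router.start=NOW/DAY&router.interval=%2B1DAY&create-collection.collection.configName=_default&create-collection.numShards=2'

You then index and query against the alias name ("events" here), and Solr
creates new underlying collections as documents arrive for newer time
periods.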
1: Text search, definitely. Vector similarity, probably ... but because
I do not know what this is, I do not want to say the answer is
definitely yes. Solr provides a way to utilize Lucene TermVectors.
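On the term vectors point: storing and retrieving them has to be enabled per
field in the schema, and the sample configs wire up a handler for the
TermVectorComponent. A rough sketch, with a hypothetical field and collection
name:

    <!-- schema: store term vectors for this field -->
    <field name="body_txt" type="text_general" indexed="true" stored="true"
           termVectors="true" termPositions="true" termOffsets="true"/>

    # query: ask the /tvrh handler to return term vectors
    curl 'http://localhost:8983/solr/mycollection/tvrh?q=*:*&tv=true&tv.fl=body_txt&tv.tf=true'

Note that Lucene term vectors (per-document term frequency and position data)
are not the same thing as the fp32 embedding vectors you describe in point 3,
so you would want to confirm which of the two you actually need.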
2: Generally, yes. How you set up the schema and the nature of the data
will determine exactly what you can do with filters. This would be the
case for ES too.
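As an illustration of what that typically looks like (the field names here
are hypothetical -- yours depend entirely on your schema), metadata filtering
in Solr is done with filter queries alongside the main query:

    q=body_txt:solr
    &fq=user_id_s:12345
    &fq=category_s:(news OR blog)
    &fq=created_dt:[NOW-30DAYS TO NOW]

Each fq clause is cached separately in the filter cache, so filters that
repeat across many queries are cheap after the first use.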
3: See above.
4: I have no idea what you mean by this. But as I have said before, if
ES can do it, Solr probably can too.
5: If you have enough resources, particularly memory, Solr performs
great. If the index is REALLY big, it might be difficult to arrange to
have enough unallocated memory for the OS to reliably cache the index.
Neither Solr nor ES does that caching itself; they both rely on the OS to
handle it.
6: Faster indexing generally means taking a hit on query performance
whenever you update the index and commit changes. This would be the case
for ES too.
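To illustrate that trade-off: the knobs involved are the autoCommit and
autoSoftCommit settings in solrconfig.xml. A sketch only -- the intervals
below are placeholders, not recommendations:

    <autoCommit>
      <maxTime>60000</maxTime>          <!-- hard commit every 60s: flush to disk -->
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>5000</maxTime>           <!-- soft commit every 5s: new docs become searchable -->
    </autoSoftCommit>

The shorter the soft commit interval, the sooner new documents are visible,
but every commit that opens a new searcher throws away the caches and costs
some query performance while they warm back up.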
7: This is such a vague question that I cannot answer it without knowing
EXACTLY what you mean.
Additional reading (disclaimer: I wrote this wiki page):
https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems
Thanks,
Shawn