Re: Cassandra + Solr

Jack Krupansky Sat, 04 Oct 2014 08:01:00 -0700

The "requirement" is only that the Lucene (Solr) index fit in system memory that the OS uses for file caching. SSDs are another matter. If somebody or some information source is telling you that the index must fit "in heap", please identify the source so that we can correct them.

If the index doesn't fit in the OS file system cache then you risk becoming I/O bound as Lucene does I/O to access various portions of the index.

But it all depends on your access patterns. I mean, if you typically aren't accessing all of your fields, then not all of the index will need to flow through memory or the file system cache.

But as a general proposition, plan on having enough system memory to "cache" your Lucene/Solr index. SSDs... are a bit different, but I'd still want most of the index to fit in RAM.
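
A rough sketch of that sizing rule, for illustration only. The function names and the example figures are hypothetical, and the calculation ignores JVM off-heap overhead and memory the OS needs for itself:

```python
def page_cache_budget_gb(total_ram_gb, jvm_heap_gb, other_gb=0.0):
    """RAM left for the OS file system cache after JVM heaps and other processes."""
    return max(total_ram_gb - jvm_heap_gb - other_gb, 0.0)


def index_fits_in_cache(index_gb, total_ram_gb, jvm_heap_gb, other_gb=0.0):
    """True if the whole Lucene/Solr index could sit in the OS file system cache."""
    return index_gb <= page_cache_budget_gb(total_ram_gb, jvm_heap_gb, other_gb)


# Hypothetical node: 32 GB of RAM serving a 20 GB index.
# A large heap leaves little room for file caching; a smaller heap leaves more.
print(index_fits_in_cache(20, total_ram_gb=32, jvm_heap_gb=28))
print(index_fits_in_cache(20, total_ram_gb=32, jvm_heap_gb=8))
```

The point of the comparison is that memory given to the heap is memory taken away from the file system cache, which is where the index actually needs to fit.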


-- Jack Krupansky

-----Original Message-----
From: Robert Wille
Sent: Saturday, October 4, 2014 9:27 AM
To: user@cassandra.apache.org
Subject: Cassandra + Solr

I am architecting a solution for moving a large number of documents out of our MySQL database to C*. We use Solr to index these documents. I’ve recently become aware of a few different packages that integrate C* and Solr. At first blush, this seems like the perfect fit, as it would eliminate a complicated and somewhat fragile system that manages the indexing of our documents. However, I have a significant concern that would certainly be a show-stopper, unless I’m mistaken about some assumptions I’ve made. If any of you can confirm that my concern is justified, or let me know where I’m wrong, I’d greatly appreciate it.

So here’s what I think will be an issue. The guiding principle in setting up our current Solr cluster (which was done before my time) is that the index has to fit in the heap. Each node in our current Solr cluster has 32 GB of RAM and runs with a 28 GB heap, and serves up less than 28 GB of index. When I think about combining Solr and C*, it would seem that I’d need to cough up 8 GB for C*, leaving 20 GB for Solr. The entire Solr index and the entire MySQL database are roughly the same size. If I assume that the data will consume roughly the same space on C* as it does on MySQL, then each C* node would be limited to roughly the same amount of data as the Solr index, or about 20 GB. If I have a replication factor of 3, then I need a node for each 7 GB of data. That is obviously a problem. We have about 1 TB of data. It would take about 150 nodes.
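
The node-count estimate above can be checked with a short calculation. This is only a sketch of the arithmetic in the message; the function name and the rounding of 1 TB to 1000 GB are assumptions made for illustration:

```python
import math


def nodes_needed(total_data_gb, per_node_capacity_gb, replication_factor):
    """Nodes required when each node holds at most per_node_capacity_gb,
    replicas included."""
    # Total stored bytes = unique data * replication factor, spread over nodes.
    return math.ceil(total_data_gb * replication_factor / per_node_capacity_gb)


# Figures from the message: about 1 TB of data, about 20 GB per node, RF = 3.
# Unique data per node is 20 / 3, or roughly 7 GB.
print(nodes_needed(1000, 20, 3))
```

The same function shows how quickly the count falls once the per-node limit is no longer tied to the heap size.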

Is C*/Solr integration only feasible for applications for which a small subset of the data is indexed in Solr, or am I mistaken about the requirement that the Solr index be able to fit in heap?


Thanks in advance

Robert
