I am architecting a solution for moving a large number of documents out of our 
MySQL database to C*. We use Solr to index these documents. I’ve recently 
become aware of a few different packages that integrate C* and Solr. At first 
blush, this seems like the perfect fit, as it would eliminate a complicated and 
somewhat fragile system that manages the indexing of our documents. However, I 
have a significant concern that would certainly be a show-stopper, unless I’m 
mistaken about some assumptions I’ve made. If any of you can confirm that my 
concern is justified, or let me know where I’m wrong, I’d greatly appreciate it.

So here’s what I think will be an issue. The guiding principle in setting up 
our current Solr cluster (which was done before my time) is that the index has 
to fit in the heap. Each node in our current Solr cluster has 32 GB of RAM and 
runs with a 28 GB heap, and serves up less than 28 GB of index. When I think 
about combining Solr and C*, it would seem that I’d need to cough up 8 GB for 
C*, leaving 20 GB for Solr. The entire Solr index and the entire MySQL database 
are roughly the same size. If I assume that the data will consume roughly the 
same space on C* as it does on MySQL, then each C* node would be limited to 
roughly the same amount of data as its Solr index, or about 20 GB. With a 
replication factor of 3, that works out to one node for every 7 GB or so of 
unique data. That is obviously a problem. We have about 1 TB of data. It would 
take about 150 nodes.
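
To make the arithmetic explicit, here is a rough sketch of the estimate. It rests on the same assumptions as above (the index must fit in the Solr heap, and the data takes about as much space on C* as the index does), and the function and variable names are just ones I made up for illustration:

```python
import math

def nodes_needed(total_data_gb, index_budget_gb, replication_factor):
    """Estimate node count if each node may hold only index_budget_gb of data."""
    # Every unique GB is stored replication_factor times across the cluster.
    replicated_gb = total_data_gb * replication_factor
    return math.ceil(replicated_gb / index_budget_gb)

current_heap_gb = 28      # heap on each existing Solr node
cassandra_heap_gb = 8     # assumed share given up to C*
solr_budget_gb = current_heap_gb - cassandra_heap_gb   # 20 GB left for Solr

# Unique (pre-replication) data per node at a replication factor of 3
print(round(solr_budget_gb / 3, 1))           # 6.7

# About 1 TB of data, taken here as 1000 GB
print(nodes_needed(1000, solr_budget_gb, 3))  # 150
```

If the index does not actually have to fit in the heap, the 20 GB budget is the number that changes, and the node count shrinks with it.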

Is C*/Solr integration only feasible for applications for which a small subset 
of the data is indexed in Solr, or am I mistaken about the requirement that the 
Solr index be able to fit in heap?

Thanks in advance

Robert
