I am working on a hardware requirements estimate for a project that mainly 
uses Apache Cassandra.

Because of the rule of thumb that data size per Cassandra node should stay 
below ~2 TB, the total disk usage determines the number of nodes, and in most 
cases the node count derived this way is also enough to satisfy the required 
write rate.
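For example, a back-of-envelope node-count sketch (the 2 TB cap, the 13.5 TB 
figure, and RF=3 here are placeholder assumptions; if your measured number 
already includes replicas, e.g. du summed across a test cluster running with 
the production replication factor, drop the RF multiplier):

    # 13.5 TB of unreplicated data, RF=3, ~2 TB usable per node
    awk -v raw=13.5 -v rf=3 -v cap=2 \
      'BEGIN { n = raw*rf/cap; print "nodes needed: " (n > int(n) ? int(n)+1 : n) }'
    # -> nodes needed: 21

In practice you would also leave headroom for compaction, so the real 
per-node cap is lower than the raw disk size.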

So IMHO storage estimation is the most important part of requirements 
analysis in this kind of project.

There are some formulas on the net for theoretical storage estimation, but 
they come out at a few KB per row, while actual inserts show only a few 
hundred bytes per row!
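For what it's worth, that gap has an explanation: most of those formulas date 
from the pre-3.0 storage engine, which stored the column name with every 
cell, and they ignore compression. A sketch of that old-style estimate (the 
cell count and sizes below are made-up placeholders):

    # Pre-3.0 style estimate for one row (placeholder sizes, in bytes):
    # each cell ~ column_name + value + 15 bytes overhead, plus 23 bytes per row
    awk 'BEGIN {
        cells = 20; name_len = 12; value_len = 50
        row = cells * (name_len + value_len + 15) + 23
        printf "theoretical row size: %d bytes\n", row    # -> 1563 bytes
    }'

Since Cassandra 3.0 the column names are no longer repeated per cell, and 
SSTables are compressed with LZ4 by default, so a few hundred bytes per row 
on disk is plausible even where the formula says a few KB.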

So it seems the best estimate would come from inserting a lot of real data 
into the real schema on a real production cluster.

But I can't have the real data or the production cluster before making the 
estimate!

So I came up with an estimation approach:



1. I use the real schema on a cluster of more than 3 nodes.

2. Required assumptions: the real input rate (200K writes per second, about 
150 billion rows in total) and the real partition count (1.5 million unique 
partition keys in total).

3. Instead of 150 billion writes, I do runs of 1, 10, and 100 million writes, 
scaling the partition count proportionally to 10, 100, and 1,000 partitions 
(so each partition holds ~100K rows, as in production). After each run I call 
'nodetool flush', check the total disk usage of the keyspace with 
du -sh keyspace_dir, and extrapolate linearly: for the 1-million-row run, 
disk usage was 90 MB, so 150 billion rows would take roughly 13 TB. Then I 
drop the schema and start the next run.

I continue this until the difference between two consecutive extrapolations 
becomes negligible.
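Roughly, each run looks like the sketch below. load_data.sh is a placeholder 
for whatever writes N rows into M partitions (a cassandra-stress profile, a 
small app, etc.), and the node names, keyspace, and data path are assumptions 
for illustration:

    #!/usr/bin/env bash
    KS=my_keyspace                      # placeholder keyspace
    TOTAL_ROWS=150000000000             # production target: 150 billion rows
    for ROWS in 1000000 10000000 100000000; do
        PARTS=$((ROWS / 100000))        # keep ~100K rows/partition, as in prod
        ./load_data.sh "$KS" "$ROWS" "$PARTS"   # placeholder loader (recreates schema)
        BYTES=0
        for NODE in node1 node2 node3; do
            # flush memtables so the data is on disk as SSTables, then
            # sum the on-disk size of the keyspace across all nodes
            ssh "$NODE" nodetool flush "$KS"
            B=$(ssh "$NODE" du -sb "/var/lib/cassandra/data/$KS" | cut -f1)
            BYTES=$((BYTES + B))
        done
        # linear extrapolation to the full production volume
        awk -v b="$BYTES" -v r="$ROWS" -v t="$TOTAL_ROWS" \
            'BEGIN { printf "rows=%d -> estimate %.1f TiB\n", r, b/r*t/1024^4 }'
        cqlsh -e "DROP KEYSPACE $KS;"   # reset before the next, larger run
    done

One caveat: right after a flush the data may not be in its steady-state 
compacted form, so running 'nodetool compact' on the keyspace before 
measuring may bring the numbers closer to what a long-running production 
cluster shows.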

I got a good estimate at the 100-million run. In fact, I was doing the 
estimate for an already running production cluster, so I knew the answer 
beforehand (I just wanted to validate the idea), and the final estimate 
matched the real number! But I'm worried that this was accidental.

Finally, the question: is my estimation mechanism sound, and would it be 
applicable to any estimation task and any project?

If not, how should storage be estimated (how do you estimate it)?



Thanks in advance