I was going to estimate hardware requirements for a project that mainly uses Apache Cassandra.
Because of the rule of thumb that data per Cassandra node should stay below about 2 TB, the total disk usage determines the number of nodes, and in most cases the node count derived from storage is also enough to satisfy the required input rate. So IMHO storage estimation is the most important part of requirement analysis in this kind of project.

There are formulas on the net for theoretical storage estimation, but they come out at several KB per row, while actual inserts show only a few hundred bytes per row! So it seems the best estimate would come from inserting a lot of real data into the real schema on the real production cluster. But I can't have the real data or the production cluster before doing the estimation! So I came up with an estimation idea:

1. I use the real schema on a cluster of more than 3 nodes.
2. Required assumptions: the real input rate (200K writes per second, about 150 billion rows in total) and the real partition count (about 1.5 million unique partition keys in total).
3. Instead of 150 billion rows, I write 1, 10 and 100 million rows, using 10, 100 and 1,000 partitions respectively, so the rows-per-partition ratio (about 100,000 rows per partition) stays the same as in production. After each run I do 'nodetool flush', check the total disk usage of the keyspace with 'du -sh <keyspace_dir>', and extrapolate linearly; for example, at 1 million rows the disk usage was 90 MB, so for 150 billion rows it would be about 13 TB. Then I drop the schema and run the next size. I keep going until the difference between two consecutive extrapolations is tiny; I got a stable estimate at 100 million rows.

Actually, I was doing this estimation for an already running production cluster and knew the answer beforehand (I just wanted to validate the idea), and in the end the estimate matched the real number. But I'm worried that this was accidental.

Finally, the question: is my estimation mechanism correct, and would it be applicable to any estimation and any project? If not, how do you estimate storage (how do you estimate)?

Thanks in advance
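P.S. To make the extrapolation step concrete, here is a minimal Python sketch of the arithmetic I described above, assuming disk usage scales linearly with row count. Only the 90 MB figure for 1 million rows is from my actual run; the 10 million and 100 million measurements below are made-up placeholders, and in practice each size would come from 'du' on the keyspace directory after 'nodetool flush'.

# Sketch of the linear extrapolation and convergence check described above.
# The measured sizes are placeholders; in practice they come from
# `du -s <keyspace_dir>` after `nodetool flush`.

TARGET_WRITES = 150_000_000_000  # ~150 billion rows expected in production

# (rows written in the test run, bytes on disk after flush)
samples = [
    (1_000_000,   90 * 1024**2),     # 90 MB measured at 1M rows (from my run)
    (10_000_000,  880 * 1024**2),    # placeholder value
    (100_000_000, 8_600 * 1024**2),  # placeholder value
]

previous_estimate = None
for rows, size_bytes in samples:
    bytes_per_row = size_bytes / rows
    estimate_tb = bytes_per_row * TARGET_WRITES / 1024**4
    print(f"{rows:>12,} rows -> {bytes_per_row:6.1f} B/row -> ~{estimate_tb:.1f} TB projected")
    if previous_estimate is not None:
        delta = abs(estimate_tb - previous_estimate) / previous_estimate
        print(f"    change vs. previous run: {delta:.1%}")
    previous_estimate = estimate_tb

If two consecutive projections differ by only a small amount, I treat the estimate as converged; that is the stopping rule I used when the 100-million-row run gave a stable number.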