Hi Onmestester,

A few comments inline:

> 1. I'm using the real schema + 3 nodes cluster

Since you are only interested in data usage, for simplicity you could use a single-node cluster (your computer) with RF=1. If your production cluster will use RF=3, you will just need to multiply the result by 3. This assumes that your data model distributes your partitions evenly across your cluster.

> 2. Required assumptions: Real input rate (200K per second, that would be
> 150 billion in total) and real partition count (unique keys in partitions:
> 1.5 million in total)
> 3. Instead of 150 billion, I'm doing 1, 10 and 100 million writes, so I
> would use 10, 100 and 1000 partitions proportionally! After each run, I
> would use 'nodetool flush'
> and using du -sh keyspace_dir, I would check the total disk usage of the
> rate, for example for rate 1 million, disk usage was 90 MB, so for 150
> billion it would be 13 TB. Then drop the schema and run the next rate.
> I would continue this until the difference between two consecutive results
> is a tiny number.
> I've got a good estimation at rate 100 million. Actually I was doing the
> estimation for an already running production cluster
> and I knew the answer beforehand (just wanted to be sure about the idea),
> and the estimation was equal to the answer in the end! But I'm worried that
> it was accidental.
> Finally the question: Is my estimation mechanism correct and would it be
> applicable for any estimation and any project?

Running a simulation like you are doing should give you a very good estimate. That looks correct to me, as long as you don't forget to clear the auto-snapshots after you drop your table ;o)

> If not, how to estimate storage (How you estimate)?

Often, the data usage is driven by a single table, and often by a single column in that table (e.g. a JSON text field of a few KB), in which case the math is very simple and safe to do by hand, and this gives a good start. Ideally, a simulation should be run, e.g. using cassandra-stress.
The goal is usually to confirm the throughput / latency. As a side effect, this also gives the disk usage.

Hope it helps!

Cheers,
Christophe

> Thanks in advance

--
Christophe Schmitz - VP Consulting
AU: +61 4 03751980 / FR: +33 7 82022899
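
For reference, the extrapolation discussed in this thread can be sketched in a few lines of Python. This is only a rough illustration, assuming disk usage grows linearly with row count and that sizes are in decimal units; the function names, the `replication_factor` parameter and the 5% tolerance are hypothetical choices, and only the 90 MB per 1 million rows figure comes from the thread.

```python
MB = 10**6
TB = 10**12


def extrapolate_bytes(sample_rows, sample_bytes, target_rows, replication_factor=1):
    """Linearly scale disk usage measured on a sample run to the target row count."""
    bytes_per_row = sample_bytes / sample_rows
    return bytes_per_row * target_rows * replication_factor


def has_converged(estimates, tolerance=0.05):
    """Return True when the last two estimates differ by at most `tolerance` (relative)."""
    if len(estimates) < 2:
        return False
    previous, latest = estimates[-2], estimates[-1]
    return abs(latest - previous) / latest <= tolerance


# Measurement quoted in the thread: 1 million rows used 90 MB (RF=1, after flush).
estimate = extrapolate_bytes(1_000_000, 90 * MB, 150_000_000_000)
print(f"{estimate / TB:.1f} TB")  # 13.5 TB, in line with the ~13 TB quoted above
```

Appending the estimate from each larger run (10 million, 100 million rows) to a list and calling `has_converged` on it mirrors the "stop when two consecutive results barely differ" rule; passing `replication_factor=3` gives the cluster-wide figure for RF=3.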