Hi Onmestester,

A few comments inline:

>
> 1. I'm using the real schema and a >3-node cluster
>

Since you are only interested in data usage, for simplicity you could use
a single-node cluster (your computer) with RF=1. If your production
cluster will use RF=3, you will just need to multiply by three. This assumes
that your data model distributes your partitions evenly across the cluster.



> 2. Required assumptions: real input rate (200K per second, which would be
> 150 billion writes in total) and real partition count (unique keys in
> partitions: 1.5 million in total).
> 3. Instead of 150 billion, I'm doing 1, 10 and 100 million writes, so I
> would use 10, 100 and 1000 partitions proportionally. After each run, I
> would run 'nodetool flush'
> and, using du -sh keyspace_dir, check the total disk usage for that
> rate; for example, for 1 million writes, disk usage was 90 MB, so for 150
> billion it would be 13 TB. Then drop the schema and run the next rate.
> I would continue this until the difference between two consecutive results
> was a tiny number.
> I got a good estimation at 100 million writes. Actually, I was doing the
> estimation for an already running production cluster,
> I knew the answer beforehand (I just wanted to validate the idea),
> and the estimation matched the answer in the end! But I'm worried that it
> was accidental.
> Finally, the question: is my estimation mechanism correct, and would it be
> applicable for any estimation and any project?
>

Running a simulation like you are doing should give you a very good
estimate; your approach looks correct to me, as long as you don't forget to
clear the auto-snapshots after you drop your table ;o)
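The scale-up arithmetic from the runs above can be sketched as follows (the function name is illustrative; the numbers are the ones from your 1-million-write run, measured after `nodetool flush` on an RF=1 test node):

```python
def extrapolate_disk_usage(sample_writes, sample_bytes, target_writes,
                           replication_factor=1):
    """Linearly extrapolate measured on-disk usage from a small test run
    to the full production write count."""
    per_write = sample_bytes / sample_writes
    return per_write * target_writes * replication_factor

MB = 1024 ** 2
TB = 1024 ** 4

# 1 million writes measured at ~90 MB; extrapolate to 150 billion writes
estimate = extrapolate_disk_usage(1_000_000, 90 * MB, 150_000_000_000)
print(f"{estimate / TB:.1f} TB")  # prints "12.9 TB", in line with the ~13 TB figure
```

The linearity assumption is exactly why repeating the run at 10x and 100x and checking that consecutive per-write results converge is a good sanity check.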


> If not, how to estimate storage (How you estimate)?
>

Often, data usage is driven by a single table, and often by a single
column in that table (e.g. a JSON text field of a few KB), in which case the
math is very simple and safe, and gives a good starting point.
Ideally, a simulation should also be run, e.g. using cassandra-stress. The
goal there is usually to confirm throughput / latency, but as a side effect
it also gives you the disk usage.
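As an illustration of that back-of-envelope math (all numbers below are hypothetical, and this deliberately ignores compression and per-cell overhead):

```python
def raw_table_size_bytes(rows, avg_row_bytes, replication_factor=3):
    """Rough, pre-compression size of a table whose footprint is
    dominated by one large column: rows * row size * RF."""
    return rows * avg_row_bytes * replication_factor

GB = 1024 ** 3

# e.g. 500 million rows, each carrying a ~2 KB JSON blob, RF=3
size = raw_table_size_bytes(500_000_000, 2048, 3)
print(f"{size / GB:.0f} GB")  # prints "2861 GB"
```

A figure like this is only a starting point; SSTable compression typically shrinks it, while tombstones and pending compactions can inflate it, which is why the measured simulation remains the better estimate.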

Hope it helps!

Cheers,

Christophe


>
> Thanks in advance
>


-- 

*Christophe Schmitz - **VP Consulting*

AU: +61 4 03751980 / FR: +33 7 82022899
