> Each column has a name of 15 chars (digits) and the same 15 chars in the
> value (also digits).
> Each column should take 30 bytes.
Remember the standard Cassandra per-column overhead, which is, as far as I remember, 15 bytes, so it's 45 bytes in total - 50% more than you estimated, which roughly matches your 3 GB vs 4.5 GB case. There's also a per-row overhead, but I'm not sure about its size in current C* versions - I remember it was about 25 bytes or so some time ago, but it's not important in your case.

Kind regards,
Michał Michalski, michal.michal...@boxever.com

On 13 April 2014 17:48, Yulian Oifa <oifa.yul...@gmail.com> wrote:

> Hello Mark, and thanks for your reply.
> 1) I store it as a UTF8 string. All digits are from 0x30 to 0x39 and should
> take 1 byte each. Since all characters are digits, it should be 15 bytes.
> 2) I will change the data I am storing to decrease the usage; for the value
> I will find some small value to store. Previously I used the same value,
> since this table is an index used only for search purposes and does not
> really have a value.
> 3) You are right, I read and write at quorum, and it was my mistake (I
> thought that if I write at quorum, then data will be written to 2 nodes
> only). If I check the keyspace
>
> create keyspace USER_DATA
>   with placement_strategy = 'NetworkTopologyStrategy'
>   and strategy_options = [{19 : 3}]
>   and durable_writes = true;
>
> it has a replication factor of 3.
> Therefore I have several questions:
> 1) What are the recommended read and write consistency levels and
> replication factor for 3 nodes, with the option of increasing the number of
> servers in the future?
> 2) It still has 1.5x the expected amount of data; how can this be resolved,
> and what is the reason for it?
> 3) I also see that the data is a different size on each node; does that
> mean the servers are out of sync?
>
> Thanks and best regards
> Yulian Oifa
>
>
> On Sun, Apr 13, 2014 at 7:03 PM, Mark Reddy <mark.re...@boxever.com> wrote:
>
>> What are you storing these 15 chars as: string, int, double, etc.? 15
>> chars does not necessarily translate to 15 bytes.
>>
>> You may be mixing up replication factor and quorum when you say "Cassandra
>> cluster has 3 servers, and data is stored in quorum ( 2 servers )." You
>> read and write at QUORUM, which is (RF/2)+1 where RF is your replication
>> factor, and your data is replicated to the number of nodes you specify in
>> your replication factor. Could you clarify?
>>
>> Also, if you are concerned about disk usage, why are you storing the same
>> 15 char value in both the column name and the value? You could just store
>> it as the name and halve your data usage :)
>>
>>
>> On Sun, Apr 13, 2014 at 4:26 PM, Yulian Oifa <oifa.yul...@gmail.com> wrote:
>>
>>> I have a column family with 2 rows.
>>> The 2 rows have 100 million columns overall.
>>> Each column has a name of 15 chars (digits) and the same 15 chars in the
>>> value (also digits).
>>> Each column should take 30 bytes.
>>> Therefore all the data should be approximately 3GB.
>>> The Cassandra cluster has 3 servers, and data is stored at quorum (2
>>> servers).
>>> Therefore each server should have 3GB*2/3=2GB of data for this column
>>> family.
>>> The table is almost never changed; data is only removed from it, which
>>> possibly created tombstones, but that should not increase the usage.
>>> However, when I check the data, I see that each server has more than 4GB
>>> of data (more than twice what it should be).
>>>
>>> server 1:
>>> -rw-r--r-- 1 root root 3506446057 Dec 26 12:02 freeNumbers-g-264-Data.db
>>> -rw-r--r-- 1 root root  814699666 Dec 26 12:24 freeNumbers-g-281-Data.db
>>> -rw-r--r-- 1 root root  198432466 Dec 26 12:27 freeNumbers-g-284-Data.db
>>> -rw-r--r-- 1 root root   35883918 Apr 12 20:07 freeNumbers-g-336-Data.db
>>>
>>> server 2:
>>> -rw-r--r-- 1 root root 3448432307 Dec 26 11:57 freeNumbers-g-285-Data.db
>>> -rw-r--r-- 1 root root  762399716 Dec 26 12:22 freeNumbers-g-301-Data.db
>>> -rw-r--r-- 1 root root  220887062 Dec 26 12:23 freeNumbers-g-304-Data.db
>>> -rw-r--r-- 1 root root   54914466 Dec 26 12:26 freeNumbers-g-306-Data.db
>>> -rw-r--r-- 1 root root   53639516 Dec 26 12:26 freeNumbers-g-305-Data.db
>>> -rw-r--r-- 1 root root   53007967 Jan  8 15:45 freeNumbers-g-314-Data.db
>>> -rw-r--r-- 1 root root     413717 Apr 12 18:33 freeNumbers-g-359-Data.db
>>>
>>> server 3:
>>> -rw-r--r-- 1 root root 4490657264 Apr 11 18:20 freeNumbers-g-358-Data.db
>>> -rw-r--r-- 1 root root     389171 Apr 12 20:58 freeNumbers-g-360-Data.db
>>> -rw-r--r-- 1 root root       4276 Apr 11 18:20 freeNumbers-g-358-Statistics.db
>>> -rw-r--r-- 1 root root       4276 Apr 11 18:24 freeNumbers-g-359-Statistics.db
>>> -rw-r--r-- 1 root root       4276 Apr 12 20:58 freeNumbers-g-360-Statistics.db
>>> -rw-r--r-- 1 root root        976 Apr 11 18:20 freeNumbers-g-358-Filter.db
>>> -rw-r--r-- 1 root root        208 Apr 11 18:24 freeNumbers-g-359-Data.db
>>> -rw-r--r-- 1 root root         78 Apr 11 18:20 freeNumbers-g-358-Index.db
>>> -rw-r--r-- 1 root root         52 Apr 11 18:24 freeNumbers-g-359-Index.db
>>> -rw-r--r-- 1 root root         52 Apr 12 20:58 freeNumbers-g-360-Index.db
>>> -rw-r--r-- 1 root root         16 Apr 11 18:24 freeNumbers-g-359-Filter.db
>>> -rw-r--r-- 1 root root         16 Apr 12 20:58 freeNumbers-g-360-Filter.db
>>>
>>> When I try to compact, I get the following notification from the first
>>> server:
>>>
>>> INFO [CompactionExecutor:1604] 2014-04-13 18:23:07,260
>>> CompactionController.java (line 146) Compacting large row
>>> USER_DATA/freeNumbers:8bdf9678-6d70-11e3-85ab-80e385abf85d (4555076689
>>> bytes) incrementally
>>>
>>> which confirms that there is around 4.5GB of data on that server alone.
>>> Why does Cassandra take up so much data?
>>>
>>> Best regards
>>> Yulian Oifa
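
[Editorial aside] The arithmetic in this thread can be sketched in a few lines. The 15-byte per-column overhead is Michał's recollection and varies by Cassandra version; RF=3 and the 3-node cluster come from the keyspace definition quoted above, so these figures are illustrative, not authoritative:

```python
# Rough sizing sketch for the column family discussed in this thread.
# Assumption: ~15 bytes of per-column overhead (Michał's figure); the
# exact overhead depends on the Cassandra version.

NUM_COLUMNS = 100_000_000   # 100 million columns across 2 rows
NAME_BYTES = 15             # 15 ASCII digits as UTF8, 1 byte each
VALUE_BYTES = 15            # same 15 digits stored as the value
COLUMN_OVERHEAD = 15        # approximate per-column overhead

naive_total = NUM_COLUMNS * (NAME_BYTES + VALUE_BYTES)
with_overhead = NUM_COLUMNS * (NAME_BYTES + VALUE_BYTES + COLUMN_OVERHEAD)

print(naive_total / 1e9)    # 3.0 -> the original 3 GB estimate
print(with_overhead / 1e9)  # 4.5 -> matches the ~4.5 GB compaction log line

# Replication factor, not quorum, determines how many copies sit on disk.
# QUORUM only controls how many replicas must acknowledge each request.
RF = 3
NODES = 3
quorum = RF // 2 + 1                            # 2 replicas with RF=3
per_node_gb = with_overhead * RF / NODES / 1e9  # full copy on every node
print(quorum, per_node_gb)                      # 2 4.5
```

With RF equal to the node count, every node holds a complete replica, so the ~4.5 GB seen per server is expected rather than a sign of nodes being out of sync.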