Hello,
The load of data on the 3 nodes is:

Address      DC  Rack  Status  State   Load      Owns    Token
                                                         113427455640312821154458202477256070485
172.19.10.1  19  10    Up      Normal  22.16 GB  33.33%  0
172.19.10.2  19  10    Up      Normal  19.89 GB  33.33%  56713727820156410577229101238628035242
172.19.10.3  19  10    Up      Normal  30.74 GB  33.33%  113427455640312821154458202477256070485

Best regards
Yulian Oifa

On Sun, Apr 13, 2014 at 9:17 PM, Mark Reddy <mark.re...@boxever.com> wrote:

>> I will change the data I am storing to decrease the usage; in the value I
>> will find some small value to store. Previously I used the same value, since
>> this table is an index used only for searching and does not really have a value.
>
> If you don't need a value, you don't have to store anything. You can store
> the column name and leave the value empty; this is a common practice.
>
>> 1) What would be the recommended read and write consistency and replication
>> factor for 3 nodes, with the option of increasing the number of servers later?
>
> Both consistency level and replication factor are tuneable depending on
> your application constraints. I'd say a CL of QUORUM and an RF of 3 is the
> general practice.
>
>> Still it has 1.5X of overall data. How can this be resolved, and what is
>> the reason for it?
>
> As Michał pointed out, there is a 15 byte column overhead to consider
> here, where:
>
> total_column_size = column_name_size + column_value_size + 15
>
> This link might shed some light on it:
> http://www.datastax.com/documentation/cassandra/1.2/cassandra/architecture/architecturePlanningUserData_t.html
>
>> Also I see that the data is a different size on each node. Does that mean
>> that the servers are out of sync?
>
> How much is it out by? Data size may differ due to deletes, and you
> mentioned you do deletes. What is the output of 'nodetool ring'?
>
> On Sun, Apr 13, 2014 at 6:42 PM, Michal Michalski <
> michal.michal...@boxever.com> wrote:
>
>> > Each column has a name of 15 chars (digits) and the same 15 chars in
>> > the value (also digits).
>> > Each column should take 30 bytes.
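As a quick back-of-the-envelope check of the overhead formula quoted above (a sketch, not part of the original thread, using only the figures given in it: 100 million columns, 15-byte names and values, and the ~15-byte per-column overhead):

```python
# Size estimate using the per-column overhead formula quoted above:
#   total_column_size = column_name_size + column_value_size + 15
# All figures come from this thread; 15 bytes is the approximate
# per-column overhead in Cassandra 1.x.

COLUMNS = 100_000_000   # 100 million columns across 2 rows
NAME_BYTES = 15         # 15 digit chars stored as UTF-8 -> 15 bytes
VALUE_BYTES = 15        # the same 15 digits stored as the value
OVERHEAD_BYTES = 15     # approximate per-column overhead

naive = COLUMNS * (NAME_BYTES + VALUE_BYTES)
actual = COLUMNS * (NAME_BYTES + VALUE_BYTES + OVERHEAD_BYTES)

print(f"naive estimate: {naive / 1e9:.1f} GB")   # 3.0 GB
print(f"with overhead : {actual / 1e9:.1f} GB")  # 4.5 GB
print(f"inflation     : {actual / naive:.2f}x")  # 1.50x
```

This reproduces the ~1.5x inflation discussed in the thread: 3 GB of raw name+value data becomes roughly 4.5 GB on disk, before any per-row overhead or tombstones.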
>> Remember about Cassandra's standard column overhead which is, as far
>> as I remember, 15 bytes, so it's 45 bytes in total - 50% more than you
>> estimated, which roughly matches your 3 GB vs 4.5 GB case.
>>
>> There's also a per-row overhead, but I'm not sure about its size in
>> current C* versions - I remember it was about 25 bytes or so some time
>> ago, but it's not important in your case.
>>
>> Kind regards,
>> Michał Michalski,
>> michal.michal...@boxever.com
>>
>> On 13 April 2014 17:48, Yulian Oifa <oifa.yul...@gmail.com> wrote:
>>
>>> Hello Mark, and thanks for your reply.
>>> 1) I store it as a UTF8 string. All digits are from 0x30 to 0x39 and
>>> should take 1 byte each. Since all characters are digits, the name
>>> should take 15 bytes.
>>> 2) I will change the data I am storing to decrease the usage; in the
>>> value I will find some small value to store. Previously I used the same
>>> value, since this table is an index used only for searching and does
>>> not really have a value.
>>> 3) You are right, I read and write at quorum, and it was my mistake (I
>>> thought that if I write at quorum then data would be written to 2 nodes
>>> only). If I check the keyspace:
>>>
>>> create keyspace USER_DATA
>>>   with placement_strategy = 'NetworkTopologyStrategy'
>>>   and strategy_options = [{19 : 3}]
>>>   and durable_writes = true;
>>>
>>> it has a replication factor of 3.
>>> Therefore I have several questions:
>>> 1) What would be the recommended read and write consistency and
>>> replication factor for 3 nodes, with the option of increasing the
>>> number of servers later?
>>> 2) It still has 1.5X of overall data. How can this be resolved, and
>>> what is the reason for it?
>>> 3) Also I see that the data is a different size on each node. Does that
>>> mean that the servers are out of sync?
>>>
>>> Thanks and best regards
>>> Yulian Oifa
>>>
>>> On Sun, Apr 13, 2014 at 7:03 PM, Mark Reddy <mark.re...@boxever.com> wrote:
>>>
>>>> What are you storing these 15 chars as; string, int, double, etc.?
>>>> 15 chars does not necessarily translate to 15 bytes.
>>>>
>>>> You may be mixing up replication factor and quorum when you say
>>>> "Cassandra cluster has 3 servers, and data is stored in quorum (2
>>>> servers)." You read and write at quorum, (N/2)+1 where N is the
>>>> replication factor, and your data is replicated to the number of nodes
>>>> you specify in your replication factor. Could you clarify?
>>>>
>>>> Also, if you are concerned about disk usage, why are you storing the
>>>> same 15 char value in both the column name and the value? You could
>>>> just store it as the name and halve your data usage :)
>>>>
>>>> On Sun, Apr 13, 2014 at 4:26 PM, Yulian Oifa <oifa.yul...@gmail.com> wrote:
>>>>
>>>>> I have a column family with 2 rows.
>>>>> The 2 rows have 100 million columns overall.
>>>>> Each column has a name of 15 chars (digits) and the same 15 chars in
>>>>> the value (also digits).
>>>>> Each column should take 30 bytes.
>>>>> Therefore all the data should be approximately 3 GB.
>>>>> The Cassandra cluster has 3 servers, and data is stored at quorum
>>>>> (2 servers).
>>>>> Therefore each server should have 3GB*2/3 = 2 GB of data for this
>>>>> column family.
>>>>> The table is almost never changed; data is only removed from it,
>>>>> which possibly creates tombstones, but that should not increase the
>>>>> usage.
>>>>> However, when I check the data I see that each server has more than
>>>>> 4 GB of data (more than twice what it should be).
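The quorum arithmetic discussed above can be sketched as follows (an illustrative sketch, not from the thread; the key point is that QUORUM is derived from the replication factor, not from the cluster size):

```python
# QUORUM is computed from the replication factor (RF), not from the
# total node count: a QUORUM read/write must be acknowledged by
# (RF // 2) + 1 replicas, but the data is still stored on all RF replicas.

def quorum(replication_factor: int) -> int:
    """Number of replicas that must acknowledge a QUORUM read/write."""
    return replication_factor // 2 + 1

rf = 3  # replication factor of the USER_DATA keyspace in this thread
print(quorum(rf))  # 2 acks needed, yet all 3 replicas store the data

# Consequence for disk-usage estimates: with RF=3 on a 3-node cluster,
# every node eventually holds a full copy of the data, so the expected
# per-node size is the whole dataset -- not the 3GB*2/3 computed above.
```

This is why each server here ends up with the full (overhead-inflated) dataset rather than two thirds of it.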
>>>>>
>>>>> server 1:
>>>>> -rw-r--r-- 1 root root 3506446057 Dec 26 12:02 freeNumbers-g-264-Data.db
>>>>> -rw-r--r-- 1 root root  814699666 Dec 26 12:24 freeNumbers-g-281-Data.db
>>>>> -rw-r--r-- 1 root root  198432466 Dec 26 12:27 freeNumbers-g-284-Data.db
>>>>> -rw-r--r-- 1 root root   35883918 Apr 12 20:07 freeNumbers-g-336-Data.db
>>>>>
>>>>> server 2:
>>>>> -rw-r--r-- 1 root root 3448432307 Dec 26 11:57 freeNumbers-g-285-Data.db
>>>>> -rw-r--r-- 1 root root  762399716 Dec 26 12:22 freeNumbers-g-301-Data.db
>>>>> -rw-r--r-- 1 root root  220887062 Dec 26 12:23 freeNumbers-g-304-Data.db
>>>>> -rw-r--r-- 1 root root   54914466 Dec 26 12:26 freeNumbers-g-306-Data.db
>>>>> -rw-r--r-- 1 root root   53639516 Dec 26 12:26 freeNumbers-g-305-Data.db
>>>>> -rw-r--r-- 1 root root   53007967 Jan  8 15:45 freeNumbers-g-314-Data.db
>>>>> -rw-r--r-- 1 root root     413717 Apr 12 18:33 freeNumbers-g-359-Data.db
>>>>>
>>>>> server 3:
>>>>> -rw-r--r-- 1 root root 4490657264 Apr 11 18:20 freeNumbers-g-358-Data.db
>>>>> -rw-r--r-- 1 root root     389171 Apr 12 20:58 freeNumbers-g-360-Data.db
>>>>> -rw-r--r-- 1 root root       4276 Apr 11 18:20 freeNumbers-g-358-Statistics.db
>>>>> -rw-r--r-- 1 root root       4276 Apr 11 18:24 freeNumbers-g-359-Statistics.db
>>>>> -rw-r--r-- 1 root root       4276 Apr 12 20:58 freeNumbers-g-360-Statistics.db
>>>>> -rw-r--r-- 1 root root        976 Apr 11 18:20 freeNumbers-g-358-Filter.db
>>>>> -rw-r--r-- 1 root root        208 Apr 11 18:24 freeNumbers-g-359-Data.db
>>>>> -rw-r--r-- 1 root root         78 Apr 11 18:20 freeNumbers-g-358-Index.db
>>>>> -rw-r--r-- 1 root root         52 Apr 11 18:24 freeNumbers-g-359-Index.db
>>>>> -rw-r--r-- 1 root root         52 Apr 12 20:58 freeNumbers-g-360-Index.db
>>>>> -rw-r--r-- 1 root root         16 Apr 11 18:24 freeNumbers-g-359-Filter.db
>>>>> -rw-r--r-- 1 root root         16 Apr 12 20:58 freeNumbers-g-360-Filter.db
>>>>>
>>>>> When I try to compact, I
>>>>> get the following notification from the first server:
>>>>>
>>>>> INFO [CompactionExecutor:1604] 2014-04-13 18:23:07,260
>>>>> CompactionController.java (line 146) Compacting large row
>>>>> USER_DATA/freeNumbers:8bdf9678-6d70-11e3-85ab-80e385abf85d
>>>>> (4555076689 bytes) incrementally
>>>>>
>>>>> This confirms that there is around 4.5 GB of data on that server alone.
>>>>> Why does Cassandra take up so much space?
>>>>>
>>>>> Best regards
>>>>> Yulian Oifa
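As a closing cross-check (a sketch added here, not part of the thread): the row size reported in that compaction log line lines up closely with the 45-bytes-per-column estimate discussed earlier, which supports the overhead explanation.

```python
# Cross-check: the "Compacting large row" log line reports 4555076689
# bytes for a single row. Compare with the earlier estimate of 45 bytes
# per column (15 name + 15 value + 15 overhead) for 100 million columns.
# Since the log line covers one of the two rows, the closeness suggests
# that row holds the bulk of the columns.

reported = 4_555_076_689                  # bytes, from the compaction log
estimated = 100_000_000 * (15 + 15 + 15)  # 4.5e9 bytes

print(f"reported : {reported / 1e9:.2f} GB")               # 4.56 GB
print(f"estimated: {estimated / 1e9:.2f} GB")              # 4.50 GB
print(f"delta    : {(reported - estimated) / 1e6:.0f} MB") # 55 MB
```

The remaining ~55 MB gap is plausibly per-row overhead plus tombstones from the deletes mentioned in the thread.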