Hello,
The load of data on the 3 nodes is:

Address      DC  Rack  Status  State   Load      Owns    Token
                                                         113427455640312821154458202477256070485
172.19.10.1  19  10    Up      Normal  22.16 GB  33.33%  0
172.19.10.2  19  10    Up      Normal  19.89 GB  33.33%  56713727820156410577229101238628035242
172.19.10.3  19  10    Up      Normal  30.74 GB  33.33%  113427455640312821154458202477256070485

Best regards
Yulian Oifa

On Sun, Apr 13, 2014 at 9:17 PM, Mark Reddy <mark.re...@boxever.com> wrote:

>> I will change the data I am storing to decrease the usage; in the value I
>> will find some small value to store. Previously I used the same value, since
>> this table is an index used only for searching and does not really have a value.
>
> If you don't need a value, you don't have to store anything. You can store
> the column name and leave the value empty; this is a common practice.
>
>> 1) What would be the recommended read and write consistency and replication
>> factor for 3 nodes, with the option of increasing the number of servers later?
>
> Both consistency level and replication factor are tuneable depending on
> your application constraints. I'd say a CL of QUORUM and an RF of 3 is the
> general practice.
>
>> Still it has 1.5X of overall data. How can this be resolved, and what is
>> the reason for it?
>
> As Michał pointed out, there is a 15 byte column overhead to consider
> here, where:
>
> total_column_size = column_name_size + column_value_size + 15
>
> This link might shed some light on it:
> http://www.datastax.com/documentation/cassandra/1.2/cassandra/architecture/architecturePlanningUserData_t.html
>
>> Also I see that the data is a different size on each node. Does that mean
>> that the servers are out of sync?
>
> How much is it out by? Data size may differ due to deletes, and you
> mentioned you do deletes. What is the output of 'nodetool ring'?
>
> On Sun, Apr 13, 2014 at 6:42 PM, Michal Michalski <
> michal.michal...@boxever.com> wrote:
>
>> > Each column has a name of 15 chars (digits) and the same 15 chars in
>> > the value (also digits).
>> > Each column should take 30 bytes.
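As a quick back-of-the-envelope check of the overhead formula quoted above (a sketch, not part of the original thread, using only the figures given in it: 100 million columns, 15-byte names and values, and the ~15-byte per-column overhead):

```python
# Size estimate using the per-column overhead formula quoted above:
#   total_column_size = column_name_size + column_value_size + 15
# All figures come from this thread; 15 bytes is the approximate
# per-column overhead in Cassandra 1.x.

COLUMNS = 100_000_000   # 100 million columns across 2 rows
NAME_BYTES = 15         # 15 digit chars stored as UTF-8 -> 15 bytes
VALUE_BYTES = 15        # the same 15 digits stored as the value
OVERHEAD_BYTES = 15     # approximate per-column overhead

naive = COLUMNS * (NAME_BYTES + VALUE_BYTES)
actual = COLUMNS * (NAME_BYTES + VALUE_BYTES + OVERHEAD_BYTES)

print(f"naive estimate: {naive / 1e9:.1f} GB")   # 3.0 GB
print(f"with overhead : {actual / 1e9:.1f} GB")  # 4.5 GB
print(f"inflation     : {actual / naive:.2f}x")  # 1.50x
```

This reproduces the ~1.5x inflation discussed in the thread: 3 GB of raw name+value data becomes roughly 4.5 GB on disk, before any per-row overhead or tombstones.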
>> Remember about Cassandra's standard column overhead which is, as far
>> as I remember, 15 bytes, so it's 45 bytes in total - 50% more than you
>> estimated, which roughly matches your 3 GB vs 4.5 GB case.
>>
>> There's also a per-row overhead, but I'm not sure about its size in
>> current C* versions - I remember it was about 25 bytes or so some time
>> ago, but it's not important in your case.
>>
>> Kind regards,
>> Michał Michalski,
>> michal.michal...@boxever.com
>>
>> On 13 April 2014 17:48, Yulian Oifa <oifa.yul...@gmail.com> wrote:
>>
>>> Hello Mark, and thanks for your reply.
>>> 1) I store it as a UTF8 string. All digits are from 0x30 to 0x39 and
>>> should take 1 byte each. Since all characters are digits, the name
>>> should take 15 bytes.
>>> 2) I will change the data I am storing to decrease the usage; in the
>>> value I will find some small value to store. Previously I used the same
>>> value, since this table is an index used only for searching and does
>>> not really have a value.
>>> 3) You are right, I read and write at quorum, and it was my mistake (I
>>> thought that if I write at quorum then data would be written to 2 nodes
>>> only). If I check the keyspace:
>>>
>>> create keyspace USER_DATA
>>>   with placement_strategy = 'NetworkTopologyStrategy'
>>>   and strategy_options = [{19 : 3}]
>>>   and durable_writes = true;
>>>
>>> it has a replication factor of 3.
>>> Therefore I have several questions:
>>> 1) What would be the recommended read and write consistency and
>>> replication factor for 3 nodes, with the option of increasing the
>>> number of servers later?
>>> 2) It still has 1.5X of overall data. How can this be resolved, and
>>> what is the reason for it?
>>> 3) Also I see that the data is a different size on each node. Does that
>>> mean that the servers are out of sync?
>>>
>>> Thanks and best regards
>>> Yulian Oifa
>>>
>>> On Sun, Apr 13, 2014 at 7:03 PM, Mark Reddy <mark.re...@boxever.com> wrote:
>>>
>>>> What are you storing these 15 chars as; string, int, double, etc.?
>>>> 15 chars does not necessarily translate to 15 bytes.
>>>>
>>>> You may be mixing up replication factor and quorum when you say
>>>> "Cassandra cluster has 3 servers, and data is stored in quorum (2
>>>> servers)." You read and write at quorum, (N/2)+1 where N is the
>>>> replication factor, and your data is replicated to the number of nodes
>>>> you specify in your replication factor. Could you clarify?
>>>>
>>>> Also, if you are concerned about disk usage, why are you storing the
>>>> same 15 char value in both the column name and the value? You could
>>>> just store it as the name and halve your data usage :)
>>>>
>>>> On Sun, Apr 13, 2014 at 4:26 PM, Yulian Oifa <oifa.yul...@gmail.com> wrote:
>>>>
>>>>> I have a column family with 2 rows.
>>>>> The 2 rows have 100 million columns overall.
>>>>> Each column has a name of 15 chars (digits) and the same 15 chars in
>>>>> the value (also digits).
>>>>> Each column should take 30 bytes.
>>>>> Therefore all the data should be approximately 3 GB.
>>>>> The Cassandra cluster has 3 servers, and data is stored at quorum
>>>>> (2 servers).
>>>>> Therefore each server should have 3GB*2/3 = 2 GB of data for this
>>>>> column family.
>>>>> The table is almost never changed; data is only removed from it,
>>>>> which possibly creates tombstones, but that should not increase the
>>>>> usage.
>>>>> However, when I check the data I see that each server has more than
>>>>> 4 GB of data (more than twice what it should be).
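The quorum arithmetic discussed above can be sketched as follows (an illustrative sketch, not from the thread; the key point is that QUORUM is derived from the replication factor, not from the cluster size):

```python
# QUORUM is computed from the replication factor (RF), not from the
# total node count: a QUORUM read/write must be acknowledged by
# (RF // 2) + 1 replicas, but the data is still stored on all RF replicas.

def quorum(replication_factor: int) -> int:
    """Number of replicas that must acknowledge a QUORUM read/write."""
    return replication_factor // 2 + 1

rf = 3  # replication factor of the USER_DATA keyspace in this thread
print(quorum(rf))  # 2 acks needed, yet all 3 replicas store the data

# Consequence for disk-usage estimates: with RF=3 on a 3-node cluster,
# every node eventually holds a full copy of the data, so the expected
# per-node size is the whole dataset -- not the 3GB*2/3 computed above.
```

This is why each server here ends up with the full (overhead-inflated) dataset rather than two thirds of it.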
>>>>>
>>>>> server 1:
>>>>> -rw-r--r-- 1 root root 3506446057 Dec 26 12:02 freeNumbers-g-264-Data.db
>>>>> -rw-r--r-- 1 root root  814699666 Dec 26 12:24 freeNumbers-g-281-Data.db
>>>>> -rw-r--r-- 1 root root  198432466 Dec 26 12:27 freeNumbers-g-284-Data.db
>>>>> -rw-r--r-- 1 root root   35883918 Apr 12 20:07 freeNumbers-g-336-Data.db
>>>>>
>>>>> server 2:
>>>>> -rw-r--r-- 1 root root 3448432307 Dec 26 11:57 freeNumbers-g-285-Data.db
>>>>> -rw-r--r-- 1 root root  762399716 Dec 26 12:22 freeNumbers-g-301-Data.db
>>>>> -rw-r--r-- 1 root root  220887062 Dec 26 12:23 freeNumbers-g-304-Data.db
>>>>> -rw-r--r-- 1 root root   54914466 Dec 26 12:26 freeNumbers-g-306-Data.db
>>>>> -rw-r--r-- 1 root root   53639516 Dec 26 12:26 freeNumbers-g-305-Data.db
>>>>> -rw-r--r-- 1 root root   53007967 Jan  8 15:45 freeNumbers-g-314-Data.db
>>>>> -rw-r--r-- 1 root root     413717 Apr 12 18:33 freeNumbers-g-359-Data.db
>>>>>
>>>>> server 3:
>>>>> -rw-r--r-- 1 root root 4490657264 Apr 11 18:20 freeNumbers-g-358-Data.db
>>>>> -rw-r--r-- 1 root root     389171 Apr 12 20:58 freeNumbers-g-360-Data.db
>>>>> -rw-r--r-- 1 root root       4276 Apr 11 18:20 freeNumbers-g-358-Statistics.db
>>>>> -rw-r--r-- 1 root root       4276 Apr 11 18:24 freeNumbers-g-359-Statistics.db
>>>>> -rw-r--r-- 1 root root       4276 Apr 12 20:58 freeNumbers-g-360-Statistics.db
>>>>> -rw-r--r-- 1 root root        976 Apr 11 18:20 freeNumbers-g-358-Filter.db
>>>>> -rw-r--r-- 1 root root        208 Apr 11 18:24 freeNumbers-g-359-Data.db
>>>>> -rw-r--r-- 1 root root         78 Apr 11 18:20 freeNumbers-g-358-Index.db
>>>>> -rw-r--r-- 1 root root         52 Apr 11 18:24 freeNumbers-g-359-Index.db
>>>>> -rw-r--r-- 1 root root         52 Apr 12 20:58 freeNumbers-g-360-Index.db
>>>>> -rw-r--r-- 1 root root         16 Apr 11 18:24 freeNumbers-g-359-Filter.db
>>>>> -rw-r--r-- 1 root root         16 Apr 12 20:58 freeNumbers-g-360-Filter.db
>>>>>
>>>>> When I try to compact, I
>>>>> get the following notification from the first server:
>>>>>
>>>>> INFO [CompactionExecutor:1604] 2014-04-13 18:23:07,260
>>>>> CompactionController.java (line 146) Compacting large row
>>>>> USER_DATA/freeNumbers:8bdf9678-6d70-11e3-85ab-80e385abf85d
>>>>> (4555076689 bytes) incrementally
>>>>>
>>>>> This confirms that there is around 4.5 GB of data on that server alone.
>>>>> Why does Cassandra take up so much space?
>>>>>
>>>>> Best regards
>>>>> Yulian Oifa
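As a closing cross-check (a sketch added here, not part of the thread): the row size reported in that compaction log line lines up closely with the 45-bytes-per-column estimate discussed earlier, which supports the overhead explanation.

```python
# Cross-check: the "Compacting large row" log line reports 4555076689
# bytes for a single row. Compare with the earlier estimate of 45 bytes
# per column (15 name + 15 value + 15 overhead) for 100 million columns.
# Since the log line covers one of the two rows, the closeness suggests
# that row holds the bulk of the columns.

reported = 4_555_076_689                  # bytes, from the compaction log
estimated = 100_000_000 * (15 + 15 + 15)  # 4.5e9 bytes

print(f"reported : {reported / 1e9:.2f} GB")               # 4.56 GB
print(f"estimated: {estimated / 1e9:.2f} GB")              # 4.50 GB
print(f"delta    : {(reported - estimated) / 1e6:.0f} MB") # 55 MB
```

The remaining ~55 MB gap is plausibly per-row overhead plus tombstones from the deletes mentioned in the thread.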