Re: Netstats > 100% streaming

2014-11-03 Thread Mark Reddy
Hi Eric,

It looks like you are running into CASSANDRA-7878, which is fixed
in 2.0.11 / 2.1.1.


Mark

On 1 November 2014 14:08, Eric Stevens  wrote:

> We've been commissioning some new nodes on a 2.0.10 community edition
> cluster, and we're seeing streams that look like they're shipping way more
> data than they ought for individual files during bootstrap.
>
>
> /var/lib/cassandra/data/x/y/x-y-jb-11748-Data.db
> 3756423/3715409 bytes(101%) sent to /1.2.3.4
>
> /var/lib/cassandra/data/x/y/x-y-jb-11043-Data.db
> 584745/570432 bytes(102%) sent to /1.2.3.4
> /var/lib/cassandra/data/x/z/x-z-jb-525-Data.db
> 13020828/11141590 bytes(116%) sent to /1.2.3.4
> /var/lib/cassandra/data/x/w/x-w-jb-539-Data.db
> 1044124/51404 bytes(2031%) sent to /1.2.3.4
> /var/lib/cassandra/data/x/v/x-v-jb-546-Data.db
> 971447/22253 bytes(4365%) sent to /1.2.3.4
>
> /var/lib/cassandra/data/x/y/x-y-jb-10404-Data.db
> 6225920/23215181 bytes(26%) sent to /1.2.3.4
>
> Has anyone else seen something like this, and is this something we should
> be worried about?  I haven't been able to find any information about this
> symptom.
>
> -Eric
>


Client-side compression, cassandra or both?

2014-11-03 Thread Robin Verlangen
Hi there,

We're working on a project which is going to store a lot of JSON objects in
Cassandra. A large piece of each object (90%) consists of an array of
integers, which in many cases contains long runs of zeroes.

The average JSON object is 4 KB in size, and just under 100 bytes once
gzipped (the default compression).

My question is, should we compress client-side (literally converting JSON
string to compressed gzip bytes), let Cassandra do the work, or do both?

From my point of view I think Cassandra would be better, as it could
compress beyond a single value, using large blocks within a row / SSTable.
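The block-level advantage can be sketched with a quick experiment; the
document shapes below are hypothetical, but the effect is the one described:

```python
import gzip
import json

# Hypothetical batch of similar small JSON documents, heavy with zeroes.
docs = [json.dumps({"id": i, "values": [0] * 100}).encode("utf-8")
        for i in range(100)]

# Compressing each value on its own vs. compressing one large block --
# roughly per-value client-side gzip vs. SSTable chunk compression.
per_value = sum(len(gzip.compress(d)) for d in docs)
per_block = len(gzip.compress(b"".join(docs)))

# Shared context across documents compresses better than isolated values.
print(per_block < per_value)
```

Each independent stream pays header overhead and cannot reuse redundancy
across documents, which is the intuition behind letting the server compress
large blocks.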

Thank you in advance for your help.

Best regards,

Robin Verlangen
*Chief Data Architect*

W http://www.robinverlangen.nl
E ro...@us2.nl


*What is CloudPelican?*

Disclaimer: The information contained in this message and attachments is
intended solely for the attention and use of the named addressee and may be
confidential. If you are not the intended recipient, you are reminded that
the information remains the property of the sender. You must not use,
disclose, distribute, copy, print or rely on this e-mail. If you have
received this message in error, please contact the sender immediately and
irrevocably delete this message and any copies.


Re: Client-side compression, cassandra or both?

2014-11-03 Thread DuyHai Doan
Hello Robin

 You have many options for compression in C*:

1) Serialize to bytes instead of JSON, to save a lot of space lost to
String encoding. Of course the data will then be opaque and not human-readable.

2) Activate client-to-node data compression. In this case, do not forget to
ship the LZ4 or Snappy dependency on the client side.

On the server-side, data compression is active by default using LZ4 when
you're creating a new table so there is pretty much nothing to do.

 It's up to you to decide whether the compression-ratio difference
between gzip and LZ4 makes client-side compression worth it over relying on
C* compression.
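The ratio difference is easy to measure on a representative payload. A
rough sketch (the document shape is made up, and stdlib zlib at level 1
stands in for a fast codec such as LZ4, which is not in the standard
library):

```python
import gzip
import json
import zlib

# Hypothetical document resembling the one described in the thread:
# ~90% an integer array with long runs of zeroes.
doc = {"id": 42, "values": [0] * 900 + list(range(100))}
raw = json.dumps(doc).encode("utf-8")

tight = len(gzip.compress(raw, compresslevel=9))  # slower, better ratio
fast = len(zlib.compress(raw, level=1))           # fast-codec stand-in

print(len(raw), tight, fast)
```

On highly repetitive data like this, both codecs shrink the payload
dramatically, so the absolute difference between them may matter less than
the CPU cost.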


Regards


On Mon, Nov 3, 2014 at 3:51 PM, Robin Verlangen  wrote:

> Hi there,
>
> We're working on a project which is going to store a lot of JSON objects
> in Cassandra. A large piece of this (90%) consists of an array of integers,
> of which in a lot of cases there are a bunch of zeroes.
>
> The average JSON is 4KB in size, and once GZIP (default compression) just
> under 100 bytes.
>
> My question is, should we compress client-side (literally converting JSON
> string to compressed gzip bytes), let Cassandra do the work, or do both?
>
> From my point of view I think Cassandra would be better, as it could
> compress beyond a single value, using large blocks within a row / SSTable.
>
> Thank you in advance for your help.
>
> Best regards,
>
> Robin Verlangen
> *Chief Data Architect*
>
> W http://www.robinverlangen.nl
> E ro...@us2.nl
>
> 
> *What is CloudPelican? *
>
>


new data not flushed to sstables

2014-11-03 Thread Sebastian Martinka
System and Keyspace Information:
4 Nodes
Cassandra 2.0.9
cqlsh 4.1.1
CQL spec 3.1.1
Thrift protocol 19.39.0

java version "1.8.0_20"
Java(TM) SE Runtime Environment (build 1.8.0_20-b26)
Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)

CREATE KEYSPACE restore_test WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': '3'};

CREATE TABLE inkr_test (
  objid int,
  creation_date timestamp,
  data text,
  PRIMARY KEY ((objid))
) WITH
  bloom_filter_fp_chance=0.01 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.10 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.00 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};
--

For backup and incremental-restore tests, I created some rows.
cqlsh:restore_test> insert into inkr_test (objid, creation_date, data) VALUES 
(1, dateof(now()),'ini. Load');
cqlsh:restore_test> insert into inkr_test (objid, creation_date, data) VALUES 
(2, dateof(now()),'ini. Load');
cqlsh:restore_test> insert into inkr_test (objid, creation_date, data) VALUES 
(3, dateof(now()),'ini. Load');
cqlsh:restore_test> insert into inkr_test (objid, creation_date, data) VALUES 
(4, dateof(now()),'ini. Load');
cqlsh:restore_test> insert into inkr_test (objid, creation_date, data) VALUES 
(5, dateof(now()),'ini. Load');
cqlsh:restore_test> insert into inkr_test (objid, creation_date, data) VALUES 
(6, dateof(now()),'ini. Load');
cqlsh:restore_test> insert into inkr_test (objid, creation_date, data) VALUES 
(7, dateof(now()),'ini. Load');

cqlsh:restore_test> Select * from inkr_test;

objid | creation_date| data
---+--+---
 5 | 2014-10-29 06:20:06+0100 | ini. Load
 1 | 2014-10-29 06:19:50+0100 | ini. Load
 2 | 2014-10-29 06:19:56+0100 | ini. Load
 4 | 2014-10-29 06:20:02+0100 | ini. Load
 7 | 2014-10-29 06:20:13+0100 | ini. Load
 6 | 2014-10-29 06:20:09+0100 | ini. Load
 3 | 2014-10-29 06:19:59+0100 | ini. Load

(7 rows)

Next, I executed nodetool flush on all nodes to write the data to disk. 
After that, I inserted more rows.
cqlsh:restore_test> insert into inkr_test (objid, creation_date, data) VALUES 
(8, dateof(now()),'nach Backup');
cqlsh:restore_test> insert into inkr_test (objid, creation_date, data) VALUES 
(9, dateof(now()),'nach Backup');
cqlsh:restore_test> insert into inkr_test (objid, creation_date, data) VALUES 
(10, dateof(now()),'nach Backup');
cqlsh:restore_test> Select * from inkr_test;

objid | creation_date| data
---+--+-
 5 | 2014-10-29 06:20:06+0100 |   ini. Load
10 | 2014-10-29 06:44:12+0100 | nach Backup
 1 | 2014-10-29 06:19:50+0100 |   ini. Load
 8 | 2014-10-29 06:44:00+0100 | nach Backup
 2 | 2014-10-29 06:19:56+0100 |   ini. Load
 4 | 2014-10-29 06:20:02+0100 |   ini. Load
 7 | 2014-10-29 06:20:13+0100 |   ini. Load
 6 | 2014-10-29 06:20:09+0100 |   ini. Load
 9 | 2014-10-29 06:44:04+0100 | nach Backup
 3 | 2014-10-29 06:19:59+0100 |   ini. Load

(10 rows)

Now, I executed nodetool flush only on node1 and checked the content from the 
created sstables:

[root@dev-stage-cassandra1 backup]# ll /opt/data/restore_test/inkr_test/
total 68
drwxr-xr-x 2 cassandra cassandra 4096 Oct 29 06:45 backups
-rw-r--r-- 1 cassandra cassandra   43 Oct 29 06:35 
restore_test-inkr_test-jb-2-CompressionInfo.db
-rw-r--r-- 1 cassandra cassandra  213 Oct 29 06:35 
restore_test-inkr_test-jb-2-Data.db
-rw-r--r-- 1 cassandra cassandra  336 Oct 29 06:35 
restore_test-inkr_test-jb-2-Filter.db
-rw-r--r-- 1 cassandra cassandra   90 Oct 29 06:35 
restore_test-inkr_test-jb-2-Index.db
-rw-r--r-- 1 cassandra cassandra 4393 Oct 29 06:35 
restore_test-inkr_test-jb-2-Statistics.db
-rw-r--r-- 1 cassandra cassandra   80 Oct 29 06:35 
restore_test-inkr_test-jb-2-Summary.db
-rw-r--r-- 1 cassandra cassandra   79 Oct 29 06:35 
restore_test-inkr_test-jb-2-TOC.txt
-rw-r--r-- 2 cassandra cassandra   43 Oct 29 06:45 
restore_test-inkr_test-jb-3-CompressionInfo.db
-rw-r--r-- 2 cassandra cassandra  130 Oct 29 06:45 
restore_test-inkr_test-jb-3-Data.db
-rw-r--r-- 2 cassandra cassandra   16 Oct 29 06:45 
restore_test-inkr_test-jb-3-Filter.db
-rw-r--r-- 2 cassandra cassandra   36 Oct 29 06:45 
restore_test-inkr_test-jb-3-Index.db
-rw-r--r-- 2 cassandra cassandra 4389 Oct 29 06:45 
restore_test-inkr_test-jb-3-Statistics.db
-rw-r--r-- 2 cassandra cassandra   80 Oct 29 06:45 
restore_test-inkr_test-jb-3-Summary.db
-rw-r--r-- 2 cassandra cassandra   79 Oct 29 06:45 
restore_test-inkr_test-jb-3-TOC.txt
[root@dev-stage-cassandra1 back

Re: Netstats > 100% streaming

2014-11-03 Thread Eric Stevens
Thanks Mark, that does seem like the right ticket.  Thanks for the info, I
wasn't successful finding that =)

On Mon, Nov 3, 2014 at 5:00 AM, Mark Reddy  wrote:

> HI Eric,
>
> It looks like you are running into CASSANDRA-7878
>  which is fixed
> in 2.0.11 / 2.1.1
>
>
> Mark
>
> On 1 November 2014 14:08, Eric Stevens  wrote:
>
>> We've been commissioning some new nodes on a 2.0.10 community edition
>> cluster, and we're seeing streams that look like they're shipping way more
>> data than they ought for individual files during bootstrap.
>>
>>
>> /var/lib/cassandra/data/x/y/x-y-jb-11748-Data.db
>> 3756423/3715409 bytes(101%) sent to /1.2.3.4
>>
>> /var/lib/cassandra/data/x/y/x-y-jb-11043-Data.db
>> 584745/570432 bytes(102%) sent to /1.2.3.4
>>
>> /var/lib/cassandra/data/x/z/x-z-jb-525-Data.db
>> 13020828/11141590 bytes(116%) sent to /1.2.3.4
>>
>> /var/lib/cassandra/data/x/w/x-w-jb-539-Data.db
>> 1044124/51404 bytes(2031%) sent to /1.2.3.4
>>
>> /var/lib/cassandra/data/x/v/x-v-jb-546-Data.db 971447/22253
>> bytes(4365%) sent to /1.2.3.4
>>
>> /var/lib/cassandra/data/x/y/x-y-jb-10404-Data.db
>> 6225920/23215181 bytes(26%) sent to /1.2.3.4
>>
>> Has anyone else seen something like this, and is this something we should
>> be worried about?  I haven't been able to find any information about this
>> symptom.
>>
>> -Eric
>>
>
>


Re: Client-side compression, cassandra or both?

2014-11-03 Thread graham sanderson
I wouldn’t do both.
Unless a little server CPU or disk space is an issue (and you'd have to 
measure it - I imagine it is probably not significant - as you say C* has 
more context, and hopefully most things can compress "0, " repeatedly), I 
wouldn't bother to compress yourself. Compression across the wire is good of 
course (client-side CPU is a wash, and server CPU we already mentioned anyway).

On a side note, perhaps your object model should address the redundancy, though 
of course this is perhaps equivalent to the complexity of doing client side 
compression, IDK.

We do have one table where we keep compressed blobs, but that is because those 
are natural from an application perspective, and so we just turn off C* table 
compression for those (there isn’t much other data there).

Note, I haven't been tracking it recently, but certainly in the past the 
compression code path on the C* side had to do more data copies; this is not 
likely significant unless your case is special. I believe this has been/will 
be improved in 2.1 or 3.

> On Nov 3, 2014, at 9:40 AM, DuyHai Doan  wrote:
> 
> Hello Robin
> 
>  You have many options for compression in C*:
> 
> 1) Serialized in bytes instead of JSON, to save a lot of space due to String 
> encoding. Of course the data will be opaque and not human readable
> 
> 2) Activate client-node data compression. In this case, do not forget to ship 
> LZ4 or SNAPPY dependency on the client side. 
> 
> On the server-side, data compression is active by default using LZ4 when 
> you're creating a new table so there is pretty much nothing to do.
> 
>  It's up to you to consider whether the compression ratio difference between 
> Gzip and LZ4 does worth relying on C* compression.
> 
> 
> Regards
> 
> 
> On Mon, Nov 3, 2014 at 3:51 PM, Robin Verlangen  > wrote:
> Hi there,
> 
> We're working on a project which is going to store a lot of JSON objects in 
> Cassandra. A large piece of this (90%) consists of an array of integers, of 
> which in a lot of cases there are a bunch of zeroes. 
> 
> The average JSON is 4KB in size, and once GZIP (default compression) just 
> under 100 bytes. 
> 
> My question is, should we compress client-side (literally converting JSON 
> string to compressed gzip bytes), let Cassandra do the work, or do both?
> 
> From my point of view I think Cassandra would be better, as it could compress 
> beyond a single value, using large blocks within a row / SSTable.
> 
> Thank you in advance for your help.
> 
> Best regards, 
> 
> Robin Verlangen
> Chief Data Architect
> 
> W http://www.robinverlangen.nl 
> E ro...@us2.nl 
> 
>  
> What is CloudPelican? 
> 
> 





Re: new data not flushed to sstables

2014-11-03 Thread Bryan Talbot
On Mon, Nov 3, 2014 at 7:44 AM, Sebastian Martinka <
sebastian.marti...@mercateo.com> wrote:

>  System and Keyspace Information:
>
> 4 Nodes
>
>

> CREATE KEYSPACE restore_test WITH replication = {  'class':
> 'SimpleStrategy',
>
>   'replication_factor': '3'};
>
>
>
>
>
> I assumed that a flush writes all data to the sstables and that we can use
> it for backup and restore. Did I forget something or is my understanding
> wrong?
>
>
>
I think you forgot that with N=4 and RF=3, each node will contain
approximately 75% of the data. From a quick eyeball check of the json-dump
you provided, it looks like partition-key values are present on 3 nodes
and absent from 1, which is exactly as expected.
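The 75% figure is just RF/N. A tiny sketch of the expectation, assuming
SimpleStrategy and evenly distributed token ranges (an idealization):

```python
def node_fraction(rf: int, nodes: int) -> float:
    """Fraction of the full dataset each node is expected to hold,
    assuming replicas are spread evenly across the cluster."""
    return rf / nodes

# With RF=3 on a 4-node cluster, each node holds ~75% of the data,
# so any single flushed node's sstables will be missing ~25% of rows.
print(node_fraction(3, 4))
```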

-Bryan


Unsubscribe

2014-11-03 Thread Malay Nilabh
Hi

It was great to be part of this group. Thanks for helping out. Please 
unsubscribe me now.

Regards,
Malay Nilabh
BIDW BU/ Big Data CoE
L&T Infotech Ltd, Hinjewadi,Pune
Tel: +91-20-66571746
Mobile: +91-73-879-00727
Email: malay.nil...@lntinfotech.com
|| Save Paper - Save Trees ||



The contents of this e-mail and any attachment(s) may contain confidential or 
privileged information for the intended recipient(s). Unintended recipients are 
prohibited from taking action on the basis of information in this e-mail and 
using or disseminating the information, and must notify the sender and delete 
it from their system. L&T Infotech will not accept responsibility or liability 
for the accuracy or completeness of, or the presence of any virus or disabling 
code in this e-mail.