Re: DataModelling to query date range

2016-03-24 Thread Chris Martin
Hi Vidur,

I had a go at your solution but the problem is that it doesn't match routes
which are valid all throughout the range queried.  For example, if I have a
route that is valid for all of Jan 2016, I will have a table that looks
something like this:

start    | end        | valid
New York | Washington | 2016-01-01
New York | Washington | 2016-01-31

So if I query for ranges that have at least one bound outside Jan (e.g. Jan
15 - Feb 15) then the query you gave will work fine.  If, however, I query
for a range that is completely inside Jan, e.g. all routes valid on Jan 15th,
then I think I'll end up with a query like:

SELECT * from routes where start = 'New York' and end = 'Washington'
and valid <= 2016-01-15 and valid >= 2016-01-15.

which will return 0 results as it would only match routes that have a valid
of 2016-01-15 exactly.

 thanks,

Chris


On Wed, Mar 23, 2016 at 11:19 PM, Vidur Malik  wrote:

> Flip the problem over. Instead of storing validTo and validFrom, simply
> store a valid field and partition by (start, end). This may sound wasteful,
> but disk is cheap:
>
> CREATE TABLE routes (
> start text,
> end text,
> valid timestamp,
> PRIMARY KEY ((start, end), valid)
> );
>
> Now, you can execute something like:
>
> SELECT * from routes where start = 'New York' and end = 'Washington' and 
> valid <= 2016-01-31 and valid >= 2016-01-01.
>
>
> On Wed, Mar 23, 2016 at 5:08 PM, Chris Martin 
> wrote:
>
>> Hi all,
>>
>> I have a table that represents a train timetable and looks a bit like
>> this:
>>
>> CREATE TABLE routes (
>> start text,
>> end text,
>> validFrom timestamp,
>> validTo timestamp,
>> PRIMARY KEY (start, end, validFrom, validTo)
>> );
>>
>> In this case validFrom is the date that the route becomes valid and
>> validTo is the date that the route stops being valid.
>>
>> If this was SQL I could write a query to find all valid routes between
>> New York and Washington from Jan 1st 2016 to Jan 31st 2016 using something
>> like:
>>
>> SELECT * from routes where start = 'New York' and end = 'Washington' and 
>> validFrom <= 2016-01-31 and validTo >= 2016-01-01.
>>
>> As far as I can tell such a query is impossible with CQL and my current
>> table structure.  I'm considering running a query like:
>>
>> SELECT * from routes where start = 'New York' and end = 'Washington' and 
>> validFrom <= 2016-01-31
>>
>> And then filtering the rest of the data app side.  This doesn't seem
>> ideal though as I'm going to end up fetching much more data (probably
>> around an order of magnitude more) from Cassandra than I really want.
>>
>> Is there a better way to model the data?
>>
>> thanks,
>>
>> Chris
>>
>>
>>
>>
>
>
> --
>
> Vidur Malik
>
>


Re: DataModelling to query date range

2016-03-24 Thread Chris Martin
Ah - that looks interesting!  I'm actually still on Cassandra 2.x but I was
planning on upgrading anyway.  Once I do so I'll check this one out.
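
For reference, a rough, untested sketch of what the SASI approach might look
like once on 3.4+ (validTo is moved out of the primary key so it can be
indexed; the index name is made up, and whether this combination of
restrictions is accepted without ALLOW FILTERING would need verifying):

CREATE TABLE routes (
start text,
end text,
validFrom timestamp,
validTo timestamp,
PRIMARY KEY ((start, end), validFrom)
);

CREATE CUSTOM INDEX routes_validto_idx ON routes (validTo)
USING 'org.apache.cassandra.index.sasi.SASIIndex';

SELECT * from routes where start = 'New York' and end = 'Washington'
and validFrom <= '2016-01-31' and validTo >= '2016-01-01';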


Chris


On Thu, Mar 24, 2016 at 2:57 AM, Henry M  wrote:

> I haven't tried the new SASI indexer but it may help:
> https://github.com/apache/cassandra/blob/trunk/doc/SASI.md
>
>
> On Wed, Mar 23, 2016 at 2:08 PM, Chris Martin 
> wrote:
>
>> Hi all,
>>
>> I have a table that represents a train timetable and looks a bit like
>> this:
>>
>> CREATE TABLE routes (
>> start text,
>> end text,
>> validFrom timestamp,
>> validTo timestamp,
>> PRIMARY KEY (start, end, validFrom, validTo)
>> );
>>
>> In this case validFrom is the date that the route becomes valid and
>> validTo is the date that the route that stops becoming valid.
>>
>> If this was SQL I could write a query to find all valid routes between
>> New York and Washington from Jan 1st 2016 to Jan 31st 2016 using something
>> like:
>>
>> SELECT * from routes where start = 'New York' and end = 'Washington' and 
>> validFrom <= 2016-01-31 and validTo >= 2016-01-01.
>>
>> As far as I can tell such a query is impossible with CQL and my current
>> table structure.  I'm considering running a query like:
>>
>> SELECT * from routes where start = 'New York' and end = 'Washington' and 
>> validFrom <= 2016-01-31
>>
>> And then filtering the rest of the data app side.  This doesn't seem
>> ideal though as I'm going to end up fetching much more data (probably
>> around an order of magnitude more) from Cassandra than I really want.
>>
>> Is there a better way to model the data?
>>
>> thanks,
>>
>> Chris
>>
>>
>>
>>
>


Re: Large number of tombstones without delete or update

2016-03-24 Thread Ralf Steppacher
I can confirm that if I send JSON messages that always cover all schema fields 
the tombstone issue is not reported by Cassandra.
So, is there a way to work around this issue other than to always populate 
every column of the schema with every insert? That would be a pain in the 
backside, really.

Why would C* not warn about the excessive number of tombstones if queried from 
cqlsh?


Thanks!
Ralf



> On 23.03.2016, at 19:09, Robert Coli  wrote:
> 
> On Wed, Mar 23, 2016 at 9:50 AM, Ralf Steppacher  > wrote:
> How come I end up with that large a number of tombstones?
> 
> Are you inserting NULLs?
> 
> =Rob
>  



RE: DataModelling to query date range

2016-03-24 Thread Peer, Oded
You can change the table to support multi-column slice restrictions:

CREATE TABLE routes (
start text,
end text,
year int,
month int,
day int,
PRIMARY KEY (start, end, year, month, day)
);

Then using Multi-column slice restrictions you can query:

SELECT * from routes where start = 'New York' and end = 'Washington' and 
(year,month,day) >= (2016,1,1) and (year,month,day) <= (2016,1,31);

For more details about Multi-column slice restrictions read 
http://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
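
If one row is written per day of validity (as in the other branch of this
thread), the single-day case Chris described would presumably become a plain
tuple equality:

SELECT * from routes where start = 'New York' and end = 'Washington' and
(year,month,day) = (2016,1,15);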

Oded

From: Chris Martin [mailto:ch...@cmartinit.co.uk]
Sent: Thursday, March 24, 2016 9:40 AM
To: user@cassandra.apache.org
Subject: Re: DataModelling to query date range

Ah- that looks interesting!  I'm actaully still on cassandra 2.x but I was 
planning on updgrading anyway.  Once I do so I'll check this one out.


Chris


On Thu, Mar 24, 2016 at 2:57 AM, Henry M 
mailto:henrymanm...@gmail.com>> wrote:
I haven't tried the new SASI indexer but it may help: 
https://github.com/apache/cassandra/blob/trunk/doc/SASI.md


On Wed, Mar 23, 2016 at 2:08 PM, Chris Martin 
mailto:ch...@cmartinit.co.uk>> wrote:
Hi all,

I have a table that represents a train timetable and looks a bit like this:


CREATE TABLE routes (
start text,
end text,
validFrom timestamp,
validTo timestamp,
PRIMARY KEY (start, end, validFrom, validTo)
);

In this case validFrom is the date that the route becomes valid and validTo is 
the date that the route that stops becoming valid.

If this was SQL I could write a query to find all valid routes between New York 
and Washington from Jan 1st 2016 to Jan 31st 2016 using something like:

SELECT * from routes where start = 'New York' and end = 'Washington' and 
validFrom <= 2016-01-31 and validTo >= 2016-01-01.

As far as I can tell such a query is impossible with CQL and my current table 
structure.  I'm considering running a query like:

SELECT * from routes where start = 'New York' and end = 'Washington' and 
validFrom <= 2016-01-31
And then filtering the rest of the data app side.  This doesn't seem ideal 
though as I'm going to end up fetching much more data (probably around an order 
of magnitude more) from Cassandra than I really want.

Is there a better way to model the data?

thanks,

Chris









Re: Large number of tombstones without delete or update

2016-03-24 Thread Ralf Steppacher
I did some more tests with my particular schema/message structure:

A null text field inside a UDT instance does NOT yield tombstones.
A null map does NOT yield tombstones.
A null text field does yield tombstones.


Ralf

> On 24.03.2016, at 09:42, Ralf Steppacher  wrote:
> 
> I can confirm that if I send JSON messages that always cover all schema 
> fields the tombstone issue is not reported by Cassandra.
> So, is there a way to work around this issue other than to always populate 
> every column of the schema with every insert? That would be a pain in the 
> backside, really.
> 
> Why would C* not warn about the excessive number of tombstones if queried 
> from cqlsh?
> 
> 
> Thanks!
> Ralf
> 
> 
> 
>> On 23.03.2016, at 19:09, Robert Coli > > wrote:
>> 
>> On Wed, Mar 23, 2016 at 9:50 AM, Ralf Steppacher > > wrote:
>> How come I end up with that large a number of tombstones?
>> 
>> Are you inserting NULLs?
>> 
>> =Rob
>>  
> 



RE: Large number of tombstones without delete or update

2016-03-24 Thread Peer, Oded
http://www.datastax.com/dev/blog/datastax-java-driver-3-0-0-released#unset-values

"For Protocol V3 or below, all variables in a statement must be bound. With 
Protocol V4, variables can be left "unset", in which case they will be ignored 
server-side (no tombstones will be generated)."


From: Ralf Steppacher [mailto:ralf.viva...@gmail.com]
Sent: Thursday, March 24, 2016 11:19 AM
To: user@cassandra.apache.org
Subject: Re: Large number of tombstones without delete or update

I did some more tests with my particular schema/message structure:

A null text field inside a UDT instance does NOT yield tombstones.
A null map does NOT yield tombstones.
A null text field does yield tombstones.


Ralf

On 24.03.2016, at 09:42, Ralf Steppacher 
mailto:ralf.viva...@gmail.com>> wrote:

I can confirm that if I send JSON messages that always cover all schema fields 
the tombstone issue is not reported by Cassandra.
So, is there a way to work around this issue other than to always populate 
every column of the schema with every insert? That would be a pain in the 
backside, really.

Why would C* not warn about the excessive number of tombstones if queried from 
cqlsh?


Thanks!
Ralf



On 23.03.2016, at 19:09, Robert Coli 
mailto:rc...@eventbrite.com>> wrote:

On Wed, Mar 23, 2016 at 9:50 AM, Ralf Steppacher 
mailto:ralf.viva...@gmail.com>> wrote:
How come I end up with that large a number of tombstones?

Are you inserting NULLs?

=Rob





Re: Large number of tombstones without delete or update

2016-03-24 Thread Ralf Steppacher
How does this improvement apply to inserting JSON? The prepared statement has 
exactly one parameter and it is always bound to the JSON message:

INSERT INTO event_by_patient_timestamp JSON ?

How would I “unset” a field inside the JSON message written to the 
event_by_patient_timestamp table?


Ralf


> On 24.03.2016, at 10:22, Peer, Oded  wrote:
> 
> http://www.datastax.com/dev/blog/datastax-java-driver-3-0-0-released#unset-values
>  
> 
>  
> “For Protocol V3 or below, all variables in a statement must be bound. With 
> Protocol V4, variables can be left “unset”, in which case they will be 
> ignored server-side (no tombstones will be generated).”
>  
>  
> From: Ralf Steppacher [mailto:ralf.viva...@gmail.com] 
> Sent: Thursday, March 24, 2016 11:19 AM
> To: user@cassandra.apache.org
> Subject: Re: Large number of tombstones without delete or update
>  
> I did some more tests with my particular schema/message structure:
>  
> A null text field inside a UDT instance does NOT yield tombstones.
> A null map does NOT yield tombstones.
> A null text field does yield tombstones.
>  
>  
> Ralf
>  
> On 24.03.2016, at 09:42, Ralf Steppacher  > wrote:
>  
> I can confirm that if I send JSON messages that always cover all schema 
> fields the tombstone issue is not reported by Cassandra.
> So, is there a way to work around this issue other than to always populate 
> every column of the schema with every insert? That would be a pain in the 
> backside, really.
>  
> Why would C* not warn about the excessive number of tombstones if queried 
> from cqlsh?
>  
>  
> Thanks!
> Ralf
>  
>  
>  
> On 23.03.2016, at 19:09, Robert Coli  > wrote:
>  
> On Wed, Mar 23, 2016 at 9:50 AM, Ralf Steppacher  > wrote:
> How come I end up with that large a number of tombstones?
>  
> Are you inserting NULLs?
>  
> =Rob



Re: Large number of tombstones without delete or update

2016-03-24 Thread Jean Tremblay
Ralf,

Are you using protocol V4?
How do you measure if a tombstone was generated?

Thanks

Jean

On 24 Mar 2016, at 10:35 , Ralf Steppacher 
mailto:ralf.viva...@gmail.com>> wrote:

How does this improvement apply to inserting JSON? The prepared statement has 
exactly one parameter and it is always bound to the JSON message:

INSERT INTO event_by_patient_timestamp JSON ?

How would I “unset” a field inside the JSON message written to the 
event_by_patient_timestamp table?


Ralf


On 24.03.2016, at 10:22, Peer, Oded 
mailto:oded.p...@rsa.com>> wrote:

http://www.datastax.com/dev/blog/datastax-java-driver-3-0-0-released#unset-values

“For Protocol V3 or below, all variables in a statement must be bound. With 
Protocol V4, variables can be left “unset”, in which case they will be ignored 
server-side (no tombstones will be generated).”


From: Ralf Steppacher [mailto:ralf.viva...@gmail.com]
Sent: Thursday, March 24, 2016 11:19 AM
To: user@cassandra.apache.org
Subject: Re: Large number of tombstones without delete or update

I did some more tests with my particular schema/message structure:

A null text field inside a UDT instance does NOT yield tombstones.
A null map does NOT yield tombstones.
A null text field does yield tombstones.


Ralf

On 24.03.2016, at 09:42, Ralf Steppacher 
mailto:ralf.viva...@gmail.com>> wrote:

I can confirm that if I send JSON messages that always cover all schema fields 
the tombstone issue is not reported by Cassandra.
So, is there a way to work around this issue other than to always populate 
every column of the schema with every insert? That would be a pain in the 
backside, really.

Why would C* not warn about the excessive number of tombstones if queried from 
cqlsh?


Thanks!
Ralf



On 23.03.2016, at 19:09, Robert Coli 
mailto:rc...@eventbrite.com>> wrote:

On Wed, Mar 23, 2016 at 9:50 AM, Ralf Steppacher 
mailto:ralf.viva...@gmail.com>> wrote:
How come I end up with that large a number of tombstones?

Are you inserting NULLs?

=Rob




Re: Large number of tombstones without delete or update

2016-03-24 Thread Ralf Steppacher
Jean,

yes, I am using the native protocol v4 (auto-negotiated between java driver 
3.0.0 and C* 2.2.4, verified by logging 
cluster.getConfiguration().getProtocolOptions().getProtocolVersion() ).

My first approach for testing for tombstones was “brute force”. Add records and 
soon enough (after about 2000 records were inserted) a query for the row count 
would yield warnings in the C* log:

WARN  [SharedPool-Worker-2] 2016-03-23 16:54:43,134 SliceQueryFilter.java:307 - 
Read 2410 live and 4820 tombstone cells in 
event_log_system.event_by_patient_timestamp for key: 100013866046895035 (see 
tombstone_warn_threshold). 5000 columns were requested, slices=[2040-06-02 
05\:57+0200:!-]



I then added query trace logging for the count query. I dropped the whole 
keyspace, inserted a single JSON message and then issued the count query:

Host (queried): velcassandra/10.211.55.8:9042
Host (tried): velcassandra/10.211.55.8:9042
Trace id: de69df90-f1a0-11e5-a558-f3993541f01b

activity                                                                   | timestamp    | source     | source_elapsed
---------------------------------------------------------------------------+--------------+------------+---------------
Executing single-partition query on event_by_patient_timestamp            | 10:14:53.324 | /127.0.0.1 |           2815
Acquiring sstable references                                               | 10:14:53.324 | /127.0.0.1 |           3036
Merging memtable tombstones                                                | 10:14:53.324 | /127.0.0.1 |           3097
Skipped 0/0 non-slice-intersecting sstables, included 0 due to tombstones  | 10:14:53.324 | /127.0.0.1 |           3153
Merging data from memtables and 0 sstables                                 | 10:14:53.324 | /127.0.0.1 |           3170
Read 1 live and 0 tombstone cells                                          | 10:14:53.324 | /127.0.0.1 |           3202

Note the last line states that 0 tombstone cells were read. That was after I 
made sure I had no null text fields in the JSON message. With missing/null JSON 
fields (mapped to columns of type text) the trace always reported >= 1 read 
tombstone cells.

I used the version yielding 0 read tombstones again in the brute force test and 
Cassandra never logged any warnings.
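
For reference, the equivalent check from cqlsh would be roughly the following
(the partition key column name below is a guess; the key value is the one from
the warning above):

TRACING ON;
SELECT count(*) FROM event_log_system.event_by_patient_timestamp
WHERE patient_id = 100013866046895035; -- "patient_id" is a guess at the key column name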

Is this a valid test?


Ralf


> On 24.03.2016, at 10:46, Jean Tremblay  
> wrote:
> 
> Ralf,
> 
> Are you using protocol V4?
> How do you measure if a tombstone was generated?
> 
> Thanks
> 
> Jean



RE: Large number of tombstones without delete or update

2016-03-24 Thread Peer, Oded
You are right, I missed the JSON part.
According to the docs,
“Columns which are omitted from the JSON value map are treated as a null insert 
(which results in an existing value being deleted, if one is present).”
So “unset” doesn’t help you out.
You can open a Jira ticket asking for “unset” support with JSON values and
omitted columns, so you can control whether omitted columns get a “null” value
or an “unset” value.
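
To illustrate (using the event_by_patient_timestamp table from the quoted
message below; the column names are made up, only the omission behaviour
matters): the second statement leaves "note" out of the JSON map, so it is
written as an explicit null and produces a tombstone for that column.

INSERT INTO event_by_patient_timestamp JSON '{"patient_id": 42, "ts": "2016-03-24 10:00:00", "note": "x"}';
INSERT INTO event_by_patient_timestamp JSON '{"patient_id": 42, "ts": "2016-03-24 10:05:00"}';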




From: Ralf Steppacher [mailto:ralf.viva...@gmail.com]
Sent: Thursday, March 24, 2016 11:36 AM
To: user@cassandra.apache.org
Subject: Re: Large number of tombstones without delete or update

How does this improvement apply to inserting JSON? The prepared statement has 
exactly one parameter and it is always bound to the JSON message:

INSERT INTO event_by_patient_timestamp JSON ?

How would I “unset” a field inside the JSON message written to the 
event_by_patient_timestamp table?


Ralf


On 24.03.2016, at 10:22, Peer, Oded 
mailto:oded.p...@rsa.com>> wrote:

http://www.datastax.com/dev/blog/datastax-java-driver-3-0-0-released#unset-values

“For Protocol V3 or below, all variables in a statement must be bound. With 
Protocol V4, variables can be left “unset”, in which case they will be ignored 
server-side (no tombstones will be generated).”


From: Ralf Steppacher [mailto:ralf.viva...@gmail.com]
Sent: Thursday, March 24, 2016 11:19 AM
To: user@cassandra.apache.org
Subject: Re: Large number of tombstones without delete or update

I did some more tests with my particular schema/message structure:

A null text field inside a UDT instance does NOT yield tombstones.
A null map does NOT yield tombstones.
A null text field does yield tombstones.


Ralf

On 24.03.2016, at 09:42, Ralf Steppacher 
mailto:ralf.viva...@gmail.com>> wrote:

I can confirm that if I send JSON messages that always cover all schema fields 
the tombstone issue is not reported by Cassandra.
So, is there a way to work around this issue other than to always populate 
every column of the schema with every insert? That would be a pain in the 
backside, really.

Why would C* not warn about the excessive number of tombstones if queried from 
cqlsh?


Thanks!
Ralf



On 23.03.2016, at 19:09, Robert Coli 
mailto:rc...@eventbrite.com>> wrote:

On Wed, Mar 23, 2016 at 9:50 AM, Ralf Steppacher 
mailto:ralf.viva...@gmail.com>> wrote:
How come I end up with that large a number of tombstones?

Are you inserting NULLs?

=Rob



Re: Large number of tombstones without delete or update

2016-03-24 Thread Jean Tremblay
Ralf,

Thank YOU very much Ralf. You are the first one who could finally shed some
light on something I had observed but could not put my finger on: what exactly
is causing my tombstones.
I cannot judge your method for evaluating the number of tombstones. It seems
valid to me.

Jean
On 24 Mar 2016, at 11:10 , Ralf Steppacher 
mailto:ralf.viva...@gmail.com>> wrote:

Jean,

yes, I am using the native protocol v4 (auto-negotiated between java driver 
3.0.0 and C* 2.2.4, verified by logging 
cluster.getConfiguration().getProtocolOptions().getProtocolVersion() ).

My first approach for testing for tombstones was “brute force”. Add records and 
soon enough (after about 2000 records were inserted) a query for the row count 
would yield warnings in the C* log:

WARN  [SharedPool-Worker-2] 2016-03-23 16:54:43,134 SliceQueryFilter.java:307 - 
Read 2410 live and 4820 tombstone cells in 
event_log_system.event_by_patient_timestamp for key: 100013866046895035 (see 
tombstone_warn_threshold). 5000 columns were requested, slices=[2040-06-02 
05\:57+0200:!-]



I then added query trace logging for the count query. I dropped the whole 
keyspace, inserted a single JSON message and then issued the count query:

Host (queried): velcassandra/10.211.55.8:9042
Host (tried): velcassandra/10.211.55.8:9042
Trace id: de69df90-f1a0-11e5-a558-f3993541f01b

activity                                                                   | timestamp    | source     | source_elapsed
---------------------------------------------------------------------------+--------------+------------+---------------
Executing single-partition query on event_by_patient_timestamp            | 10:14:53.324 | /127.0.0.1 |           2815
Acquiring sstable references                                               | 10:14:53.324 | /127.0.0.1 |           3036
Merging memtable tombstones                                                | 10:14:53.324 | /127.0.0.1 |           3097
Skipped 0/0 non-slice-intersecting sstables, included 0 due to tombstones  | 10:14:53.324 | /127.0.0.1 |           3153
Merging data from memtables and 0 sstables                                 | 10:14:53.324 | /127.0.0.1 |           3170
Read 1 live and 0 tombstone cells                                          | 10:14:53.324 | /127.0.0.1 |           3202

Note the last line states that 0 tombstone cells were read. That was after I 
made sure I had no null text fields in the JSON message. With missing/null JSON 
fields (mapped to columns of type text) the trace always reported >= 1 read 
tombstone cells.

I used the version yielding 0 read tombstones again in the brute force test and 
Cassandra never logged any warnings.

Is this a valid test?


Ralf


On 24.03.2016, at 10:46, Jean Tremblay 
mailto:jean.tremb...@zen-innovations.com>> 
wrote:

Ralf,

Are you using protocol V4?
How do you measure if a tombstone was generated?

Thanks

Jean




Re: Large number of tombstones without delete or update

2016-03-24 Thread Ralf Steppacher
Done: https://issues.apache.org/jira/browse/CASSANDRA-11424 


Thanks!
Ralf

> On 24.03.2016, at 11:17, Peer, Oded  wrote:
> 
> You are right, I missed the JSON part.
> According to the docs 
>  
> “Columns which are omitted from the JSON value map are treated as a null 
> insert (which results in an existing value being deleted, if one is present).”
> So “unset” doesn’t help you out.
> You can open a Jira ticket asking for “unset” support with JSON values and 
> omitted columns so you can control is omitted columns have a “null” value or 
> an “unset” value.
>  
>  
>  
>  
> From: Ralf Steppacher [mailto:ralf.viva...@gmail.com] 
> Sent: Thursday, March 24, 2016 11:36 AM
> To: user@cassandra.apache.org
> Subject: Re: Large number of tombstones without delete or update
>  
> How does this improvement apply to inserting JSON? The prepared statement has 
> exactly one parameter and it is always bound to the JSON message:
>  
> INSERT INTO event_by_patient_timestamp JSON ?
>  
> How would I “unset” a field inside the JSON message written to the 
> event_by_patient_timestamp table?
>  
>  
> Ralf
>  
>  
> On 24.03.2016, at 10:22, Peer, Oded  > wrote:
>  
> http://www.datastax.com/dev/blog/datastax-java-driver-3-0-0-released#unset-values
>  
> 
>  
> “For Protocol V3 or below, all variables in a statement must be bound. With 
> Protocol V4, variables can be left “unset”, in which case they will be 
> ignored server-side (no tombstones will be generated).”
>  
>  
> From: Ralf Steppacher [mailto:ralf.viva...@gmail.com 
> ] 
> Sent: Thursday, March 24, 2016 11:19 AM
> To: user@cassandra.apache.org 
> Subject: Re: Large number of tombstones without delete or update
>  
> I did some more tests with my particular schema/message structure:
>  
> A null text field inside a UDT instance does NOT yield tombstones.
> A null map does NOT yield tombstones.
> A null text field does yield tombstones.
>  
>  
> Ralf
>  
> On 24.03.2016, at 09:42, Ralf Steppacher  > wrote:
>  
> I can confirm that if I send JSON messages that always cover all schema 
> fields the tombstone issue is not reported by Cassandra.
> So, is there a way to work around this issue other than to always populate 
> every column of the schema with every insert? That would be a pain in the 
> backside, really.
>  
> Why would C* not warn about the excessive number of tombstones if queried 
> from cqlsh?
>  
>  
> Thanks!
> Ralf
>  
>  
>  
> On 23.03.2016, at 19:09, Robert Coli  > wrote:
>  
> On Wed, Mar 23, 2016 at 9:50 AM, Ralf Steppacher  > wrote:
> How come I end up with that large a number of tombstones?
>  
> Are you inserting NULLs?
>  
> =Rob



StatusLogger output

2016-03-24 Thread Vasileios Vlachos
Hello,

Environment:
- Cassandra 2.0.17, 8 nodes, 4 per DC
- Ubuntu 12.04, 6-Cores, 16GB of RAM (we use VMWare)

Every node seems to be dropping messages (anywhere from 10 to 300) twice a
day. I don't know if this has always been the case, but it has definitely been
going on for the past month or so. Whenever that happens we get
StatusLogger.java output in the log, which is the state of the node at the
time it dropped messages. This output contains information
similar/identical to nodetool tpstats, but beyond that, information
regarding the system CFs follows, as can be seen here: http://ur1.ca/ooan6

How can we use this information to find out what the problem was? I am
specifically referring to the information regarding the system CF. I had a
look in the system tables but I cannot draw anything from that. The output
in the log seems to contain two values (comma separated). What are these
numbers?

I wasn't able to find anything on the web/DataStax docs. Any help would be
greatly appreciated!

Thanks,
Vasilis


Re: StatusLogger output

2016-03-24 Thread Vasileios Vlachos
Just to clarify, I can see line 29, which seems to explain the format (first
number is ops, second is data), however I don't know what they actually mean.

Thanks,
Vasilis

On Thu, Mar 24, 2016 at 11:45 AM, Vasileios Vlachos <
vasileiosvlac...@gmail.com> wrote:

> Hello,
>
> Environment:
> - Cassandra 2.0.17, 8 nodes, 4 per DC
> - Ubuntu 12.04, 6-Cores, 16GB of RAM (we use VMWare)
>
> Every node seems to be dropping messages (anywhere from 10 to 300) twice a
> day. I don't know it this has always been the case, but has definitely been
> going for the past month or so. Whenever that happens we get
> StatusLogger.java output in the log, which is the state of the node at
> the time it dropped messages. This output contains information
> similar/identical to nodetool tpstats, but further from that, information
> regarding system CF follows as can be seen here: http://ur1.ca/ooan6
>
> How can we use this information to find out what the problem was? I am
> specifically referring to the information regarding the system CF. I had a
> look in the system tables but I cannot draw anything from that. The output
> in the log seems to contain two values (comma separated). What are these
> numbers?
>
> I wasn't able to find anything on the web/DataStax docs. Any help would be
> greatly appreciated!
>
> Thanks,
> Vasilis
>


Re: DataModelling to query date range

2016-03-24 Thread Vidur Malik
Hi Chris,

I had something slightly different in mind. You would treat it as time
series data, and have one record for each of the days the route was valid.
In your case:
start    | end        | valid
New York | Washington | 2016-01-01
New York | Washington | 2016-01-02
New York | Washington | ...
New York | Washington | ...
New York | Washington | 2016-01-31

Now, your queries will work, I imagine. Again, this may look wasteful, but
the whole philosophy behind Cassandra is that data duplication is all good.
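
As a sketch of what this implies (against the table definition quoted below),
the application would generate one INSERT per day the route is valid, and the
single-day lookup then becomes an equality:

INSERT INTO routes (start, end, valid) VALUES ('New York', 'Washington', '2016-01-01');
INSERT INTO routes (start, end, valid) VALUES ('New York', 'Washington', '2016-01-02');
...
INSERT INTO routes (start, end, valid) VALUES ('New York', 'Washington', '2016-01-31');

SELECT * from routes where start = 'New York' and end = 'Washington' and
valid = '2016-01-15';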

On Thursday, 24 March 2016, Chris Martin  wrote:

> Hi Vidur,
>
> I had a go at your solution but the problem is that it doesn't match
> routes which are valid all throughtout the range queried.  For example if I
> have  route that is valid for all of Jan 2016. I will have a table that
> looks something like this:
>
> start    | end        | valid
> New York | Washington | 2016-01-01
> New York | Washington | 2016-01-31
>
> So if I query for ranges that have at least one bound outside Jan (e.g Jan
> 15 - Feb 15) then the query you gave will work fine.  If, however, I query
> for a range that is completely inside Jan e.g all routes valid on Jan 15th,
>  The I think I'll end up with a query like:
>
> SELECT * from routes where start = 'New York' and end = 'Washington' and 
> valid <= 2016-01-15 and valid >= 2016-01-15.
>
> which will return 0 results as it would only match routes that have a
> valid of 2016-01-15 exactly.
>
>  thanks,
>
> Chris
>
>
> On Wed, Mar 23, 2016 at 11:19 PM, Vidur Malik  > wrote:
>
>> Flip the problem over. Instead of storing validTo and validFrom, simply
>> store a valid field and partition by (start, end). This may sound wasteful,
>> but disk is cheap:
>>
>> CREATE TABLE routes (
>> start text,
>> end text,
>> valid timestamp,
>> PRIMARY KEY ((start, end), valid)
>> );
>>
>> Now, you can execute something like:
>>
>> SELECT * from routes where start = 'New York' and end = 'Washington' and 
>> valid <= 2016-01-31 and valid >= 2016-01-01.
>>
>>
>> On Wed, Mar 23, 2016 at 5:08 PM, Chris Martin > > wrote:
>>
>>> Hi all,
>>>
>>> I have a table that represents a train timetable and looks a bit like
>>> this:
>>>
>>> CREATE TABLE routes (
>>> start text,
>>> end text,
>>> validFrom timestamp,
>>> validTo timestamp,
>>> PRIMARY KEY (start, end, validFrom, validTo)
>>> );
>>>
>>> In this case validFrom is the date that the route becomes valid and
>>> validTo is the date that the route that stops becoming valid.
>>>
>>> If this was SQL I could write a query to find all valid routes between
>>> New York and Washington from Jan 1st 2016 to Jan 31st 2016 using something
>>> like:
>>>
>>> SELECT * from routes where start = 'New York' and end = 'Washington' and 
>>> validFrom <= 2016-01-31 and validTo >= 2016-01-01.
>>>
>>> As far as I can tell such a query is impossible with CQL and my current
>>> table structure.  I'm considering running a query like:
>>>
>>> SELECT * from routes where start = 'New York' and end = 'Washington' and 
>>> validFrom <= 2016-01-31
>>>
>>> And then filtering the rest of the data app side.  This doesn't seem
>>> ideal though as I'm going to end up fetching much more data (probably
>>> around an order of magnitude more) from Cassandra than I really want.
>>>
>>> Is there a better way to model the data?
>>>
>>> thanks,
>>>
>>> Chris
>>>
>>>
>>>
>>>
>>
>>
>> --
>>
>> Vidur Malik
>>
>>
>
>

-- 

Vidur Malik



Is this type of counter table definition valid?

2016-03-24 Thread K. Lawson
I want to create a table with wide partitions (or, put another way, a table
which has no value columns (non primary key columns)) that enables the
number of rows in any of its partitions to be efficiently procured. Here is
a simple definition of such a table


CREATE TABLE IF NOT EXISTS test_table
> (
>     partitionKeyCol      timestamp,
>     clusteringCol        timeuuid,
>     partitionRowCountCol counter static,
>     PRIMARY KEY (partitionKeyCol, clusteringCol)
> )


The problem with this definition, and others structured like it, is that
their validity cannot be clearly deduced from the information contained in
the docs.


*What the docs do state* (with regards to counters):

   - A counter column can neither be specified as part of a table's PRIMARY
   KEY, nor used to create an INDEX
   - A counter column can only be defined in a dedicated counter table
   (which I take to be a table which solely has counter columns defined as its
   value columns)


*What the docs do not state* (with regards to counters):

   - The ability of a table to have a static counter column defined for it
   (given the unique write path of counters, I feel that this is worth
   mentioning)
   - The ability of a table, which has zero value columns defined for it
   (making it a dedicated counter table, given my understanding of the term),
   to also have a static counter column defined for it

Given the information on this subject that is present in (and absent from)
the docs, such a definition appears to be valid. However, I'm not sure how
that is possible, given that the updates to partitionRowCountCol would
require use of a write path different from that used to insert
(partitionKeyCol, clusteringCol) tuples.

Is this type of counter table definition valid? If so, how are writes to
the table carried out?


Re: Counter values become under-counted when running repair.

2016-03-24 Thread Jack Krupansky
What CL do you read and write with?

Normally, RF=2 is not recommended since it doesn't give you HA within a
data center - there is no way to achieve quorum in the data center if a
node goes down.

I suppose you can achieve a quorum if your request is spread across all
three data centers, but normally apps try to issue requests to a local data
center for performance. Having to ping all data centers on all requests to
achieve a quorum seems a bit excessive.
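
Concretely, with RF=2 in each of three data centers:

QUORUM       = (2 + 2 + 2) / 2 + 1 = 4 replicas, which must span at least two DCs
LOCAL_QUORUM = 2 / 2 + 1           = 2 replicas, i.e. every replica in the local DC must be up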

Can you advise us on your thinking when you selected RF=2?


-- Jack Krupansky

On Thu, Mar 24, 2016 at 2:17 AM, Dikang Gu  wrote:

> Hello there,
>
> We are experimenting Counters in Cassandra 2.2.5. Our setup is that we
> have 6 nodes, across three different regions, and in each region, the
> replication factor is 2. Basically, each nodes holds a full copy of the
> data.
>
> When are doing 30k/s counter increment/decrement per node, and at the
> meanwhile, we are double writing to our mysql tier, so that we can measure
> the accuracy of C* counter, compared to mysql.
>
> The experiment result was great at the beginning, the counter value in C*
> and mysql are very close. The difference is less than 0.1%.
>
> But when we start to run the repair on one node, the counter value in C*
> become much less than the value in mysql,  the difference becomes larger
> than 1%.
>
> My question is that is it a known problem that the counter value will
> become under-counted if repair is running? Should we avoid running repair
> for counter tables?
>
> Thanks.
>
> --
> Dikang
>
>


Re: Query regarding CassandraJavaRDD while running spark job on cassandra

2016-03-24 Thread Kai Wang
I suggest you post this to spark-cassandra-connector list.

On Sat, Mar 12, 2016 at 12:52 AM, Siddharth Verma <
verma.siddha...@snapdeal.com> wrote:

> In cassandra I have a table with the following schema.
>
> CREATE TABLE my_keyspace.my_table1 (
> col_1 text,
> col_2 text,
> col_3 text,
> col_4 text,
> col_5 text,
> col_6 text,
> col_7 text,
> PRIMARY KEY (col_1, col_2, col_3)
> ) WITH CLUSTERING ORDER BY (col_2 ASC, col_3 ASC);
>
> For processing I create a spark job.
>
> CassandraJavaRDD data1 =
> function.cassandraTable("my_keyspace", "my_table1")
>
>
> 1. Does it guarantee mutual exclusivity of fetched rows across all RDDs
> which are on worker nodes?
> (At the cost of redundancy and verbosity, I will reiterate.
> Suppose I have an entry in the table : ('1','2','3','4','5','6','7')
> What I mean to ask is, when I perform transformations/actions on data1
> RDD), can I be sure that the above entry will be present on ONLY ONE worker
> node?)
>
> 2. All the data pertaining to one partition will be on one node?
> (Suppose I have the following entries in the table :
> ('p1','c2_1','c3_1','4','5','6','7')
> ('p1','c2_2','c3'_2,'4','5','6','7')
> ('p1','c2_3','c3_3','4','5','6','7')
> ('p1','c2_4','c3_4','4','5','6','7')
> ('p1' )
> ('p1' )
> ('p1' )
> All the data for the same partition will be present on only one node?
> )
>
> 3. If i have a DC specifically for analytics, and I place the spark worker
> on the same machines as cassandra node, for that entire DC.
> Can I make sure that the spark worker fetches the data from the token
> range present on that node? (I.E. the node does't fetch data present on
> different node)
> 3.1 (as with the above statement which doesn't have a 'where' clause).
> 3.2 (as with the above statement which has a 'where' clause).
>


Re: Is this type of counter table definition valid?

2016-03-24 Thread DuyHai Doan
Just tested against C* 3.4

CREATE TABLE IF NOT EXISTS test_table (
  part timestamp,
  clust timestamp,
  count counter static,
PRIMARY KEY(part, clust));

and it just works.

"However, I'm not sure how that is possible, given that the updates to
partitionRowCountCol would require use of a write path different from that
used to insert (partitionKeyCol, clusteringCol) tuples."

--> INSERT is not allowed for counter columns, only UPDATE is possible. In
general, mutations on static columns only require the partition key, and it
makes no difference for static counter columns.
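
For example, against the table above the static counter is incremented with
only the partition key in the WHERE clause (the timestamp literal is just an
illustration):

UPDATE test_table SET count = count + 1 WHERE part = '2016-03-24';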




On Thu, Mar 24, 2016 at 2:10 PM, K. Lawson  wrote:

>
> I want to create a table with wide partitions (or, put another way, a
> table which has no value columns (non primary key columns)) that enables
> the number of rows in any of its partitions to be efficiently procured.
> Here is a simple definition of such a table
>
>
> CREATE TABLE IF NOT EXISTS test_table
>> (
>>     partitionKeyCol      timestamp,
>>     clusteringCol        timeuuid,
>>     partitionRowCountCol counter static,
>>     PRIMARY KEY (partitionKeyCol, clusteringCol)
>> )
>
>
> The problem with this definition, and others structured like it, is that
> their validity cannot be clearly deduced from the information contained in
> the docs.
>
>
> *What the docs do state* (with regards to counters):
>
>- A counter column can neither be specified as part of a table's
>PRIMARY KEY, nor used to create an INDEX
>- A counter column can only be defined in a dedicated counter table
>(which I take to be a table which solely has counter columns defined as its
>value columns)
>
>
> *What the docs do not state* (with regards to counters):
>
>- The ability of a table to have a static counter column defined for
>it (given the unique write path of counters, I feel that this is worth
>mentioning)
>- The ability of a table, which has zero value columns defined for it
>(making it a dedicated counter table, given my understanding of the term),
>to also have a static counter column defined for it
>
> Given the information on this subject that is present in (and absent from)
> the docs, such a definition appears to be valid. However, I'm not sure how
> that is possible, given that the updates to partitionRowCountCol would
> require use of a write path different from that used to insert
> (partitionKeyCol, clusteringCol) tuples.
>
> Is this type of counter table definition valid? If so, how are writes to
> the table carried out?
>


Client drivers

2016-03-24 Thread Rakesh Kumar
Is it possible to install multiple versions of language drivers on the
client machines? This would typically be useful during an upgrade
process, whereby falling back to the old version can be easy.

thanks.


RE: StatusLogger output

2016-03-24 Thread SEAN_R_DURITY
I am not sure the status logger output helps determine the problem. However, 
the dropped mutations and the status logger output are what I see when there is 
too high a load on one or more Cassandra nodes. It could be long GC pauses, 
something reading too much data (a large row or a multi-partition query), or 
just too many requests for the number of nodes you have. Are you using 
OpsCenter to monitor the rings? Do you have read or write spikes at the time? 
Any GC messages in the log? Any nodes going down at the time?


Sean Durity

From: Vasileios Vlachos [mailto:vasileiosvlac...@gmail.com]
Sent: Thursday, March 24, 2016 8:13 AM
To: user@cassandra.apache.org
Subject: Re: StatusLogger output

Just to clarify, I can see line 29 which seems to explain the format (first 
number ops, second is data), however I don't know they actually mean.

Thanks,
Vasilis

On Thu, Mar 24, 2016 at 11:45 AM, Vasileios Vlachos 
mailto:vasileiosvlac...@gmail.com>> wrote:
Hello,

Environment:
- Cassandra 2.0.17, 8 nodes, 4 per DC
- Ubuntu 12.04, 6-Cores, 16GB of RAM (we use VMWare)

Every node seems to be dropping messages (anywhere from 10 to 300) twice a day. 
I don't know it this has always been the case, but has definitely been going 
for the past month or so. Whenever that happens we get StatusLogger.java output 
in the log, which is the state of the node at the time it dropped messages. 
This output contains information similar/identical to nodetool tpstats, but 
further from that, information regarding system CF follows as can be seen here: 
http://ur1.ca/ooan6

How can we use this information to find out what the problem was? I am 
specifically referring to the information regarding the system CF. I had a look 
in the system tables but I cannot draw anything from that. The output in the 
log seems to contain two values (comma separated). What are these numbers?

I wasn't able to find anything on the web/DataStax docs. Any help would be 
greatly appreciated!

Thanks,
Vasilis






Re: Client drivers

2016-03-24 Thread Jonathan Haddad
Every language has a different means of working with dependencies.  Some
are compiled in (java, c), some are pulled in via libraries (python).
You'll have to be more specific.

On Thu, Mar 24, 2016 at 8:14 AM Rakesh Kumar 
wrote:

> Is it possible to install multiple versions of language drivers on the
> client machines. This will be typically useful during an upgrade
> process, where by fallback to the old version can be easy.
>
> thanks.
>


Re: Client drivers

2016-03-24 Thread Rakesh Kumar
> Every language has a different means of working with dependencies.  Some are
> compiled in (java, c), some are pulled in via libraries (python).  You'll
> have to be more specific.

I am interested mainly in C++ and Java.

Thanks.


Re: StatusLogger output

2016-03-24 Thread Vasileios Vlachos
Thanks for your help Sean,

The reason StatusLogger messages appear in the logs is usually, as you
said, a GC pause (ParNew or CMS, I have seen both), or dropped messages. In
our case dropped messages are always (so far) due to internal timeouts, not
due to cross node timeouts (like the sample output in the link I provided
earlier). I have seen StatusLogger output during low traffic times and I
cannot say that we seem to have more logs during high-traffic hours.

We use Nagios for monitoring and have several checks for cassandra (we use
the JMX console for each node). However, most graphs are averaged out. I
can see some spikes at those times; however, these spikes only reach around
20-30% of the load we get during high-traffic times. The only time we have
seen nodes marked down in the logs is when there is some severe cross-DC
VPN issue, which is not something that happens often and does not correlate
with StatusLogger output either.

Regarding GC, we only see up to 10 GC pauses per day in the logs (I of course
mean over 200ms, which is the default threshold for logging GC events).
We are actually experimenting with GC these days on one of the nodes, but I
cannot say this has made things worse/better.

I was hoping that by understanding the StatusLogger output better I'd be
able to investigate further. We monitor metrics like hints, pending tasks,
reads/writes per CF, read/write latency/CF, compactions, connections/node.
If there is anything from the JMX console that you would suggest I should
be monitoring, please let me know. I was thinking compactions may be the
reason for this (so, I/O could be the bottleneck), but looking at the
graphs I can see that when a node compacts its CPU usage would only max at
around 20-30% and would only add 2-5ms of read/write latency per CF (if
any).

Thanks,
Vasilis

On Thu, Mar 24, 2016 at 3:31 PM,  wrote:

> I am not sure the status logger output helps determine the problem.
> However, the dropped mutations and the status logger output is what I see
> when there is too high of a load on one or more Cassandra nodes. It could
> be long GC pauses, something reading too much data (a large row or a
> multi-partition query), or just too many requests for the number of nodes
> you have. Are you using OpsCenter to monitor the rings? Do you have read or
> write spikes at the time? Any GC messages in the log. Any nodes going down
> at the time?
>
>
>
>
>
> Sean Durity
>
>
>
> *From:* Vasileios Vlachos [mailto:vasileiosvlac...@gmail.com]
> *Sent:* Thursday, March 24, 2016 8:13 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: StatusLogger output
>
>
>
> Just to clarify, I can see line 29 which seems to explain the format
> (first number ops, second is data), however I don't know they actually
> mean.
>
>
>
> Thanks,
>
> Vasilis
>
>
>
> On Thu, Mar 24, 2016 at 11:45 AM, Vasileios Vlachos <
> vasileiosvlac...@gmail.com> wrote:
>
> Hello,
>
>
>
> Environment:
>
> - Cassandra 2.0.17, 8 nodes, 4 per DC
>
> - Ubuntu 12.04, 6-Cores, 16GB of RAM (we use VMWare)
>
>
>
> Every node seems to be dropping messages (anywhere from 10 to 300) twice a
> day. I don't know it this has always been the case, but has definitely been
> going for the past month or so. Whenever that happens we get
> StatusLogger.java output in the log, which is the state of the node at
> the time it dropped messages. This output contains information
> similar/identical to nodetool tpstats, but further from that, information
> regarding system CF follows as can be seen here: http://ur1.ca/ooan6
>
>
>
> How can we use this information to find out what the problem was? I am
> specifically referring to the information regarding the system CF. I had a
> look in the system tables but I cannot draw anything from that. The output
> in the log seems to contain two values (comma separated). What are these
> numbers?
>
>
>
> I wasn't able to find anything on the web/DataStax docs. Any help would be
> greatly appreciated!
>
>
>
> Thanks,
>
> Vasilis
>
>
>

Re: Counter values become under-counted when running repair.

2016-03-24 Thread Dikang Gu
@Jack, we write to 2 and read from 1.

I do not understand why RF=2 matters here; will it have an impact on the
repair? Can you please explain more?

I selected RF=2 in each region because:
1. both writes will be sent to the local region, so we do not need to wait for
responses across regions.
2. if one node has a problem in the local region, the read can still hit the
other one in the local region.

However, I can change the RF if it's really the cause of the under-counting.

Thanks
Dikang.


On Thu, Mar 24, 2016 at 7:17 AM, Jack Krupansky 
wrote:

> What CL do you read and write with?
>
> Normally, RF=2 is not recommended since it doesn't give you HA within a
> data center - there is no way to achieve quorum in the data center if a
> node goes down.
>
> I suppose you can achieve a quorum if your request is spread across all
> three data centers, but normally apps try to issue requests to a local data
> center for performance. Having to ping all data centers on all requests to
> achieve a quorum seems a bit excessive.
>
> Can you advise us on your thinking when you selected RF=2?
>
>
> -- Jack Krupansky
>
> On Thu, Mar 24, 2016 at 2:17 AM, Dikang Gu  wrote:
>
>> Hello there,
>>
>> We are experimenting Counters in Cassandra 2.2.5. Our setup is that we
>> have 6 nodes, across three different regions, and in each region, the
>> replication factor is 2. Basically, each nodes holds a full copy of the
>> data.
>>
>> When are doing 30k/s counter increment/decrement per node, and at the
>> meanwhile, we are double writing to our mysql tier, so that we can measure
>> the accuracy of C* counter, compared to mysql.
>>
>> The experiment result was great at the beginning, the counter value in C*
>> and mysql are very close. The difference is less than 0.1%.
>>
>> But when we start to run the repair on one node, the counter value in C*
>> become much less than the value in mysql,  the difference becomes larger
>> than 1%.
>>
>> My question is that is it a known problem that the counter value will
>> become under-counted if repair is running? Should we avoid running repair
>> for counter tables?
>>
>> Thanks.
>>
>> --
>> Dikang
>>
>>
>


-- 
Dikang


Re: Counter values become under-counted when running repair.

2016-03-24 Thread Robert Coli
On Thu, Mar 24, 2016 at 7:17 AM, Jack Krupansky 
wrote:

> Can you advise us on your thinking when you selected RF=2?
>

I figure he was probably thinking "I want to operate in a bunch of
different regions and don't need to use QUORUM for my use cases, and want
to save money by not storing 3 copies per DC" ... ?

If you don't need QUORUM and can tolerate cross-DC reads in the
local-range-unavailable (two nodes down at once) case, there is nothing
especially magical about RF=3...

=Rob


Re: Counter values become under-counted when running repair.

2016-03-24 Thread Aleksey Yeschenko
After repair is over, does the value settle? What CLs do you write to your 
counters with? What CLs are you reading with?

-- 
AY

On 24 March 2016 at 06:17:27, Dikang Gu (dikan...@gmail.com) wrote:

Hello there,  

We are experimenting Counters in Cassandra 2.2.5. Our setup is that we have  
6 nodes, across three different regions, and in each region, the  
replication factor is 2. Basically, each nodes holds a full copy of the  
data.  

When are doing 30k/s counter increment/decrement per node, and at the  
meanwhile, we are double writing to our mysql tier, so that we can measure  
the accuracy of C* counter, compared to mysql.  

The experiment result was great at the beginning, the counter value in C*  
and mysql are very close. The difference is less than 0.1%.  

But when we start to run the repair on one node, the counter value in C*  
become much less than the value in mysql, the difference becomes larger  
than 1%.  

My question is that is it a known problem that the counter value will  
become under-counted if repair is running? Should we avoid running repair  
for counter tables?  

Thanks.  

--  
Dikang  


Re: Counter values become under-counted when running repair.

2016-03-24 Thread Dikang Gu
@Aleksey, we are writing to cluster with CL = 2, and reading with CL = 1.
And overall we have 6 copies across 3 different regions. Do you have
comments about our setup?

During the repair, the counter value becomes inaccurate. We are still
playing with the repair and will keep you updated with more experiments. But do
you have any theory around that?

Thanks a lot!
Dikang.

On Thu, Mar 24, 2016 at 11:02 AM, Aleksey Yeschenko 
wrote:

> After repair is over, does the value settle? What CLs do you write to your
> counters with? What CLs are you reading with?
>
> --
> AY
>
> On 24 March 2016 at 06:17:27, Dikang Gu (dikan...@gmail.com) wrote:
>
> Hello there,
>
> We are experimenting Counters in Cassandra 2.2.5. Our setup is that we
> have
> 6 nodes, across three different regions, and in each region, the
> replication factor is 2. Basically, each nodes holds a full copy of the
> data.
>
> When are doing 30k/s counter increment/decrement per node, and at the
> meanwhile, we are double writing to our mysql tier, so that we can measure
> the accuracy of C* counter, compared to mysql.
>
> The experiment result was great at the beginning, the counter value in C*
> and mysql are very close. The difference is less than 0.1%.
>
> But when we start to run the repair on one node, the counter value in C*
> become much less than the value in mysql, the difference becomes larger
> than 1%.
>
> My question is that is it a known problem that the counter value will
> become under-counted if repair is running? Should we avoid running repair
> for counter tables?
>
> Thanks.
>
> --
> Dikang
>
>


-- 
Dikang


Re: Counter values become under-counted when running repair.

2016-03-24 Thread Aleksey Yeschenko
Best open a JIRA ticket and I’ll have a look at what could be the reason.

-- 
AY

On 24 March 2016 at 23:20:55, Dikang Gu (dikan...@gmail.com) wrote:

@Aleksey, we are writing to cluster with CL = 2, and reading with CL = 1.  
And overall we have 6 copies across 3 different regions. Do you have  
comments about our setup?  

During the repair, the counter value become inaccurate, we are still  
playing with the repair, will keep you update with more experiments. But do  
you have any theory around that?  

Thanks a lot!  
Dikang.  

On Thu, Mar 24, 2016 at 11:02 AM, Aleksey Yeschenko   
wrote:  

> After repair is over, does the value settle? What CLs do you write to your  
> counters with? What CLs are you reading with?  
>  
> --  
> AY  
>  
> On 24 March 2016 at 06:17:27, Dikang Gu (dikan...@gmail.com) wrote:  
>  
> Hello there,  
>  
> We are experimenting Counters in Cassandra 2.2.5. Our setup is that we  
> have  
> 6 nodes, across three different regions, and in each region, the  
> replication factor is 2. Basically, each nodes holds a full copy of the  
> data.  
>  
> When are doing 30k/s counter increment/decrement per node, and at the  
> meanwhile, we are double writing to our mysql tier, so that we can measure  
> the accuracy of C* counter, compared to mysql.  
>  
> The experiment result was great at the beginning, the counter value in C*  
> and mysql are very close. The difference is less than 0.1%.  
>  
> But when we start to run the repair on one node, the counter value in C*  
> become much less than the value in mysql, the difference becomes larger  
> than 1%.  
>  
> My question is that is it a known problem that the counter value will  
> become under-counted if repair is running? Should we avoid running repair  
> for counter tables?  
>  
> Thanks.  
>  
> --  
> Dikang  
>  
>  


--  
Dikang  


Data export with consistency problem

2016-03-24 Thread xutom
Hi all,
I have a C* cluster with five nodes and my cassandra version is 2.1.1 and 
we also enable "Hinted Handoff" . Everything is fine while we use C* cluster to 
store up to 10 billion rows of data. But now we have a problem. During our 
test, after we import up to 40 billion rows of data into C* cluster, we 
manually remove the network cable of one node(eg: there are 5 nodes, and we 
remove just one network cable of node to simulate minor network problem with C* 
cluster), then we  create another table and import 30 million into this table. 
Before we reconnect the network cable of that node, we export the data of the 
new table, we can export all 30 million rows many times. But after we reconnect 
the network cable, we export the data immediately and we cannot all the 30 
million rows of data. Maybe a fewer minutes later, after the C* cluster balance 
all the datas( my guess) , then we do the exporting , we could export all the 
30 million rows of data.
Is there something wrong with "Hinted Handoff"? Whille coping data from 
coordinator node to the newer incoming node, is the newer node can response the 
client`s request? Thanks in advances!

jerry


datastax java driver Batch vs BatchStatement

2016-03-24 Thread Jimmy Lin
Hi all,
What is the difference between datastax driver Batch and BatchStatement?

In particular, BatchStatement calls out that it needs native protocol
version 2 or above.
What is the advantage of using native protocol 2.0 for batch execution?

Is either of these two APIs smart enough to split a big batch into multiple
smaller ones?
(to avoid batch_size_warn_threshold_in_kb or
batch_size_failed_threshold_in_kb)

Thanks

Batch
https://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/querybuilder/Batch.html

BatchStatement
https://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/BatchStatement.html


Re: Counter values become under-counted when running repair.

2016-03-24 Thread Dikang Gu
@Aleksey, sure, here is the jira:
https://issues.apache.org/jira/browse/CASSANDRA-11432

Thanks!

On Thu, Mar 24, 2016 at 5:32 PM, Aleksey Yeschenko 
wrote:

> Best open a JIRA ticket and I’ll have a look at what could be the reason.
>
> --
> AY
>
> On 24 March 2016 at 23:20:55, Dikang Gu (dikan...@gmail.com) wrote:
>
> @Aleksey, we are writing to cluster with CL = 2, and reading with CL = 1.
> And overall we have 6 copies across 3 different regions. Do you have
> comments about our setup?
>
> During the repair, the counter value become inaccurate, we are still
> playing with the repair, will keep you update with more experiments. But
> do
> you have any theory around that?
>
> Thanks a lot!
> Dikang.
>
> On Thu, Mar 24, 2016 at 11:02 AM, Aleksey Yeschenko 
> wrote:
>
> > After repair is over, does the value settle? What CLs do you write to
> your
> > counters with? What CLs are you reading with?
> >
> > --
> > AY
> >
> > On 24 March 2016 at 06:17:27, Dikang Gu (dikan...@gmail.com) wrote:
> >
> > Hello there,
> >
> > We are experimenting Counters in Cassandra 2.2.5. Our setup is that we
> > have
> > 6 nodes, across three different regions, and in each region, the
> > replication factor is 2. Basically, each nodes holds a full copy of the
> > data.
> >
> > When are doing 30k/s counter increment/decrement per node, and at the
> > meanwhile, we are double writing to our mysql tier, so that we can
> measure
> > the accuracy of C* counter, compared to mysql.
> >
> > The experiment result was great at the beginning, the counter value in
> C*
> > and mysql are very close. The difference is less than 0.1%.
> >
> > But when we start to run the repair on one node, the counter value in C*
> > become much less than the value in mysql, the difference becomes larger
> > than 1%.
> >
> > My question is that is it a known problem that the counter value will
> > become under-counted if repair is running? Should we avoid running
> repair
> > for counter tables?
> >
> > Thanks.
> >
> > --
> > Dikang
> >
> >
>
>
> --
> Dikang
>
>


-- 
Dikang