Re: Inconsistent count(*) and distinct results from Cassandra

2015-03-12 Thread Rumph, Frens Jan
Hi Jens, Mikhail, Daemeon,

Thanks for your replies. Sorry for my reply being late ... mails from the
user-list were moved to the wrong inbox on my side.

I'm in a development environment and thus using replication factor = 1 and
consistency = ONE with three nodes. So the 'results from different nodes
between queries' hypothesis seems unlikely to me. I would expect a timeout
if some node weren't able to answer.

I tried tracing, but I couldn't really make anything of it.

For example, I performed two select distinct ... from ... queries: traces
for both of them contained more than one line like 'Submitting range
requests on ... ranges ...' and 'Submitted ... concurrent range requests
covering ... ranges'. These lines occurred with varying numbers, e.g.:

Submitting range requests on 593 ranges with a concurrency of 75 (1.35 rows
per range expected)
Submitting range requests on 769 ranges with a concurrency of 75 (1.35 rows
per range expected)


Also, when looking at the lines like 'Executing seq scan across ... sstables
for ...', I saw that in one case, which yielded far fewer partition keys, only
the tokens from -922337203685477  to -594461978511041000 were included. In a
case which yielded many more partition keys, the entire token range did seem
to be queried.
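
For reference, this is roughly how I captured the traces (a cqlsh sketch; 'ks'
stands in for my actual keyspace):

-- enable tracing in cqlsh, run the query; the trace is printed after the rows
TRACING ON;
SELECT DISTINCT id, bucket FROM ks.tbl;
TRACING OFF;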

To reiterate my initial questions: is this behavior to be expected? Am I
doing something wrong? Is there a workaround?

Best regards,
Frens Jan

On 4 March 2015 at 22:59, daemeon reiydelle  wrote:

> What is the replication? Could you be serving stale data from a node that
> was not properly replicated (hints timeout exceeded by a node being down?)
>
>
>
> On Wed, Mar 4, 2015 at 11:03 AM, Jens Rantil  wrote:
>
>> Frens,
>>
>> What consistency are you querying with? Could be you are simply receiving
>> results from different nodes each time.
>>
>> Jens
>>
>> –
>> Sent from Mailbox
>>
>>
>> On Wed, Mar 4, 2015 at 7:08 PM, Mikhail Strebkov 
>> wrote:
>>
>>> We have observed the same issue in our production Cassandra cluster (5
>>> nodes in one DC). We use Cassandra 2.1.3 (I joined the list too late to
>>> realize we shouldn’t use 2.1.x yet) on Amazon machines (created from a
>>> community AMI).
>>>
>>> In addition to count variations of 5 to 10%, we observe variations in the
>>> results of the query “select * from table1 where time > '$fromDate' and
>>> time < '$toDate' allow filtering”. We iterated through the results multiple
>>> times using the official Java driver. We used that query for a huge data
>>> migration and were unpleasantly surprised that it is unreliable. In our
>>> case “nodetool repair” didn’t fix the issue.
>>>
>>> So I echo Frens' questions.
>>>
>>> Thanks,
>>> Mikhail
>>>
>>>
>>>
>>>
>>> On Wed, Mar 4, 2015 at 3:55 AM, Rumph, Frens Jan 
>>> wrote:
>>>
 Hi,

 Is it to be expected that select count(*) from ... and select distinct
 partition-key-columns from ... yield inconsistent results between
 executions even though the table at hand isn't written to?

 I have a table in a keyspace with replication_factor = 1 which is
 something like:

  CREATE TABLE tbl (
 id frozen,
 bucket bigint,
 offset int,
 value double,
 PRIMARY KEY ((id, bucket), offset)
 )

 The frozen udt is:

  CREATE TYPE id_type (
 tags map
 );

 When I do select count(*) from tbl several times, the actual count
 varies by 5 to 10%. Also when performing select distinct id, bucket from
 tbl, the results aren't consistent over several query executions. The table
 was not being written to at the time I performed the queries.

 Is this to be expected? Or is this a bug? Is there an alternative method
 / workaround?

 I'm using cqlsh 5.0.1 with Cassandra 2.1.2 on 64bit fedora 21 with
 Oracle Java 1.8.0_31.

 Thanks in advance,
 Frens Jan

>>>
>>>
>>
>


Re: Inconsistent count(*) and distinct results from Cassandra

2015-03-12 Thread DuyHai Doan
First idea to eliminate any issue with regard to stale data: issue the
same count query with CL=QUORUM and check whether there are still
inconsistencies
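
Something like this in cqlsh (a sketch; 'ks' and 'tbl' are placeholders, and
QUORUM only makes a difference if the keyspace's replication factor is greater
than 1):

-- raise the read consistency for this cqlsh session, then re-run the count
CONSISTENCY QUORUM;
SELECT count(*) FROM ks.tbl;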

On Tue, Mar 10, 2015 at 9:13 AM, Rumph, Frens Jan  wrote:

> Hi Jens, Mikhail, Daemeon,
>
> Thanks for your replies. Sorry for my reply being late ... mails from the
> user-list were moved to the wrong inbox on my side.
>
> I'm in a development environment and thus using replication factor = 1 and
> consistency = ONE with three nodes. So the 'results from different nodes
> between queries' hypothesis seems unlikely to me. I would expect a timeout
> if some node weren't able to answer.
>
> I tried tracing, but I couldn't really make anything of it.
>
> For example, I performed two select distinct ... from ... queries: traces
> for both of them contained more than one line like 'Submitting range
> requests on ... ranges ...' and 'Submitted ... concurrent range requests
> covering ... ranges'. These lines occurred with varying numbers, e.g.:
>
> Submitting range requests on 593 ranges with a concurrency of 75 (1.35
> rows per range expected)
> Submitting range requests on 769 ranges with a concurrency of 75 (1.35
> rows per range expected)
>
>
> Also, when looking at the lines like 'Executing seq scan across ...
> sstables for ...', I saw that in one case, which yielded far fewer partition
> keys, only the tokens from -922337203685477  to -594461978511041000
> were included. In a case which yielded many more partition keys, the entire
> token range did seem to be queried.
>
> To reiterate my initial questions: is this behavior to be expected? Am I
> doing something wrong? Is there a workaround?
>
> Best regards,
> Frens Jan
>
> On 4 March 2015 at 22:59, daemeon reiydelle  wrote:
>
>> What is the replication? Could you be serving stale data from a node that
>> was not properly replicated (hints timeout exceeded by a node being down?)
>>
>>
>>
>> On Wed, Mar 4, 2015 at 11:03 AM, Jens Rantil  wrote:
>>
>>> Frens,
>>>
>>> What consistency are you querying with? Could be you are simply
>>> receiving results from different nodes each time.
>>>
>>> Jens
>>>
>>> –
>>> Sent from Mailbox
>>>
>>>
>>> On Wed, Mar 4, 2015 at 7:08 PM, Mikhail Strebkov 
>>> wrote:
>>>
 We have observed the same issue in our production Cassandra cluster (5
 nodes in one DC). We use Cassandra 2.1.3 (I joined the list too late to
 realize we shouldn’t use 2.1.x yet) on Amazon machines (created from a
 community AMI).

 In addition to count variations of 5 to 10%, we observe variations in the
 results of the query “select * from table1 where time > '$fromDate' and
 time < '$toDate' allow filtering”. We iterated through the results multiple
 times using the official Java driver. We used that query for a huge data
 migration and were unpleasantly surprised that it is unreliable. In our
 case “nodetool repair” didn’t fix the issue.

 So I echo Frens' questions.

 Thanks,
 Mikhail




 On Wed, Mar 4, 2015 at 3:55 AM, Rumph, Frens Jan 
 wrote:

> Hi,
>
> Is it to be expected that select count(*) from ... and select distinct
> partition-key-columns from ... yield inconsistent results between
> executions even though the table at hand isn't written to?
>
> I have a table in a keyspace with replication_factor = 1 which is
> something like:
>
>  CREATE TABLE tbl (
> id frozen,
> bucket bigint,
> offset int,
> value double,
> PRIMARY KEY ((id, bucket), offset)
> )
>
> The frozen udt is:
>
>  CREATE TYPE id_type (
> tags map
> );
>
> When I do select count(*) from tbl several times, the actual count
> varies by 5 to 10%. Also when performing select distinct id, bucket from
> tbl, the results aren't consistent over several query executions. The table
> was not being written to at the time I performed the queries.
>
> Is this to be expected? Or is this a bug? Is there an alternative
> method / workaround?
>
> I'm using cqlsh 5.0.1 with Cassandra 2.1.2 on 64bit fedora 21 with
> Oracle Java 1.8.0_31.
>
> Thanks in advance,
> Frens Jan
>


>>>
>>
>


Re: Unable to overwrite some rows

2015-03-12 Thread Guðmundur Örn Jóhannsson
That's it. The clock on one of the nodes was way off. Thanks!!

--
regards,
Gudmundur Johannsson


On Wed, Mar 11, 2015 at 3:42 PM, Roland Etzenhammer <
r.etzenham...@t-online.de> wrote:

> Hi,
>
> I think that your clocks are not in sync. Do you have ntp on all your
> nodes up and running with low offset? If not, set up ntp as a first probable
> solution. Cassandra relies on accurate clocks on all cluster nodes for its
> (internal) timestamps.
>
> Do you see any error while writing? Or just while reading?
>
> Cheers,
> Roland
>
>


DataStax Enterprise Amazon AMI Launch Error

2015-03-12 Thread Vanessa Gligor
I'm trying to launch a new instance of the DataStax AMI on an Amazon EC2
instance. I tried this in 2 different regions (us-east and eu-west), using
these AMIs: ami-ada2b6c4, ami-814ec2e8 (us-east) and ami-7f33cd08,
ami-b2212dc6 (eu-west).

I followed this documentation:
http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/install/installAMI.html

So this is what I've done so far:

1. I've created a new security group (with those specific ports - I cannot
upload the print screen because I have just created this account)

2. I've created a new key pair

3. I've launched the DataStax AMI with these configuration details:
--clustername cluster --totalnodes 4 --version enterprise --username
my_name --password my_password --searchnodes 2 (I have verified my
credentials - I can login here http://debian.datastax.com/enterprise/ )

4. After selecting the previously created security group & key pair I
launched the instance

5. I've connected to my DataStax Enterprise EC2 instance and this is the
displayed log:

Cluster started with these options: --clustername cluster --totalnodes 4
--version enterprise --username my_name --password  --searchnodes 2

03/12/15-08:59:23 Reflector: Received 1 of 2 responses from:
[u'172.31.34.171']... Exception seen in ds1_launcher.py. Please check
~/datastax_ami/ami.log for more info. Please visit 


and the ami.log shows these messages:


[INFO] 03/12/15-08:59:23 Reflector: Received 1 of 2 responses from:
[u'172.31.34.171']
[ERROR] EC2 is experiencing some issues and has not allocated all of
the resources in under 10 minutes.
Aborting the clustering of this reservation. Please try again.
[ERROR] Exception seen in ds1_launcher.py:
Traceback (most recent call last):
File "/home/ubuntu/datastax_ami/ds1_launcher.py", line 22, in
initial_configurations
ds2_configure.run()
 File "/home/ubuntu/datastax_ami/ds2_configure.py", line 1135, in run
File "/home/ubuntu/datastax_ami/ds2_configure.py", line 57, in exit_path
AttributeError: EC2 is experiencing some issues and has not allocated
all of the resources in under 10 minutes.
Aborting the clustering of this reservation. Please try again.

Any suggestion on how to fix this problem?

Thank you!

Have a nice day,

Vanessa.


Re: DataStax Enterprise Amazon AMI Launch Error

2015-03-12 Thread Ali Akhtar
Seems like it's having trouble launching the other EC2 instances that you're
requesting. You would need to provide it with your AWS credentials for an
account that has the permissions to create EC2 instances. Have you done
that?

If you just want to install cassandra on AWS, you might find this bash
script useful: https://gist.github.com/aliakhtar/3649e412787034156cbb

On Thu, Mar 12, 2015 at 5:14 PM, Vanessa Gligor 
wrote:

> I'm trying to launch a new instance of the DataStax AMI on an Amazon EC2
> instance. I tried this in 2 different regions (us-east and eu-west), using
> these AMIs: ami-ada2b6c4, ami-814ec2e8 (us-east) and ami-7f33cd08,
> ami-b2212dc6 (eu-west).
>
> I followed this documentation:
> http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/install/installAMI.html
>
> So this is what I've done so far:
>
> 1. I've created a new security group (with those specific ports - I cannot
> upload the print screen because I have just created this account)
>
> 2. I've created a new key pair
>
> 3. I've launched the DataStax AMI with these configuration details:
> --clustername cluster --totalnodes 4 --version enterprise --username
> my_name --password my_password --searchnodes 2 (I have verified my
> credentials - I can login here http://debian.datastax.com/enterprise/ )
>
> 4. After selecting the previously created security group & key pair I
> launched the instance
>
> 5. I've connected to my DataStax Enterprise EC2 instance and this is the
> displayed log:
>
> Cluster started with these options: --clustername cluster --totalnodes 4
> --version enterprise --username my_name --password  --searchnodes 2
>
> 03/12/15-08:59:23 Reflector: Received 1 of 2 responses from:
> [u'172.31.34.171']... Exception seen in ds1_launcher.py. Please check
> ~/datastax_ami/ami.log for more info. Please visit 
>
>
> and the ami.log shows these messages:
>
>
> [INFO] 03/12/15-08:59:23 Reflector: Received 1 of 2 responses from: 
> [u'172.31.34.171']
> [ERROR] EC2 is experiencing some issues and has not allocated all of the 
> resources in under 10 minutes.
> Aborting the clustering of this reservation. Please try again.
> [ERROR] Exception seen in ds1_launcher.py:
> Traceback (most recent call last):
> File "/home/ubuntu/datastax_ami/ds1_launcher.py", line 22, in 
> initial_configurations
> ds2_configure.run()
>  File "/home/ubuntu/datastax_ami/ds2_configure.py", line 1135, in run
> File "/home/ubuntu/datastax_ami/ds2_configure.py", line 57, in exit_path
> AttributeError: EC2 is experiencing some issues and has not allocated all of 
> the resources in under 10 minutes.
> Aborting the clustering of this reservation. Please try again.
>
> Any suggestion on how to fix this problem?
>
> Thank you!
>
> Have a nice day,
>
> Vanessa.
>
>


Re: CQL 3.x Update ...USING TIMESTAMP...

2015-03-12 Thread Eric Stevens
> It's possible, but you'll end up with problems when attempting to
overwrite or delete entries

I'm wondering if you can elucidate on that a little bit: do you just mean
that it's easy to forget to always set your timestamp correctly, and that if you
goof it up, it makes it difficult to recover from (i.e. you issue a delete
with the system timestamp instead of the document version, and that's way larger
than your document version would ever be, so you can never write that
document again)?  Or is there some bug in write timestamps that can cause
the wrong entry to win the write contention?

We're looking at doing something similar to keep a live max value column in
a given table; our setup is as follows:

CREATE TABLE a (
  id ,
  time timestamp,
  max_b_foo int,
  PRIMARY KEY (id)
);
CREATE TABLE b (
  b_id ,
  a_id ,
  a_timestamp timestamp,
  foo int,
  PRIMARY KEY (a_id, b_id)
);

The idea being that there's a one-to-many relationship between *a* and *b*.
We want *a* to know what the maximum value is in *b* for field *foo* so we
can avoid reading *all* *b* when we want to resolve *a*. You can see that
we can't just use *b*'s clustering key to resolve that with LIMIT 1; also
this is for DSE Solr, which wouldn't be able to query a by max b.foo
anyway.  So when we write to *b*, we also write to *a* with something like

UPDATE a USING TIMESTAMP ${b.a_timestamp.toMicros + b.foo} SET max_b_foo =
${b.foo} WHERE id = ${b.a_id}

Assuming that we don't run afoul of related antipatterns such as repeatedly
overwriting the same value indefinitely, this strikes me as sound if
unorthodox practice, as long as conflict resolution in Cassandra isn't
broken in some subtle way.  We also designed this to be safe from getting
write timestamps greatly out of sync with clock time so that
non-timestamped operations (especially delete) if done accidentally will
still have a reasonable chance of having the expected results.

So while it may not be the intended use case for write timestamps, and
there are definitely gotchas if you are not careful or misunderstand the
consequences, as far as I can see the logic behind it is sound but does
rely on correct conflict resolution in Cassandra.  I'm curious if I'm
missing or misunderstanding something important.
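
For what it's worth, the behaviour we are relying on is plain last-write-wins
on the explicit timestamp, e.g. (a sketch with made-up values, assuming id in
the table *a* above is a bigint):

UPDATE a USING TIMESTAMP 2000 SET max_b_foo = 7 WHERE id = 1;
-- arrives later, but carries the smaller timestamp, so it should lose
UPDATE a USING TIMESTAMP 1000 SET max_b_foo = 3 WHERE id = 1;
-- expected to return 7 if conflict resolution works as advertised
SELECT max_b_foo FROM a WHERE id = 1;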

On Wed, Mar 11, 2015 at 4:11 PM, Tyler Hobbs  wrote:

> Don't use the version as your timestamp.  It's possible, but you'll end up
> with problems when attempting to overwrite or delete entries.
>
> Instead, make the version part of the primary key:
>
> CREATE TABLE document_store (document_id bigint, version int, document
> text, PRIMARY KEY (document_id, version)) WITH CLUSTERING ORDER BY (version
> desc)
>
> That way you don't have to worry about overwriting higher versions with a
> lower one, and to read the latest version, you only have to do:
>
> SELECT * FROM document_store WHERE document_id = ? LIMIT 1;
>
> Another option is to use lightweight transactions (i.e. UPDATE ... SET
> document = ?, version = ? WHERE document_id = ? IF version < ?), but
> that's going to make writes much more expensive.
>
> On Wed, Mar 11, 2015 at 12:45 AM, Sachin Nikam  wrote:
>
>> I am planning to use the Update...USING TIMESTAMP... statement to make
>> sure that I do not overwrite fresh data with stale data while having to
>> avoid doing at least LOCAL_QUORUM writes.
>>
>> Here is my table structure.
>>
>> Table=DocumentStore
>> DocumentID (primaryKey, bigint)
>> Document(text)
>> Version(int)
>>
>> If the service receives 2 write requests with Version=1 and Version=2,
>> regardless of the order of arrival, the business requirement is that we end
>> up with Version=2 in the database.
>>
>> Can I use the following CQL Statement?
>>
>> Update DocumentStore using 
>> SET  Document=,
>> Version=
>> where DocumentID=;
>>
>> Has anybody used something like this? If so was the behavior as expected?
>>
>> Regards
>> Sachin
>>
>
>
>
> --
> Tyler Hobbs
> DataStax 
>


Re: Adding a Cassandra node using OpsCenter

2015-03-12 Thread Ajay
Is there a separate forum for Opscenter?

Thanks
Ajay
On 11-Mar-2015 4:16 pm, "Ajay"  wrote:

> Hi,
>
> While adding a Cassandra node using OpsCenter (which is recommended), the
> versions of Cassandra (DataStax Community edition) show only 2.0.9 and not
> later versions in 2.0.x. Is there a reason behind it? Is 2.0.9 recommended
> over 2.0.11?
>
> Thanks
> Ajay
>


Re: Adding a Cassandra node using OpsCenter

2015-03-12 Thread Nick Bailey
There isn't an OpsCenter specific mailing list no.

To answer your question, the reason OpsCenter provisioning doesn't support
2.0.10 and 2.0.11 is due to
https://issues.apache.org/jira/browse/CASSANDRA-8072.

That bug unfortunately prevents OpsCenter provisioning from working
correctly, but isn't serious outside of provisioning. OpsCenter may be able
to come up with a workaround but at the moment those versions are
unsupported. Sorry for the inconvenience.

-Nick

On Thu, Mar 12, 2015 at 9:18 AM, Ajay  wrote:

> Is there a separate forum for Opscenter?
>
> Thanks
> Ajay
> On 11-Mar-2015 4:16 pm, "Ajay"  wrote:
>
>> Hi,
>>
>> While adding a Cassandra node using OpsCenter (which is recommended), the
>> versions of Cassandra (DataStax Community edition) show only 2.0.9 and not
>> later versions in 2.0.x. Is there a reason behind it? Is 2.0.9 recommended
>> over 2.0.11?
>>
>> Thanks
>> Ajay
>>
>


Re: Steps to do after schema changes

2015-03-12 Thread Mark Reddy
It's always good to run "nodetool describecluster" after a schema change;
this will show you all the nodes in your cluster and what schema version
they have. If they have different versions, you have a schema disagreement
and should follow this guide to resolution:
http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_handle_schema_disagree_t.html
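
The schema version each node is on can also be read directly from CQL, for
example (a sketch against the system tables):

-- the schema version this node is on
SELECT schema_version FROM system.local;
-- the schema versions this node has recorded for its peers via gossip
SELECT peer, schema_version FROM system.peers;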

Regards,
Mark

On 12 March 2015 at 05:47, Phil Yang  wrote:

> Usually, you have nothing to do. Changes will be synced to every node
> automatically.
>
> 2015-03-12 13:21 GMT+08:00 Ajay :
>
>> Hi,
>>
>> Are there any steps to do (like nodetool or restart node) or any
>> precautions after schema changes are done in a column family say adding a
>> new column or modifying any table properties?
>>
>> Thanks
>> Ajay
>>
>
>
>
> --
> Thanks,
> Phil Yang
>
>


Re: Stable cassandra build for production usage

2015-03-12 Thread Ajay
Hi,

We did our research using version 2.0.11. While preparing for the
production deployment, we found the following issues:

1) 2.0.12 has nodetool cleanup issue -
https://issues.apache.org/jira/browse/CASSANDRA-8718
2) 2.0.11 has nodetool issue -
https://issues.apache.org/jira/browse/CASSANDRA-8548
3) OpsCenter 5.1.0 supports only - 2.0.9 and not later 2.0.x -
https://issues.apache.org/jira/browse/CASSANDRA-8072
4) 2.0.9 has schema refresh issue -
https://issues.apache.org/jira/browse/CASSANDRA-7734

Please suggest the best option for production deployment on EC2, given
that we are deploying a Cassandra cluster for the 1st time (so it is likely
that we will add more data centers/nodes and make schema changes in the
initial few months)

Thanks
Ajay

On Thu, Jan 1, 2015 at 9:49 PM, Neha Trivedi  wrote:

> Use 2.0.11 for production
>
> On Wed, Dec 31, 2014 at 11:50 PM, Robert Coli 
> wrote:
>
>> On Wed, Dec 31, 2014 at 8:38 AM, Ajay  wrote:
>>
>>> For my research and learning I am using Cassandra 2.1.2. But I see
>>> couple of mail threads going on issues in 2.1.2. So what is the stable or
>>> popular build for production in Cassandra 2.x series.
>>>
>> https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/
>>
>> =Rob
>>
>
>


Re: Steps to do after schema changes

2015-03-12 Thread Ajay
Thanks Mark.

-
Ajay
On 12-Mar-2015 11:08 pm, "Mark Reddy"  wrote:

> It's always good to run "nodetool describecluster" after a schema change,
> this will show you all the nodes in your cluster and what schema version
> they have. If they have different versions you have a schema disagreement
> and should follow this guide to resolution:
> http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_handle_schema_disagree_t.html
>
> Regards,
> Mark
>
> On 12 March 2015 at 05:47, Phil Yang  wrote:
>
>> Usually, you have nothing to do. Changes will be synced to every node
>> automatically.
>>
>> 2015-03-12 13:21 GMT+08:00 Ajay :
>>
>>> Hi,
>>>
>>> Are there any steps to do (like nodetool or restart node) or any
>>> precautions after schema changes are done in a column family say adding a
>>> new column or modifying any table properties?
>>>
>>> Thanks
>>> Ajay
>>>
>>
>>
>>
>> --
>> Thanks,
>> Phil Yang
>>
>>
>


Re: Stable cassandra build for production usage

2015-03-12 Thread Robert Coli
On Thu, Mar 12, 2015 at 10:50 AM, Ajay  wrote:

> Please suggest the best option for production deployment on EC2, given
> that we are deploying a Cassandra cluster for the 1st time (so it is likely
> that we will add more data centers/nodes and make schema changes in the
> initial few months)
>

Voting for 2.0.13 is in progress. I'd wait for that. But I don't need
OpsCenter.

=Rob


Re: Adding a Cassandra node using OpsCenter

2015-03-12 Thread Ajay
Thanks Nick.

Does it mean that only adding a new node with 2.0.10 or later is a
problem? If a new node is added manually, can it be monitored from OpsCenter?

Thanks
Ajay
On 12-Mar-2015 10:19 pm, "Nick Bailey"  wrote:

> There isn't an OpsCenter specific mailing list no.
>
> To answer your question, the reason OpsCenter provisioning doesn't support
> 2.0.10 and 2.0.11 is due to
> https://issues.apache.org/jira/browse/CASSANDRA-8072.
>
> That bug unfortunately prevents OpsCenter provisioning from working
> correctly, but isn't serious outside of provisioning. OpsCenter may be able
> to come up with a workaround but at the moment those versions are
>> unsupported. Sorry for the inconvenience.
>
> -Nick
>
> On Thu, Mar 12, 2015 at 9:18 AM, Ajay  wrote:
>
>> Is there a separate forum for Opscenter?
>>
>> Thanks
>> Ajay
>> On 11-Mar-2015 4:16 pm, "Ajay"  wrote:
>>
>>> Hi,
>>>
>>> While adding a Cassandra node using OpsCenter (which is recommended),
>>> the versions of Cassandra (DataStax Community edition) show only 2.0.9 and
>>> not later versions in 2.0.x. Is there a reason behind it? Is 2.0.9
>>> recommended over 2.0.11?
>>>
>>> Thanks
>>> Ajay
>>>
>>
>


Re: CQL 3.x Update ...USING TIMESTAMP...

2015-03-12 Thread Jonathan Haddad
In most datacenters you're going to see significant variance in your server
times.  Likely > 20ms between servers in the same rack.  Even google, using
atomic clocks, has 1-7ms variance.  [1]

I would +1 Tyler's advice here, as using the clocks is only valid if clocks
are perfectly sync'ed, which they are not, and likely never will be in our
lifetime.

[1] http://queue.acm.org/detail.cfm?id=2745385
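
To make the alternative concrete, the version-as-clustering-column table Tyler
described (quoted below) would be used roughly like this (a sketch with
made-up values):

INSERT INTO document_store (document_id, version, document) VALUES (42, 1, 'first draft');
INSERT INTO document_store (document_id, version, document) VALUES (42, 2, 'second draft');
-- newest version comes back first thanks to CLUSTERING ORDER BY (version DESC)
SELECT * FROM document_store WHERE document_id = 42 LIMIT 1;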


On Thu, Mar 12, 2015 at 7:04 AM Eric Stevens  wrote:

> > It's possible, but you'll end up with problems when attempting to
> overwrite or delete entries
>
> I'm wondering if you can elucidate on that a little bit, do you just mean
> that it's easy to forget to always set your timestamp correctly, and if you
> goof it up, it makes it difficult to recover from (i.e. you issue a delete
> with system timestamp instead of document version, and that's way larger
> than your document version would ever be, so you can never write that
> document again)?  Or is there some bug in write timestamps that can cause
> the wrong entry to win the write contention?
>
> We're looking at doing something similar to keep a live max value column
> in a given table, our setup is as follows:
>
> CREATE TABLE a (
>   id ,
>   time timestamp,
>   max_b_foo int,
>   PRIMARY KEY (id)
> );
> CREATE TABLE b (
>   b_id ,
>   a_id ,
>   a_timestamp timestamp,
>   foo int,
>   PRIMARY KEY (a_id, b_id)
> );
>
> The idea being that there's a one-to-many relationship between *a* and *b*.
> We want *a* to know what the maximum value is in *b* for field *foo* so
> we can avoid reading *all* *b* when we want to resolve *a*. You can see
> that we can't just use *b*'s clustering key to resolve that with LIMIT 1;
> also this is for DSE Solr, which wouldn't be able to query a by max b.foo
> anyway.  So when we write to *b*, we also write to *a* with something
> like
>
> UPDATE a USING TIMESTAMP ${b.a_timestamp.toMicros + b.foo} SET max_b_foo =
> ${b.foo} WHERE id = ${b.a_id}
>
> Assuming that we don't run afoul of related antipatterns such as
> repeatedly overwriting the same value indefinitely, this strikes me as
> sound if unorthodox practice, as long as conflict resolution in Cassandra
> isn't broken in some subtle way.  We also designed this to be safe from
> getting write timestamps greatly out of sync with clock time so that
> non-timestamped operations (especially delete) if done accidentally will
> still have a reasonable chance of having the expected results.
>
> So while it may not be the intended use case for write timestamps, and
> there are definitely gotchas if you are not careful or misunderstand the
> consequences, as far as I can see the logic behind it is sound but does
> rely on correct conflict resolution in Cassandra.  I'm curious if I'm
> missing or misunderstanding something important.
>
> On Wed, Mar 11, 2015 at 4:11 PM, Tyler Hobbs  wrote:
>
>> Don't use the version as your timestamp.  It's possible, but you'll end
>> up with problems when attempting to overwrite or delete entries.
>>
>> Instead, make the version part of the primary key:
>>
>> CREATE TABLE document_store (document_id bigint, version int, document
>> text, PRIMARY KEY (document_id, version)) WITH CLUSTERING ORDER BY (version
>> desc)
>>
>> That way you don't have to worry about overwriting higher versions with a
>> lower one, and to read the latest version, you only have to do:
>>
>> SELECT * FROM document_store WHERE document_id = ? LIMIT 1;
>>
>> Another option is to use lightweight transactions (i.e. UPDATE ... SET
>> document = ?, version = ? WHERE document_id = ? IF version < ?), but
>> that's going to make writes much more expensive.
>>
>> On Wed, Mar 11, 2015 at 12:45 AM, Sachin Nikam  wrote:
>>
>>> I am planning to use the Update...USING TIMESTAMP... statement to make
>>> sure that I do not overwrite fresh data with stale data while having to
>>> avoid doing at least LOCAL_QUORUM writes.
>>>
>>> Here is my table structure.
>>>
>>> Table=DocumentStore
>>> DocumentID (primaryKey, bigint)
>>> Document(text)
>>> Version(int)
>>>
>>> If the service receives 2 write requests with Version=1 and Version=2,
>>> regardless of the order of arrival, the business requirement is that we end
>>> up with Version=2 in the database.
>>>
>>> Can I use the following CQL Statement?
>>>
>>> Update DocumentStore using 
>>> SET  Document=,
>>> Version=
>>> where DocumentID=;
>>>
>>> Has anybody used something like this? If so was the behavior as expected?
>>>
>>> Regards
>>> Sachin
>>>
>>
>>
>>
>> --
>> Tyler Hobbs
>> DataStax 
>>
>
>


Re: Adding a Cassandra node using OpsCenter

2015-03-12 Thread Nick Bailey
Correct, Opscenter can monitor 2.0.10 and later clusters/nodes. It just
can't provision them.

On Thu, Mar 12, 2015 at 1:16 PM, Ajay  wrote:

> Thanks Nick.
>
> Does it mean that only adding a new node with 2.0.10 or later is a
> problem? If a new node is added manually, can it be monitored from OpsCenter?
>
> Thanks
> Ajay
> On 12-Mar-2015 10:19 pm, "Nick Bailey"  wrote:
>
>> There isn't an OpsCenter specific mailing list no.
>>
>> To answer your question, the reason OpsCenter provisioning doesn't
>> support 2.0.10 and 2.0.11 is due to
>> https://issues.apache.org/jira/browse/CASSANDRA-8072.
>>
>> That bug unfortunately prevents OpsCenter provisioning from working
>> correctly, but isn't serious outside of provisioning. OpsCenter may be able
>> to come up with a workaround but at the moment those versions are
>> unsupported. Sorry for the inconvenience.
>>
>> -Nick
>>
>> On Thu, Mar 12, 2015 at 9:18 AM, Ajay  wrote:
>>
>>> Is there a separate forum for Opscenter?
>>>
>>> Thanks
>>> Ajay
>>> On 11-Mar-2015 4:16 pm, "Ajay"  wrote:
>>>
 Hi,

 While adding a Cassandra node using OpsCenter (which is recommended),
 the versions of Cassandra (DataStax Community edition) show only 2.0.9 and
 not later versions in 2.0.x. Is there a reason behind it? Is 2.0.9
 recommended over 2.0.11?

 Thanks
 Ajay

>>>
>>


Node data sync/recovery process

2015-03-12 Thread Akash Pandey
Hi

I have a doubt regarding the C* node recovery process.

Assumption :  A two data center C* cluster with RF=3 and CL=LOCAL_QUORUM

Suppose a node went down for a period within the hinted handoff window. Once the
node comes back up, automatic data syncing would start for that node. This
recovery may take some time.
So my doubt is: during this period, if a read comes in for a key stored on the
recovering node, will the coordinator node ask for data from the recovering
node, which might possibly have stale data?

If yes, then how does a C* client handle the situation when the majority of
replicas for a key in one data center are recovering and it can end up getting
stale data?

Help much appreciated.

Thanks
Akash


Re: CQL 3.x Update ...USING TIMESTAMP...

2015-03-12 Thread Eric Stevens
Ok, but if you're using a system of time that isn't server clock oriented
(Sachin's document revision ID, and my fixed and necessarily consistent
base timestamp [B's always know their parent A's exact recorded
timestamp]), isn't the principle of using timestamps to force a particular
update out of several to win still sound?

> as using the clocks is only valid if clocks are perfectly sync'ed, which
they are not

Clock skew is a problem which doesn't seem to be a factor in either use
case given that both have a consistent external source of truth for
timestamp.
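
As a sanity check, which write actually won can be read back with WRITETIME,
e.g. (a sketch against the table *a* from my earlier mail, assuming id is a
bigint):

-- returns the winning value and the write timestamp stored for it (in microseconds)
SELECT max_b_foo, WRITETIME(max_b_foo) FROM a WHERE id = 1;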

On Thu, Mar 12, 2015 at 12:58 PM, Jonathan Haddad  wrote:

> In most datacenters you're going to see significant variance in your
> server times.  Likely > 20ms between servers in the same rack.  Even
> google, using atomic clocks, has 1-7ms variance.  [1]
>
> I would +1 Tyler's advice here, as using the clocks is only valid if
> clocks are perfectly sync'ed, which they are not, and likely never will be
> in our lifetime.
>
> [1] http://queue.acm.org/detail.cfm?id=2745385
>
>
> On Thu, Mar 12, 2015 at 7:04 AM Eric Stevens  wrote:
>
>> > It's possible, but you'll end up with problems when attempting to
>> overwrite or delete entries
>>
>> I'm wondering if you can elucidate on that a little bit, do you just mean
>> that it's easy to forget to always set your timestamp correctly, and if you
>> goof it up, it makes it difficult to recover from (i.e. you issue a delete
>> with system timestamp instead of document version, and that's way larger
>> than your document version would ever be, so you can never write that
>> document again)?  Or is there some bug in write timestamps that can cause
>> the wrong entry to win the write contention?
>>
>> We're looking at doing something similar to keep a live max value column
>> in a given table, our setup is as follows:
>>
>> CREATE TABLE a (
>>   id ,
>>   time timestamp,
>>   max_b_foo int,
>>   PRIMARY KEY (id)
>> );
>> CREATE TABLE b (
>>   b_id ,
>>   a_id ,
>>   a_timestamp timestamp,
>>   foo int,
>>   PRIMARY KEY (a_id, b_id)
>> );
>>
>> The idea being that there's a one-to-many relationship between *a* and
>> *b*.  We want *a* to know what the maximum value is in *b* for field
>> *foo* so we can avoid reading *all* *b* when we want to resolve *a*. You
>> can see that we can't just use *b*'s clustering key to resolve that with
>> LIMIT 1; also this is for DSE Solr, which wouldn't be able to query a by
>> max b.foo anyway.  So when we write to *b*, we also write to *a* with
>> something like
>>
>> UPDATE a USING TIMESTAMP ${b.a_timestamp.toMicros + b.foo} SET max_b_foo
>> = ${b.foo} WHERE id = ${b.a_id}
>>
>> Assuming that we don't run afoul of related antipatterns such as
>> repeatedly overwriting the same value indefinitely, this strikes me as
>> sound if unorthodox practice, as long as conflict resolution in Cassandra
>> isn't broken in some subtle way.  We also designed this to be safe from
>> getting write timestamps greatly out of sync with clock time so that
>> non-timestamped operations (especially delete) if done accidentally will
>> still have a reasonable chance of having the expected results.
>>
>> So while it may not be the intended use case for write timestamps, and
>> there are definitely gotchas if you are not careful or misunderstand the
>> consequences, as far as I can see the logic behind it is sound but does
>> rely on correct conflict resolution in Cassandra.  I'm curious if I'm
>> missing or misunderstanding something important.
>>
>> On Wed, Mar 11, 2015 at 4:11 PM, Tyler Hobbs  wrote:
>>
>>> Don't use the version as your timestamp.  It's possible, but you'll end
>>> up with problems when attempting to overwrite or delete entries.
>>>
>>> Instead, make the version part of the primary key:
>>>
>>> CREATE TABLE document_store (document_id bigint, version int, document
>>> text, PRIMARY KEY (document_id, version)) WITH CLUSTERING ORDER BY (version
>>> desc)
>>>
>>> That way you don't have to worry about overwriting higher versions with
>>> a lower one, and to read the latest version, you only have to do:
>>>
>>> SELECT * FROM document_store WHERE document_id = ? LIMIT 1;
>>>
>>> Another option is to use lightweight transactions (i.e. UPDATE ... SET
>>> document = ?, version = ? WHERE document_id = ? IF version < ?), but
>>> that's going to make writes much more expensive.
>>>
>>> On Wed, Mar 11, 2015 at 12:45 AM, Sachin Nikam 
>>> wrote:
>>>
 I am planning to use the Update...USING TIMESTAMP... statement to make
 sure that I do not overwrite fresh data with stale data while having to
 avoid doing at least LOCAL_QUORUM writes.

 Here is my table structure.

 Table=DocumentStore
 DocumentID (primaryKey, bigint)
 Document(text)
 Version(int)

 If the service receives 2 write requests with Version=1 and Version=2,
 regardless of the order of arrival, the business requirement is that we end
 up with Version=2 in the database.


Re: how to clear data from disk

2015-03-12 Thread Ben Bromhead
To clarify why this behaviour occurs: by default Cassandra will snapshot
a table when you perform any destructive action (TRUNCATE, DROP, etc.).

see
http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/truncate_r.html

To free disk space after such an operation you will always need to clear
the snapshots (using either of the above suggested methods). Unfortunately this
can be a bit painful if you are rotating your tables, say by month, and
want to remove the oldest one from disk, as your client will need to speak
JMX as well.

You can disable this behaviour through the use of auto_snapshot in
cassandra.yaml. Though I would strongly recommend leaving this feature
enabled in any sane production environment and cleaning up snapshots as an
independent task!!

On 10 March 2015 at 20:43, Patrick McFadin  wrote:

> Or just manually delete the files. The directories are broken down by
> keyspace and table.
>
> Patrick
>
> On Mon, Mar 9, 2015 at 7:50 PM, 曹志富  wrote:
>
>> nodetool clearsnapshot
>>
>> --
>> Ranger Tsao
>>
>> 2015-03-10 10:47 GMT+08:00 鄢来琼 :
>>
>>>  Hi ALL,
>>>
>>>
>>>
>>> After dropping a table, I found the data was not removed from disk; I should
>>> have reduced gc_grace_seconds before the drop operation.
>>>
>>> I have to wait for 10 days, but there is not enough disk space.
>>>
>>> Could you tell me if there is a method to clear the data from disk quickly?
>>>
>>> Thank you very much!
>>>
>>>
>>>
>>> Peter
>>>
>>
>>
>


-- 

Ben Bromhead

Instaclustr | www.instaclustr.com | @instaclustr | (650) 284 9692


Re: DataStax Enterprise Amazon AMI Launch Error

2015-03-12 Thread Vanessa Gligor
When I try to launch an EC2 instance using the DataStax Community AMI it works,
so I have the permission to create an EC2 instance. I want to have
Cassandra and Solr installed as services on a VM.

Thanks!

On Thu, Mar 12, 2015 at 2:56 PM, Ali Akhtar  wrote:

> Seems like it's having trouble launching the other EC2 instances that
> you're requesting. You would need to provide it with your AWS credentials for an
> account that has the permissions to create EC2 instances. Have you done
> that?
>
> If you just want to install cassandra on AWS, you might find this bash
> script useful: https://gist.github.com/aliakhtar/3649e412787034156cbb
>
> On Thu, Mar 12, 2015 at 5:14 PM, Vanessa Gligor 
> wrote:
>
>> I'm trying to launch a new instance of the DataStax AMI on an Amazon EC2
>> instance. I tried this in 2 different regions (us-east and eu-west), using
>> these AMIs: ami-ada2b6c4, ami-814ec2e8 (us-east) and ami-7f33cd08,
>> ami-b2212dc6 (eu-west).
>>
>> I followed this documentation:
>> http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/install/installAMI.html
>>
>> So this is what I've done so far:
>>
>> 1. I've created a new security group (with those specific ports - I
>> cannot upload the print screen because I have just created this account)
>>
>> 2. I've created a new key pair
>>
>> 3. I've launched the DataStax AMI with these configuration details:
>> --clustername cluster --totalnodes 4 --version enterprise --username
>> my_name --password my_password --searchnodes 2 (I have verified my
>> credentials - I can login here http://debian.datastax.com/enterprise/ )
>>
>> 4. After selecting the previously created security group & key pair I
>> launched the instance
>>
>> 5. I've connected to my DataStax Enterprise EC2 instance and this is the
>> displayed log:
>>
>> Cluster started with these options: --clustername cluster --totalnodes 4
>> --version enterprise --username my_name --password  --searchnodes 2
>>
>> 03/12/15-08:59:23 Reflector: Received 1 of 2 responses from:
>> [u'172.31.34.171']... Exception seen in ds1_launcher.py. Please check
>> ~/datastax_ami/ami.log for more info. Please visit 
>>
>>
>> and the ami.log shows these messages:
>>
>>
>> [INFO] 03/12/15-08:59:23 Reflector: Received 1 of 2 responses from: 
>> [u'172.31.34.171']
>> [ERROR] EC2 is experiencing some issues and has not allocated all of the 
>> resources in under 10 minutes.
>> Aborting the clustering of this reservation. Please try again.
>> [ERROR] Exception seen in ds1_launcher.py:
>> Traceback (most recent call last):
>> File "/home/ubuntu/datastax_ami/ds1_launcher.py", line 22, in 
>> initial_configurations
>> ds2_configure.run()
>>  File "/home/ubuntu/datastax_ami/ds2_configure.py", line 1135, in run
>> File "/home/ubuntu/datastax_ami/ds2_configure.py", line 57, in exit_path
>> AttributeError: EC2 is experiencing some issues and has not allocated all of 
>> the resources in under 10 minutes.
>> Aborting the clustering of this reservation. Please try again.
>>
>> Any suggestion on how to fix this problem?
>>
>> Thank you!
>>
>> Have a nice day,
>>
>> Vanessa.
>>
>>
>