Re: Cassandra nodes reduce disks per node
For what it is worth, I finally wrote a blog post about this --> http://thelastpickle.com/blog/2016/02/25/removing-a-disk-mapping-from-cassandra.html If you are not done yet, every step is detailed in there.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-02-19 10:04 GMT+01:00 Alain RODRIGUEZ :

>> Alain, thanks for sharing! I'm confused why you do so many repetitive rsyncs. Just being cautious or is there another reason? Also, why do you have --delete-before when you're copying data to a temp (assumed empty) directory?

>> Since they are immutable I do a first sync while everything is up and running to the new location which runs really long. Meanwhile new ones are created and I sync them again online, much less files to copy now. After that I shutdown the node and my last rsync now has to copy only a few files which is quite fast and so the downtime for that node is within minutes.

> Jan's guess is right, except for the "immutable" thing: compaction can make big files go away, replaced by bigger ones you'll have to stream again.

> Here is a detailed explanation of why I did it this way.

> More precisely, let's say we have 10 files of 100 GB on the disk to remove (let's call it 'old-dir').

> I run a first rsync to an empty folder indeed (let's call this 'tmp-dir'), on the disk that will remain after the operation. Let's say this takes about 10 hours. This can be run in parallel though.

> So I now have 10 files of 100 GB in tmp-dir. But meanwhile one compaction triggered and I now have 6 files of 100 GB and 1 of 350 GB.

> At this point I disable compaction and stop the running ones.

> My second rsync has to remove the 4 files that were compacted away from tmp-dir, so that's why I use '--delete-before'. As this tmp-dir needs to mirror old-dir, this is fine. This new operation takes 3.5 hours, also runnable in parallel. (Keep in mind C* won't compact anything for those 3.5 hours, which is why I did not stop compaction before the first rsync; in my case the dataset was 2 TB.)

> At this point I have 950 GB in tmp-dir, but meanwhile clients continued to write to the disk, let's say 50 GB more.

> The 3rd rsync will take 0.5 hour; no compaction ran, so I just have to add the diff to tmp-dir. Still runnable in parallel.

> Then the script stops the node, so it should be run sequentially, and performs 2 more rsyncs: the first one takes the diff between the end of the 3rd rsync and the moment you stop the node, which should be a few seconds, minutes maybe, depending on how fast you ran the script after the 3rd rsync ended. The second rsync in the script is a 'useless' one, I just like to control things: I run it and expect it to say that there is no diff. It is just a way to stop the script if for some reason data is still being appended to old-dir.

> Then I just move all the files from tmp-dir to new-dir (the proper data dir remaining after the operation). This is an instant operation, as the files are not really moved: they are already on that disk, so it is just a filesystem rename.

> I finally unmount and rm -rf old-dir.

> So the full op takes 10 h + 3.5 h + 0.5 h + (number of nodes * 0.1 h), and nodes are down for about 5-10 min.

> VS

> Straightforward way (stop node, move, start node): 10 h * number of nodes, as this needs to be sequential. Plus each node is down for 10 hours, so you have to repair them, as that is longer than the hinted handoff window...
> Branton, I did not go through your process, but I guess you will be able to review it by yourself after reading the above (typically, repair is not needed if you use the strategy I describe above, as the node is down for 5-10 minutes). Also, I am not sure how "rsync -azvuiP /var/data/cassandra/data2/ /var/data/cassandra/data/" will behave; my guess is this is going to do a copy, so this might be very long. My script performs an instant move, and as the next command is 'rm -Rf /var/data/cassandra/data2' I see no reason to copy rather than move the files.

> Your solution would probably work, but with big constraints from an operational point of view (very long operation + repair needed).

> Hope this long email will be useful, maybe I should blog about this. Let me know if the process above makes sense or if some things might be improved.

> C*heers,
> -
> Alain Rodriguez
> France

> The Last Pickle
> http://www.thelastpickle.com

> 2016-02-19 7:19 GMT+01:00 Branton Davis :

>> Jan, thanks! That makes perfect sense to run a second time before stopping cassandra. I'll add that in when I do the production cluster.

>> On Fri, Feb 19, 2016 at 12:16 AM, Jan Kesten wrote:

>>> Hi Branton,

>>> two cents from me - I didn't look through the script, but for the rsyncs I do pretty much the same when moving them. Since they are immutable I do a first sync while
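For reference, the sequence described above boils down to something like the following shell sketch; the mount points, data paths and service commands are placeholders, not the actual script from the blog post:

    # pass 1: bulk copy while the node is fully live (longest pass, can run on all nodes in parallel)
    rsync -av /mnt/old-disk/cassandra/data/ /mnt/remaining-disk/cassandra/tmp-dir/

    # stop compactions so big SSTables stop appearing/disappearing under us
    nodetool disableautocompaction
    nodetool stop COMPACTION

    # pass 2: mirror old-dir again, dropping SSTables that compaction already replaced
    rsync -av --delete-before /mnt/old-disk/cassandra/data/ /mnt/remaining-disk/cassandra/tmp-dir/

    # pass 3: pick up whatever clients wrote in the meantime (still online)
    rsync -av /mnt/old-disk/cassandra/data/ /mnt/remaining-disk/cassandra/tmp-dir/

    # short downtime starts here
    nodetool drain && sudo service cassandra stop

    # pass 4: final diff (seconds); pass 5 is the paranoia check and should report no changes
    rsync -av --delete-before /mnt/old-disk/cassandra/data/ /mnt/remaining-disk/cassandra/tmp-dir/
    rsync -av --delete-before /mnt/old-disk/cassandra/data/ /mnt/remaining-disk/cassandra/tmp-dir/

    # instant move (tmp-dir and the data dir sit on the same filesystem), then restart
    mv /mnt/remaining-disk/cassandra/tmp-dir/* /mnt/remaining-disk/cassandra/data/
    sudo service cassandra start
    # afterwards: unmount the old disk and remove the old data directory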
Handling uncommitted paxos state
Hi,

I have some questions about the behaviour of 'uncommitted paxos state', as described here: http://www.datastax.com/dev/blog/cassandra-error-handling-done-right

If a WriteTimeoutException with WriteType.SIMPLE is thrown for a CAS write, that means that the paxos phase was successful, but the data couldn't be committed during the final 'commit/reset' phase. On the next SERIAL write or read, any other node can commit the write on behalf of the original proposer, and in fact must do so before forming a new ballot. This stops the columns from getting 'stuck' if the coordinator experiences a network partition after forming the ballot, but before committing.

My questions are on the durability of the uncommitted state:

Suppose CAS writes are infrequent, and it takes weeks before another write is done to that column; will the paxos state still be there, waiting forever until the next commit, or does it get automatically committed during GC if you wait long enough? (I don't see how it could be cleaned up by a GC though, since the nodes holding the paxos state don't know if the ballot was won or not.)

Or, what if all the nodes are switched off (briefly); is the uncommitted paxos state persisted to disk in the log/journal, so the write can still be completed when the cluster comes back online?

Finally, how granular is the paxos state? Will the uncommitted write be completed on the next SERIAL write that touches the same exact combination of cells, or is it per-column across the partition, or something else? If the CAS write touches two or three cells in the row, will a subsequent SERIAL read from any one of those columns complete the uncommitted state, presumably on the other columns as well?

Thanks for your help,
Nick

---
Nick Wilson
Software engineer, RealVNC
Re: Cassandra nodes reduce disks per node
Nice thanks !

On Thu, Feb 25, 2016 at 1:51 PM, Alain RODRIGUEZ wrote:

> For what it is worth, I finally wrote a blog post about this --> http://thelastpickle.com/blog/2016/02/25/removing-a-disk-mapping-from-cassandra.html
>
> If you are not done yet, every step is detailed in there.
>
> C*heers,
> ---
> Alain Rodriguez - al...@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
Re: Cassandra Data Audit
There is an open Jira on this exact topic - Change Data Capture (CDC): https://issues.apache.org/jira/browse/CASSANDRA-8844

Unfortunately, open means not yet done.

-- Jack Krupansky

On Thu, Feb 25, 2016 at 2:13 AM, Charulata Sharma (charshar) <chars...@cisco.com> wrote:

> Thanks for the responses. I was looking for something available in open-source Cassandra, not DSE. Looks like there isn't any, so I am planning to create a column family and have it populated.
>
> Thanks,
> Charu
>
> From: Alain RODRIGUEZ [mailto:arodr...@gmail.com]
> Sent: Wednesday, February 24, 2016 2:57 AM
> To: user@cassandra.apache.org
> Subject: Re: Cassandra Data Audit
>
> From http://docs.datastax.com/en/datastax_enterprise/4.6/datastax_enterprise/sec/secAuditCassandraTableColumns.html
>
> I guess you will not have the previous value that easily, yet all the operations seem to be logged, so looking for the last insert operation on a specific partition should give you the information you are looking for.
>
> Once again, I never used it, I just wanted to point this out to you since it could be what you are looking for. Maybe someone else will be able to give you some more detailed information. If you use DSE, you should be able to ask Datastax directly as this is DSE specific (AFAIK).
>
> C*heers,
> -
> Alain Rodriguez
> France
>
> The Last Pickle
> http://www.thelastpickle.com
>
> 2016-02-24 11:41 GMT+01:00 Raman Gugnani :
>
> Hi Alain,
>
> As per the document, which column of dse_audit.audit_log will hold the previous or new data?
>
> On Wed, Feb 24, 2016 at 3:59 PM, Alain RODRIGUEZ wrote:
>
> Hi Charu,
>
> Are you using DSE or open source Cassandra?
>
> I never used it, but DSE brings a feature that seems to be what you are looking for --> http://docs.datastax.com/en/datastax_enterprise/4.6/datastax_enterprise/sec/secAuditingCassandraTable.html
>
> Never heard about such a thing in the open source version though.
>
> C*heers,
> -
> Alain Rodriguez
> France
>
> The Last Pickle
> http://www.thelastpickle.com
>
> 2016-02-24 6:36 GMT+01:00 Charulata Sharma (charshar) :
>
> To all Cassandra experts out there,
>
> Can you please let me know if there is any inbuilt Cassandra feature that allows audits on column family data?
>
> When I change any data in a CF, I want to record that change, probably storing the old value as well as the changed one. One way of doing this is to create new CFs, but I wanted to know if there is any standard C* feature that could be used. Any guidance on this and implementation approaches would really help.
>
> Thanks,
> Charu
>
> --
> Thanks & Regards
>
> Raman Gugnani
> Senior Software Engineer | CaMS
> M: +91 8588892293 | T: 0124-660 | EXT: 14255
> ASF Centre A | 2nd Floor | CA-2130 | Udyog Vihar Phase IV | Gurgaon | Haryana | India
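As an illustration, the "create a column family and populate it yourself" approach Charu mentions could look roughly like the sketch below; the keyspace, table and column names are entirely hypothetical, since open source Cassandra has no built-in audit schema:

    # hypothetical application-maintained audit table: one row per changed cell
    cqlsh -e "CREATE TABLE IF NOT EXISTS myks.data_audit (
                  entity_id   text,
                  changed_at  timeuuid,
                  column_name text,
                  old_value   text,
                  new_value   text,
                  changed_by  text,
                  PRIMARY KEY ((entity_id), changed_at, column_name)
              ) WITH CLUSTERING ORDER BY (changed_at DESC, column_name ASC);"

The application would insert a row here alongside each mutation, so it only captures changes made through the application itself, unlike the CDC feature tracked in CASSANDRA-8844.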
Re: Cassandra nodes reduce disks per node
You're welcome, if you have some feedback you can comment the blog post :-).

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-02-25 12:28 GMT+01:00 Anishek Agarwal :

> Nice thanks !
>
> On Thu, Feb 25, 2016 at 1:51 PM, Alain RODRIGUEZ wrote:
>
>> For what it is worth, I finally wrote a blog post about this --> http://thelastpickle.com/blog/2016/02/25/removing-a-disk-mapping-from-cassandra.html
>>
>> If you are not done yet, every step is detailed in there.
>>
>> C*heers,
>> ---
>> Alain Rodriguez - al...@thelastpickle.com
>> France
>>
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
Re: how to read parent_repair_history table?
Hello Jimmy,

The parent_repair_history table keeps track of the start and finish information of a repair session. The other table, repair_history, keeps track of repair status as it progresses. So, you must first query the parent_repair_history table to check whether a repair started and finished, as well as its duration, and inspect the repair_history table to troubleshoot more specific details of a given repair session.

Answering your questions below:

> Is every invocation of nodetool repair recorded as one entry in the parent_repair_history CF regardless of whether it is across DCs, a local node repair, or other options?

Actually two entries, one for start and one for finish.

> A repair job is done only if the "finished" column contains a value? And a repair job is successfully done only if there is no value in exception_message or exception_stacktrace?

correct

> What is the purpose of the successful_ranges column? Do I have to check they all match requested_ranges to ensure a successful run?

correct

> Ultimately, how to find out the overall repair health/status in a given cluster?

Check if repair is being executed on all nodes within gc_grace_seconds, and tune that value or troubleshoot problems otherwise.

> Scanning through parent_repair_history and making sure all the known keyspaces have had a good repair run in recent days?

Sounds good.

You can check https://issues.apache.org/jira/browse/CASSANDRA-5839 for more information.

2016-02-25 3:13 GMT-03:00 Jimmy Lin :

> hi all,
> A few questions regarding how to read or digest the system_distributed.parent_repair_history CF, which I am very interested in using to find out our repair status...
>
> Is every invocation of nodetool repair recorded as one entry in the parent_repair_history CF regardless of whether it is across DCs, a local node repair, or other options?
>
> A repair job is done only if the "finished" column contains a value? And a repair job is successfully done only if there is no value in exception_message or exception_stacktrace? What is the purpose of the successful_ranges column? Do I have to check they all match requested_ranges to ensure a successful run?
>
> Ultimately, how to find out the overall repair health/status in a given cluster? Scanning through parent_repair_history and making sure all the known keyspaces have had a good repair run in recent days?
>
> ---
> CREATE TABLE system_distributed.parent_repair_history (
>     parent_id timeuuid PRIMARY KEY,
>     columnfamily_names set<text>,
>     exception_message text,
>     exception_stacktrace text,
>     finished_at timestamp,
>     keyspace_name text,
>     requested_ranges set<text>,
>     started_at timestamp,
>     successful_ranges set<text>
> )
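As a quick illustration (not part of the original mails), the check described above can be run from cqlsh using only the columns shown in the quoted schema:

    # list repair sessions; a missing finished_at or a non-empty exception_message
    # means the session did not complete cleanly
    cqlsh -e "SELECT parent_id, keyspace_name, columnfamily_names, started_at, finished_at, exception_message
              FROM system_distributed.parent_repair_history;"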
Consistent read timeouts for bursts of reads
Hello, We're having a problem with concurrent requests. It seems that whenever we try resolving more than ~ 15 queries at the same time, one or two get a read timeout and then succeed on a retry. We're running Cassandra 2.2.4 accessed via the 2.1.9 Datastax driver on AWS. What we've found while investigating: * this is not db-wide. Trying the same pattern against another table everything works fine. * it fails 1 or 2 requests regardless of how many are executed in parallel, i.e., it's still 1 or 2 when we ramp it up to ~ 120 concurrent requests and doesn't seem to scale up. * the problem is consistently reproducible. It happens both under heavier load and when just firing off a single batch of requests for testing. * tracing the faulty requests says everything is great. An example trace: https://gist.github.com/emilssolmanis/41e1e2ecdfd9a0569b1a * the only peculiar thing in the logs is there's no acknowledgement of the request being accepted by the server, as seen in https://gist.github.com/emilssolmanis/242d9d02a6d8fb91da8a * there's nothing funny in the timed out Cassandra node's logs around that time as far as I can tell, not even in the debug logs. Any ideas about what might be causing this, pointers to server config options, or how else we might debug this would be much appreciated. Kind regards, Emils
Re: Consistent read timeouts for bursts of reads
Having had a read through the archives, I missed this at first, but this seems to be *exactly* like what we're experiencing. http://www.mail-archive.com/user@cassandra.apache.org/msg46064.html Only difference is we're getting this for reads and using CQL, but the behaviour is identical. On Thu, 25 Feb 2016 at 14:55 Emīls Šolmanis wrote: > Hello, > > We're having a problem with concurrent requests. It seems that whenever we > try resolving more > than ~ 15 queries at the same time, one or two get a read timeout and then > succeed on a retry. > > We're running Cassandra 2.2.4 accessed via the 2.1.9 Datastax driver on > AWS. > > What we've found while investigating: > > * this is not db-wide. Trying the same pattern against another table > everything works fine. > * it fails 1 or 2 requests regardless of how many are executed in > parallel, i.e., it's still 1 or 2 when we ramp it up to ~ 120 concurrent > requests and doesn't seem to scale up. > * the problem is consistently reproducible. It happens both under heavier > load and when just firing off a single batch of requests for testing. > * tracing the faulty requests says everything is great. An example trace: > https://gist.github.com/emilssolmanis/41e1e2ecdfd9a0569b1a > * the only peculiar thing in the logs is there's no acknowledgement of > the request being accepted by the server, as seen in > https://gist.github.com/emilssolmanis/242d9d02a6d8fb91da8a > * there's nothing funny in the timed out Cassandra node's logs around > that time as far as I can tell, not even in the debug logs. > > Any ideas about what might be causing this, pointers to server config > options, or how else we might debug this would be much appreciated. > > Kind regards, > Emils > >
Re: how to read parent_repair_history table?
Hi Jimmy,

We are on 2.0.x. We are planning to use JMX notifications for getting repair status. To repair the database, we call the forceTableRepairPrimaryRange JMX operation from our Java client application on each node. You can call other, more recent JMX methods for repair.

I would be keen to know the pros/cons of handling repair status via JMX notifications vs. via database tables.

We are planning to implement it as follows:

1. Before repairing each keyspace via JMX, register two listeners: one for listening to StorageService MBean notifications about repair status, and the other a connection listener for detecting connection failures and lost JMX notifications.

2. We ensure that if 256 successful session notifications are received, the keyspace repair is successful. We have 256 ranges on each node.

3. If there are connection closed notifications, we will re-register the MBean listener and retry the repair once.

4. If there are lost notifications, we retry the repair once before failing it.

Thanks
Anuj

Sent from Yahoo Mail on Android
Re: how to read parent_repair_history table?
Hi Anuj,

I never thought of using JMX notifications as a way to check. Partially, I think it requires a live connection or application to keep the notifications flowing in, while the DB approach lets you look up current or past jobs whenever you want.

Thanks

Sent from my iPhone
Re: how to read parent_repair_history table?
hi Paulo,

Follow up on the # of entries question...

Why will each repair job execution have 2 entries? I thought it would be one entry, beginning with the started_at column filled, and when it completed, the finished_at column would be filled.

Also, if my cluster has more than 1 keyspace, the way this table is structured, it will have multiple entries, one for each keyspace_name value, no?

Thanks

Sent from my iPhone
Re: Handling uncommitted paxos state
The paxos state is written to a system table (system.paxos) on each of the paxos coordinators, so it goes through the normal write path, including persisting to the commit log and being stored in a memtable until being flushed to disk. As such, the state can survive restarts.

These states are not treated differently from our normal memtables, so there isn't any special handling for a GC. There is no process which will come in and fix up the values; they are fixed at a partition level when trying to perform a CAS operation, or when reading at SERIAL consistency. This operation happens at the partition level, so if any part of the partition is read or updated, it will finish previous transactions.

If you want to know more, http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0 has a lot more information about lightweight transactions.

-Carl
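If you want to look at that state directly, system.paxos can be read like any other table; a small sketch, with column names taken from the 2.x schema (worth verifying against your version):

    # row_key is the partition key the CAS touched, cf_id identifies the table;
    # the ballot/commit columns show in-flight and most recently committed state
    cqlsh -e "SELECT row_key, cf_id, in_progress_ballot, proposal_ballot, most_recent_commit_at
              FROM system.paxos LIMIT 20;"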
Re: how to read parent_repair_history table?
> Why will each repair job execution have 2 entries? I thought it would be one entry, beginning with the started_at column filled, and when it completed, the finished_at column would be filled.

that's correct, I was mistaken!

> Also, if my cluster has more than 1 keyspace, the way this table is structured, it will have multiple entries, one for each keyspace_name value, no? thanks

right, because repair sessions in different keyspaces will have different repair session ids.
Re: Handling uncommitted paxos state
On Thu, Feb 25, 2016 at 1:23 AM, Nicholas Wilson < nicholas.wil...@realvnc.com> wrote: > If a WriteTimeoutException with WriteType.SIMPLE is thrown for a CAS > write, that means that the paxos phase was successful, but the data > couldn't be committed during the final 'commit/reset' phase. On the next > SERIAL write or read, any other node can commit the write on behalf of the > original proposer, and must do so in fact before forming a new ballot. The > stops the columns from getting 'stuck' if the coordinator experiences a > network partition after forming the ballot, but before committing. > If you're asking these questions, you probably want to read : https://issues.apache.org/jira/browse/CASSANDRA-9328 =Rob
Re: how to read parent_repair_history table?
hi Paulo,

One more follow up ... :)

I noticed these tables are supposed to be replicated to all nodes in the cluster, and are not per-node specific.

How does it work when a repair job targets only the local DC vs all DCs? Are there any columns or flags where I can tell the difference? Or does it actually matter?

Thanks

Sent from my iPhone
Checking replication status
hi all,

What are the better ways to check the overall replication status of a Cassandra cluster?

Within a single DC, unless a node is down for a long time, most of the time I feel it is pretty much a non-issue and things are replicated pretty fast. But when a node comes back after being offline for a long time, is there a way to check that the node has finished its data sync with the other nodes?

Now across DCs, we have frequent VPN outages (sometimes short, sometimes long) between DCs. I would also like to know if there is a way to find out how replication between DCs is catching up under this condition.

Also, if I understand correctly, the only guaranteed way to make sure data is synced is to run a complete repair job, is that correct? I am trying to see if there is a way to "force a quick replication sync" between DCs after a VPN outage. Or maybe this is unnecessary, as Cassandra will catch up as fast as it can and there is nothing else we/(system admins) can do to make it faster or better?

Sent from my iPhone
CsvReporter not spitting out metrics in cassandra
Hi,

I have added the following file on my cassandra node, /etc/dse/cassandra/metrics-reporter-config.yaml:

csv:
  - outdir: '/mnt/cassandra/metrics'
    period: 10
    timeunit: 'SECONDS'
    predicate:
      color: "white"
      useQualifiedName: true
      patterns:
        - "^org.apache.cassandra.metrics.Cache.+"
        - "^org.apache.cassandra.metrics.ClientRequest.+"
        - "^org.apache.cassandra.metrics.CommitLog.+"
        - "^org.apache.cassandra.metrics.Compaction.+"
        - "^org.apache.cassandra.metrics.DroppedMetrics.+"
        - "^org.apache.cassandra.metrics.ReadRepair.+"
        - "^org.apache.cassandra.metrics.Storage.+"
        - "^org.apache.cassandra.metrics.ThreadPools.+"
        - "^org.apache.cassandra.metrics.ColumnFamily.+"
        - "^org.apache.cassandra.metrics.Streaming.+"

And then added this line to /etc/dse/cassandra/cassandra-env.sh:

JVM_OPTS="$JVM_OPTS -Dcassandra.metricsReporterConfigFile=metrics-reporter-config.yaml"

And then finally restarted DSE: /etc/init.d/dse restart

I don't see any CSV metrics files being written out by the metrics reporter in the /mnt/cassandra/metrics folder. Any ideas why?
Re: Checking replication status
Hmm. What are your processes when a node comes back after "a long offline"? Long enough to take the node offline and do a repair? Run the risk of serving stale data? Parallel repairs? ???

So, what sort of time frames are "a long time"?

...
Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

On Thu, Feb 25, 2016 at 11:36 AM, Jimmy Lin wrote:

> hi all,
>
> what are the better ways to check replication overall status of cassandra cluster?
>
> within a single DC, unless a node is down for long time, most of the time i feel it is pretty much non-issue and things are replicated pretty fast. But when a node come back from a long offline, is there a way to check that the node has finished its data sync with other nodes ?
>
> Now across DC, we have frequent VPN outage (sometime short sometims long) between DCs, i also like to know if there is a way to find how the replication progress between DC catching up under this condtion?
>
> Also, if i understand correctly, the only gaurantee way to make sure data are synced is to run a complete repair job, is that correct? I am trying to see if there is a way to "force a quick replication sync" between DCs after vpn outage. Or maybe this is unnecessary, as Cassandra will catch up as fast as it can, there is nothing else we/(system admin) can do to make it faster or better?
>
> Sent from my iPhone
Re: how to read parent_repair_history table?
> How does it work when a repair job targets only the local DC vs all DCs? Are there any columns or flags where I can tell the difference? Or does it actually matter?

You can not easily find out from the parent_repair_history table whether a repair is local-only or multi-DC. I created https://issues.apache.org/jira/browse/CASSANDRA-11244 to add more information to that table.

Since that table only has the id as primary key, you'd need to do a full scan to perform checks on it, or keep track of the parent session id when submitting the repair and query by primary key.

What you could probably do to health-check that your nodes are repaired on time is to check, for each table:

select * from repair_history where keyspace_name = 'ks' and columnfamily_name = 'cf' and id > mintimeuuid(now() - gc_grace_seconds/2);

And then verify for each node whether all of its ranges have been repaired in this period, and send an alert otherwise. You can find out a node's ranges by querying JMX via StorageServiceMBean.getRangeToEndpointMap.

To make this task a bit simpler you could probably add a secondary index to the participants column of the repair_history table with:

CREATE INDEX myindex ON system_distributed.repair_history (participants);

and check each node's status individually with:

select * from repair_history where keyspace_name = 'ks' and columnfamily_name = 'cf' and id > mintimeuuid(now() - gc_grace_seconds/2) AND participants CONTAINS 'node_IP';
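If JMX is inconvenient, nodetool can print a similar range-to-endpoint view (an editorial suggestion, not part of Paulo's reply); the keyspace name is a placeholder:

    # token ranges and their replica endpoints for one keyspace; these are the ranges
    # you would expect to see covered by recent repair_history entries for each node
    nodetool describering my_keyspace

    # effective ownership per node for the same keyspace
    nodetool status my_keyspace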
Unexpected high internode network activity
Hello,

We have a Cassandra 2.1.9 cluster on EC2 for one of our live applications. There's a total of 21 nodes across 3 AWS availability zones, on c3.2xlarge instances.

The configuration is pretty standard: we use the default settings that come with the DataStax AMI, and the driver in our application is configured to use lz4 compression. The keyspace where all the activity happens has RF 3, and we read and write at QUORUM to get strong consistency.

While analyzing our monthly bill, we noticed that the amount of network traffic related to Cassandra was significantly higher than expected. After breaking it down by port, it seems like over any given time window the internode network activity is 6-7 times higher than the traffic on port 9042, whereas we would expect something around 2-3 times, given the replication factor and the consistency level of our queries.

For example, this is the network traffic broken down by port and direction over a few minutes, measured as the sum over all nodes:

Port 9042 from client to cluster (write queries): 1 GB
Port 9042 from cluster to client (read queries): 1.5 GB
Port 7000: 35 GB, which must be divided by two because the traffic is always directed to another instance of the cluster, so that makes it 17.5 GB of generated traffic

The traffic on port 9042 completely matches our expectations: we do about 100k write operations, writing 10KB binary blobs in each query, and a bit more reads on the same data.

According to our calculations, in the worst case, when the coordinator of the query is not a replica for the data, this should generate about (1 + 1.5) * 3 = 7.5 GB, and instead we see 17 GB, which is quite a lot more.

Also, hinted handoffs are disabled and nodes were healthy over the period of observation, and I get the same numbers across pretty much every time window, even including an entire 24-hour period.

I tried to replicate this problem in a test environment, so I connected a client to a test cluster made of a bunch of Docker containers (same parameters; essentially the only difference is the GossipingPropertyFileSnitch instead of the EC2 one), and I always get what I expect: the amount of traffic on port 7000 is between 2 and 3 times the amount of traffic on port 9042, and the queries are pretty much the same ones.

Before doing more analysis, I was wondering if someone has an explanation for this behavior, since perhaps we are missing something obvious here?

Thanks
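As a back-of-the-envelope sketch of the expectation described above (the numbers are the ones quoted in the message, and the "worst case" treats both reads and writes as touching all RF replicas via a non-replica coordinator):

# Rough model of expected inter-node (port 7000) traffic using the figures
# from the message above. It reproduces the poster's worst-case estimate;
# it is not an exact accounting of Cassandra's messaging overhead.
RF = 3

client_writes_gb = 1.0   # client -> cluster on 9042
client_reads_gb = 1.5    # cluster -> client on 9042

# Worst case: the coordinator is never a replica, so every write is forwarded
# to all RF replicas, and (pessimistically) reads pull full data from all RF
# replicas rather than one full copy plus digests.
expected_internode_gb = (client_writes_gb + client_reads_gb) * RF   # 7.5 GB

measured_port7000_gb = 35 / 2.0   # each transfer is seen by both sender and receiver

print("expected ~%.1f GB, measured ~%.1f GB, ratio %.1fx"
      % (expected_internode_gb, measured_port7000_gb,
         measured_port7000_gb / expected_internode_gb))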
Migrating from single node to cluster
Hi,

I am wondering if there is any documentation on migrating from a single-node Cassandra instance to a multi-node cluster? My searches have been unsuccessful so far, and I have had no luck experimenting with the tools because of their terse output.

I currently use a single node holding data that must be retained, and I want to add two nodes to create a cluster. I have tried to follow the instructions at the link below, but it is unclear whether it even works to go from 1 node to 2.

https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_add_node_to_cluster_t.html

Almost no data has been transferred across, and nodetool status is showing that 0% of the data is owned by either node, although I cannot determine what the percentages should be in the case where the configuration is intended for data redundancy.

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.10.8  648.16 MB  256     0.0%              5ce4f8ff-3ba4-41b2-8fd5-7d00d98c415f  rack1
UN  192.168.10.9  3.31 MB    256     0.0%              b56f6d58-0f60-473f-b202-f43ecc7a83f5  rack1

I also looked to see if there were any tools to check whether replication is in progress but had no luck. The second node is bootstrapped, and nodetool repair indicates that nothing needs to be done.

Any suggestions on a path to take? I am at a loss.

Thanks,
Jason
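One thing that can make the ownership numbers above easier to interpret is that "Owns (effective)" is computed relative to a specific keyspace's replication settings. A minimal sketch, assuming nodetool is on the PATH and using a placeholder keyspace name:

# Hypothetical helper (not from the thread): ask nodetool for ownership of a
# specific keyspace, since effective ownership depends on that keyspace's
# replication strategy and factor.
import subprocess

KEYSPACE = 'my_keyspace'   # placeholder; use the keyspace being migrated

out = subprocess.check_output(['nodetool', 'status', KEYSPACE]).decode()
print(out)
# With replication configured to cover both nodes, the effective ownership
# percentages across nodes should add up to RF * 100%.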
Re: how to read parent_repair_history table?
hi Paulo, that is right, I forgot there is another table that actually tracking the rest of the detail of the repairs. thanks for the pointers, will explore more with those info. I am actually surprised not much doc out there talk about these two tables, or other tools or utilities harvesting these data. (?) thanks On Thu, Feb 25, 2016 at 1:38 PM, Paulo Motta wrote: > > how does it work when repair job targeting only local vs all DC? is > there any columns or flag i can tell the difference? or does it actualy > matter? > > You can not easily find out from the parent_repair_session table if a > repair is local-only or multi-dc. I created > https://issues.apache.org/jira/browse/CASSANDRA-11244 to add more > information to that table. Since that table only has id as primary key, > you'd need to do a full scan to perform checks on it, or keep track of the > parent id session when submitting the repair and query by primary key. > > What you could probably do to health check your nodes are repaired on time > is to check for each table: > > select * from repair_history where keyspace = 'ks' columnfamily_name = > 'cf' and id > mintimeuuid(now() - gc_grace_seconds/2); > > And then verify for each node if all of its ranges have been repaired in > this period, and send an alert otherwise. You can find out a nodes range by > querying JMX via StorageServiceMBean.getRangeToEndpointMap. > > To make this task a bit simpler you could probably add a secondary index > to the participants column of repair_history table with: > > CREATE INDEX myindex ON system_distributed.repair_history (participants) ; > > and check each node status individually with: > > select * from repair_history where keyspace = 'ks' columnfamily_name = > 'cf' and id > mintimeuuid(now() - gc_grace_seconds/2) AND participants > CONTAINS 'node_IP'; > > > > 2016-02-25 16:22 GMT-03:00 Jimmy Lin : > >> hi Paulo, >> >> one more follow up ... :) >> >> I noticed these tables are suppose to replicatd to all nodes in the >> cluster, and it is not per node specific. >> >> how does it work when repair job targeting only local vs all DC? is there >> any columns or flag i can tell the difference? >> or does it actualy matter? >> >> thanks >> >> >> >> >> Sent from my iPhone >> >> On Feb 25, 2016, at 10:37 AM, Paulo Motta >> wrote: >> >> > why each job repair execution will have 2 entries? I thought it will >> be one entry, begining with started_at column filled, and when it >> completed, finished_at column will be filled. >> >> that's correct, I was mistaken! >> >> > Also, if my cluster has more than 1 keyspace, and the way this table >> is structured, it will have multiple entries, one for each keysapce_name >> value. no ? thanks >> >> right, because repair sessions in different keyspaces will have different >> repair session ids. >> >> 2016-02-25 15:04 GMT-03:00 Jimmy Lin : >> >>> hi Paulo, >>> >>> follow up on the # of entries question... >>> >>> why each job repair execution will have 2 entries? >>> I thought it will be one entry, begining with started_at column filled, and >>> when it completed, finished_at column will be filled. >>> >>> Also, if my cluster has more than 1 keyspace, and the way this table is >>> structured, it will have multiple entries, one for each keysapce_name >>> value. no ? >>> >>> thanks >>> >>> >>> >>> Sent from my iPhone >>> >>> On Feb 25, 2016, at 5:48 AM, Paulo Motta >>> wrote: >>> >>> Hello Jimmy, >>> >>> The parent_repair_history table keeps track of start and finish >>> information of a repair session. 
The other table repair_history keeps >>> track of repair status as it progresses. So, you must first query the >>> parent_repair_history table to check if a repair started and finish, as >>> well as its duration, and inspect the repair_history table to troubleshoot >>> more specific details of a given repair session. >>> >>> Answering your questions below: >>> >>> > Is every invocation of nodetool repair execution will be recorded as >>> one entry in parent_repair_history CF regardless if it is across DC, local >>> node repair, or other options ? >>> >>> Actually two entries, one for start and one for finish. >>> >>> > A repair job is done only if "finished" column contains value? and a >>> repair job is successfully done only if there is no value in exce >>> ption_messages or exception_stacktrace ? >>> >>> correct >>> >>> > what is the purpose of successful_ranges column? do i have to check >>> they are all matched with requested_range to ensure a successful run? >>> >>> correct >>> >>> - >>> > Ultimately, how to find out the overall repair health/status in a >>> given cluster? >>> >>> Check if repair is being executed on all nodes within gc_grace_seconds, >>> and tune that value or troubleshoot problems otherwise. >>> >>> > Scanning through parent_repair_history and making sure all the known >>> keyspaces has a good repair run in recent days? >>> >>> Sounds good. >>> >>> You can check https://issues.apache.org/jira/
Re: Checking replication status
so far they are not long, just some config changes and a restart. if it is a 2 hr downtime due to whatever reason, is a repair a better option than trying to figure out whether the replication sync finished or not?

On Thu, Feb 25, 2016 at 1:09 PM, daemeon reiydelle wrote:

> Hmm. What are your processes when a node comes back after "a long offline"? Long enough to take the node offline and do a repair? Run the risk of serving stale data? Parallel repairs? ???
>
> So, what sort of time frames are "a long time"?
>
> *...*
>
> *Daemeon C.M. Reiydelle
> USA (+1) 415.501.0198
> London (+44) (0) 20 8144 9872*
>
> On Thu, Feb 25, 2016 at 11:36 AM, Jimmy Lin wrote:
>
>> hi all,
>>
>> what are the better ways to check the overall replication status of a cassandra cluster?
>>
>> within a single DC, unless a node is down for a long time, most of the time i feel it is pretty much a non-issue and things are replicated pretty fast. But when a node comes back from a long offline period, is there a way to check that the node has finished its data sync with the other nodes?
>>
>> Now across DCs, we have frequent VPN outages (sometimes short, sometimes long) between DCs; i would also like to know if there is a way to find out how the replication between DCs is catching up under this condition?
>>
>> Also, if i understand correctly, the only guaranteed way to make sure data is synced is to run a complete repair job, is that correct? I am trying to see if there is a way to "force a quick replication sync" between DCs after a vpn outage. Or maybe this is unnecessary, as Cassandra will catch up as fast as it can, and there is nothing else we/(system admin) can do to make it faster or better?
>>
>> Sent from my iPhone
>
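As a rough spot check (a full repair remains the only guarantee, as noted in the question above), reading a sample of recently written keys at ConsistencyLevel.ALL will fail while any replica cannot answer and, when replicas disagree, triggers a foreground read repair for those keys. A minimal sketch, assuming the Python cassandra-driver and placeholder keyspace, table, and key values:

# Rough spot check, not a substitute for repair: probe a few recently written
# keys at CL ALL. The read fails if any replica is unreachable, and on digest
# mismatch the coordinator resolves and repairs those keys in the foreground.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

SAMPLE_KEYS = ['key1', 'key2', 'key3']   # placeholder: recently written keys

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')

stmt = SimpleStatement("SELECT id FROM my_table WHERE id = %s",
                       consistency_level=ConsistencyLevel.ALL)

for key in SAMPLE_KEYS:
    try:
        session.execute(stmt, (key,), timeout=10)
        print("%s: all replicas answered" % key)
    except Exception as exc:   # e.g. Unavailable/ReadTimeout while a DC lags
        print("%s: not yet consistent everywhere (%s)" % (key, exc))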
Re: Unexpected high internode network activity
If read & write at quorum then you write 3 copies of the data then return to the caller; when reading you read one copy (assume it is not on the coordinator), and 1 digest (because read at quorum is 2, not 3). When you insert, how many keyspaces get written to? (Are you using e.g. inverted indices?) That is my guess, that your db has about 1.8 bytes written for every byte inserted. Every byte you write is counted also as a read (system a sends 1gb to system b, so system b receives 1gb). You would not be charged if intra AZ, but inter AZ and inter DC will get that double count. So, my guess is reverse indexes, and you forgot to include receive and transmit. *...* *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198London (+44) (0) 20 8144 9872* On Thu, Feb 25, 2016 at 6:51 PM, Gianluca Borello wrote: > Hello, > > We have a Cassandra 2.1.9 cluster on EC2 for one of our live applications. > There's a total of 21 nodes across 3 AWS availability zones, c3.2xlarge > instances. > > The configuration is pretty standard, we use the default settings that > come with the datastax AMI and the driver in our application is configured > to use lz4 compression. The keyspace where all the activity happens has RF > 3 and we read and write at quorum to get strong consistency. > > While analyzing our monthly bill, we noticed that the amount of network > traffic related to Cassandra was significantly higher than expected. After > breaking it down by port, it seems like over any given time, the internode > network activity is 6-7 times higher than the traffic on port 9042, whereas > we would expect something around 2-3 times, given the replication factor > and the consistency level of our queries. > > For example, this is the network traffic broken down by port and direction > over a few minutes, measured as sum of each node: > > Port 9042 from client to cluster (write queries): 1 GB > Port 9042 from cluster to client (read queries): 1.5 GB > Port 7000: 35 GB, which must be divided by two because the traffic is > always directed to another instance of the cluster, so that makes it 17.5 > GB generated traffic > > The traffic on port 9042 completely matches our expectations, we do about > 100k write operations writing 10KB binary blobs for each query, and a bit > more reads on the same data. > > According to our calculations, in the worst case, when the coordinator of > the query is not a replica for the data, this should generate about (1 + > 1.5) * 3 = 7.5 GB, and instead we see 17 GB, which is quite a lot more. > > Also, hinted handoffs are disabled and nodes are healthy over the period > of observation, and I get the same numbers across pretty much every time > window, even including an entire 24 hours period. > > I tried to replicate this problem in a test environment so I connected a > client to a test cluster done in a bunch of Docker containers (same > parameters, essentially the only difference is the > GossipingPropertyFileSnitch instead of the EC2 one) and I always get what I > expect, the amount of traffic on port 7000 is between 2 and 3 times the > amount of traffic on port 9042 and the queries are pretty much the same > ones. > > Before doing more analysis, I was wondering if someone has an explanation > on this problem, since perhaps we are missing something obvious here? > > Thanks > > >
Re: Unexpected high internode network activity
Thank you for your reply. To answer your points: - I fully agree on the write volume, in fact my isolated tests confirm your estimation - About the read, I agree as well, but the volume of data is still much higher - I am writing to one single keyspace with RF 3, there's just one keyspace - I am not using any indexes, the column families are very simple - I am aware of the double count, in fact, I measured the traffic on port 9042 at the client side (so just counted once) and I divided by two the traffic on port 7000 as measured on each node (35 GB -> 17.5 GB). All the measurements have been done with iftop with proper bpf filters on the port and the total traffic matches what I see in cloudwatch (divided by two) So unfortunately I still don't have any ideas about what's going on and why I'm seeing 17 GB of internode traffic instead of ~ 5-6. On Thursday, February 25, 2016, daemeon reiydelle wrote: > If read & write at quorum then you write 3 copies of the data then return > to the caller; when reading you read one copy (assume it is not on the > coordinator), and 1 digest (because read at quorum is 2, not 3). > > When you insert, how many keyspaces get written to? (Are you using e.g. > inverted indices?) That is my guess, that your db has about 1.8 bytes > written for every byte inserted. > > Every byte you write is counted also as a read (system a sends 1gb to > system b, so system b receives 1gb). You would not be charged if intra AZ, > but inter AZ and inter DC will get that double count. > > So, my guess is reverse indexes, and you forgot to include receive and > transmit. > > > > *...* > > > > *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198London (+44) (0) 20 8144 9872* > > On Thu, Feb 25, 2016 at 6:51 PM, Gianluca Borello > wrote: > >> Hello, >> >> We have a Cassandra 2.1.9 cluster on EC2 for one of our live >> applications. There's a total of 21 nodes across 3 AWS availability zones, >> c3.2xlarge instances. >> >> The configuration is pretty standard, we use the default settings that >> come with the datastax AMI and the driver in our application is configured >> to use lz4 compression. The keyspace where all the activity happens has RF >> 3 and we read and write at quorum to get strong consistency. >> >> While analyzing our monthly bill, we noticed that the amount of network >> traffic related to Cassandra was significantly higher than expected. After >> breaking it down by port, it seems like over any given time, the internode >> network activity is 6-7 times higher than the traffic on port 9042, whereas >> we would expect something around 2-3 times, given the replication factor >> and the consistency level of our queries. >> >> For example, this is the network traffic broken down by port and >> direction over a few minutes, measured as sum of each node: >> >> Port 9042 from client to cluster (write queries): 1 GB >> Port 9042 from cluster to client (read queries): 1.5 GB >> Port 7000: 35 GB, which must be divided by two because the traffic is >> always directed to another instance of the cluster, so that makes it 17.5 >> GB generated traffic >> >> The traffic on port 9042 completely matches our expectations, we do about >> 100k write operations writing 10KB binary blobs for each query, and a bit >> more reads on the same data. >> >> According to our calculations, in the worst case, when the coordinator of >> the query is not a replica for the data, this should generate about (1 + >> 1.5) * 3 = 7.5 GB, and instead we see 17 GB, which is quite a lot more. 
>> >> Also, hinted handoffs are disabled and nodes are healthy over the period >> of observation, and I get the same numbers across pretty much every time >> window, even including an entire 24 hours period. >> >> I tried to replicate this problem in a test environment so I connected a >> client to a test cluster done in a bunch of Docker containers (same >> parameters, essentially the only difference is the >> GossipingPropertyFileSnitch instead of the EC2 one) and I always get what I >> expect, the amount of traffic on port 7000 is between 2 and 3 times the >> amount of traffic on port 9042 and the queries are pretty much the same >> ones. >> >> Before doing more analysis, I was wondering if someone has an explanation >> on this problem, since perhaps we are missing something obvious here? >> >> Thanks >> >> >> >
Re: Unexpected high internode network activity
Intriguing. It's enough data to look like full data is coming from the replicants instead of digests when the read of the copy occurs. Are you doing backup/dr? Are directories copied regularly and over the network or ? *...* *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198London (+44) (0) 20 8144 9872* On Thu, Feb 25, 2016 at 8:12 PM, Gianluca Borello wrote: > Thank you for your reply. > > To answer your points: > > - I fully agree on the write volume, in fact my isolated tests confirm > your estimation > > - About the read, I agree as well, but the volume of data is still much > higher > > - I am writing to one single keyspace with RF 3, there's just one keyspace > > - I am not using any indexes, the column families are very simple > > - I am aware of the double count, in fact, I measured the traffic on port > 9042 at the client side (so just counted once) and I divided by two the > traffic on port 7000 as measured on each node (35 GB -> 17.5 GB). All the > measurements have been done with iftop with proper bpf filters on the > port and the total traffic matches what I see in cloudwatch (divided by two) > > So unfortunately I still don't have any ideas about what's going on and > why I'm seeing 17 GB of internode traffic instead of ~ 5-6. > > On Thursday, February 25, 2016, daemeon reiydelle > wrote: > >> If read & write at quorum then you write 3 copies of the data then return >> to the caller; when reading you read one copy (assume it is not on the >> coordinator), and 1 digest (because read at quorum is 2, not 3). >> >> When you insert, how many keyspaces get written to? (Are you using e.g. >> inverted indices?) That is my guess, that your db has about 1.8 bytes >> written for every byte inserted. >> >> Every byte you write is counted also as a read (system a sends 1gb to >> system b, so system b receives 1gb). You would not be charged if intra AZ, >> but inter AZ and inter DC will get that double count. >> >> So, my guess is reverse indexes, and you forgot to include receive and >> transmit. >> >> >> >> *...* >> >> >> >> *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198 >> <%28%2B1%29%20415.501.0198>London (+44) (0) 20 8144 9872 >> <%28%2B44%29%20%280%29%2020%208144%209872>* >> >> On Thu, Feb 25, 2016 at 6:51 PM, Gianluca Borello >> wrote: >> >>> Hello, >>> >>> We have a Cassandra 2.1.9 cluster on EC2 for one of our live >>> applications. There's a total of 21 nodes across 3 AWS availability zones, >>> c3.2xlarge instances. >>> >>> The configuration is pretty standard, we use the default settings that >>> come with the datastax AMI and the driver in our application is configured >>> to use lz4 compression. The keyspace where all the activity happens has RF >>> 3 and we read and write at quorum to get strong consistency. >>> >>> While analyzing our monthly bill, we noticed that the amount of network >>> traffic related to Cassandra was significantly higher than expected. After >>> breaking it down by port, it seems like over any given time, the internode >>> network activity is 6-7 times higher than the traffic on port 9042, whereas >>> we would expect something around 2-3 times, given the replication factor >>> and the consistency level of our queries. 
>>> >>> For example, this is the network traffic broken down by port and >>> direction over a few minutes, measured as sum of each node: >>> >>> Port 9042 from client to cluster (write queries): 1 GB >>> Port 9042 from cluster to client (read queries): 1.5 GB >>> Port 7000: 35 GB, which must be divided by two because the traffic is >>> always directed to another instance of the cluster, so that makes it 17.5 >>> GB generated traffic >>> >>> The traffic on port 9042 completely matches our expectations, we do >>> about 100k write operations writing 10KB binary blobs for each query, and a >>> bit more reads on the same data. >>> >>> According to our calculations, in the worst case, when the coordinator >>> of the query is not a replica for the data, this should generate about (1 + >>> 1.5) * 3 = 7.5 GB, and instead we see 17 GB, which is quite a lot more. >>> >>> Also, hinted handoffs are disabled and nodes are healthy over the period >>> of observation, and I get the same numbers across pretty much every time >>> window, even including an entire 24 hours period. >>> >>> I tried to replicate this problem in a test environment so I connected a >>> client to a test cluster done in a bunch of Docker containers (same >>> parameters, essentially the only difference is the >>> GossipingPropertyFileSnitch instead of the EC2 one) and I always get what I >>> expect, the amount of traffic on port 7000 is between 2 and 3 times the >>> amount of traffic on port 9042 and the queries are pretty much the same >>> ones. >>> >>> Before doing more analysis, I was wondering if someone has an >>> explanation on this problem, since perhaps we are missing something obvious >>> here? >>> >>> Thanks >>> >>> >>> >>
Re: Unexpected high internode network activity
It is indeed very intriguing and I really hope to learn more from the experience of this mailing list. To address your points: - The theory that full data is coming from replicas during reads is not enough to explain the situation. In my scenario, over a time window I had 17.5 GB of intra node activity (port 7000) for 1 GB of writes and 1.5 GB of reads (measured on port 9042), so even if both reads and writes affected all replicas, I would have (1 + 1.5) * 3 = 7.5 GB, still leaving 10 GB on port 7000 unaccounted - We are doing regular backups the standard way, using periodic snapshots and synchronizing them to S3. This traffic is not part of the anomalous traffic we're seeing above, since this one goes on port 80 and it's clearly visible with a separate bpf filter, and its magnitude is far lower than that anyway Thanks On Thu, Feb 25, 2016 at 9:03 PM, daemeon reiydelle wrote: > Intriguing. It's enough data to look like full data is coming from the > replicants instead of digests when the read of the copy occurs. Are you > doing backup/dr? Are directories copied regularly and over the network or ? > > > *...* > > > > *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198 > <%28%2B1%29%20415.501.0198>London (+44) (0) 20 8144 9872 > <%28%2B44%29%20%280%29%2020%208144%209872>* > > On Thu, Feb 25, 2016 at 8:12 PM, Gianluca Borello > wrote: > >> Thank you for your reply. >> >> To answer your points: >> >> - I fully agree on the write volume, in fact my isolated tests confirm >> your estimation >> >> - About the read, I agree as well, but the volume of data is still much >> higher >> >> - I am writing to one single keyspace with RF 3, there's just one >> keyspace >> >> - I am not using any indexes, the column families are very simple >> >> - I am aware of the double count, in fact, I measured the traffic on port >> 9042 at the client side (so just counted once) and I divided by two the >> traffic on port 7000 as measured on each node (35 GB -> 17.5 GB). All the >> measurements have been done with iftop with proper bpf filters on the >> port and the total traffic matches what I see in cloudwatch (divided by two) >> >> So unfortunately I still don't have any ideas about what's going on and >> why I'm seeing 17 GB of internode traffic instead of ~ 5-6. >> >> On Thursday, February 25, 2016, daemeon reiydelle >> wrote: >> >>> If read & write at quorum then you write 3 copies of the data then >>> return to the caller; when reading you read one copy (assume it is not on >>> the coordinator), and 1 digest (because read at quorum is 2, not 3). >>> >>> When you insert, how many keyspaces get written to? (Are you using e.g. >>> inverted indices?) That is my guess, that your db has about 1.8 bytes >>> written for every byte inserted. >>> >>> Every byte you write is counted also as a read (system a sends 1gb to >>> system b, so system b receives 1gb). You would not be charged if intra AZ, >>> but inter AZ and inter DC will get that double count. >>> >>> So, my guess is reverse indexes, and you forgot to include receive and >>> transmit. >>> >>> >>> >>> *...* >>> >>> >>> >>> *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198 >>> <%28%2B1%29%20415.501.0198>London (+44) (0) 20 8144 9872 >>> <%28%2B44%29%20%280%29%2020%208144%209872>* >>> >>> On Thu, Feb 25, 2016 at 6:51 PM, Gianluca Borello >>> wrote: >>> Hello, We have a Cassandra 2.1.9 cluster on EC2 for one of our live applications. There's a total of 21 nodes across 3 AWS availability zones, c3.2xlarge instances. 
The configuration is pretty standard, we use the default settings that come with the datastax AMI and the driver in our application is configured to use lz4 compression. The keyspace where all the activity happens has RF 3 and we read and write at quorum to get strong consistency. While analyzing our monthly bill, we noticed that the amount of network traffic related to Cassandra was significantly higher than expected. After breaking it down by port, it seems like over any given time, the internode network activity is 6-7 times higher than the traffic on port 9042, whereas we would expect something around 2-3 times, given the replication factor and the consistency level of our queries. For example, this is the network traffic broken down by port and direction over a few minutes, measured as sum of each node: Port 9042 from client to cluster (write queries): 1 GB Port 9042 from cluster to client (read queries): 1.5 GB Port 7000: 35 GB, which must be divided by two because the traffic is always directed to another instance of the cluster, so that makes it 17.5 GB generated traffic The traffic on port 9042 completely matches our expectations, we do about 100k write operations writing 10KB binary blobs for each query, and a bit more reads on the same data. According to our calculations, in the worst case, wh
RE: CsvReporter not spitting out metrics in cassandra
Hi,

I configured this reporter recently with Apache Cassandra 2.1.x and had no trouble. Here are some points to check:

- The directory "/etc/dse/cassandra" has to be in the classpath (I'm not a DSE user, so I don't know if that is already the case).
- If the CsvReporter fails to start (permissions issue on the output directory?), you should have some logs at ERROR level in your Cassandra log files.

Eric

From: Vikram Kone [mailto:vikramk...@gmail.com]
Sent: Thursday, February 25, 2016 21:41
To: user@cassandra.apache.org
Subject: CsvReporter not spitting out metrics in cassandra

Hi,
I have added the following file on my cassandra node: /etc/dse/cassandra/metrics-reporter-config.yaml

csv:
 - outdir: '/mnt/cassandra/metrics'
   period: 10
   timeunit: 'SECONDS'
   predicate:
     color: "white"
     useQualifiedName: true
     patterns:
       - "^org.apache.cassandra.metrics.Cache.+"
       - "^org.apache.cassandra.metrics.ClientRequest.+"
       - "^org.apache.cassandra.metrics.CommitLog.+"
       - "^org.apache.cassandra.metrics.Compaction.+"
       - "^org.apache.cassandra.metrics.DroppedMetrics.+"
       - "^org.apache.cassandra.metrics.ReadRepair.+"
       - "^org.apache.cassandra.metrics.Storage.+"
       - "^org.apache.cassandra.metrics.ThreadPools.+"
       - "^org.apache.cassandra.metrics.ColumnFamily.+"
       - "^org.apache.cassandra.metrics.Streaming.+"

And then added this line to /etc/dse/cassandra/cassandra-env.sh:

JVM_OPTS="$JVM_OPTS -Dcassandra.metricsReporterConfigFile=metrics-reporter-config.yaml"

And then finally restarted DSE: /etc/init.d/dse restart

I don't see any CSV metrics files being written out by the reporter in the /mnt/cassandra/metrics folder. Any ideas why?
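A quick way to confirm whether the reporter is emitting anything at all is to look for fresh CSV files under the configured outdir; a minimal sketch, assuming the paths and the 10-second period from the message above:

# Local troubleshooting helper: list CSV files under the configured outdir
# and flag them if they are missing or stale relative to the reporter period.
import glob
import os
import time

OUTDIR = '/mnt/cassandra/metrics'
MAX_AGE_SECONDS = 60   # the configured period is 10s, so files should be fresh

csv_files = glob.glob(os.path.join(OUTDIR, '*.csv'))
if not csv_files:
    print("No CSV files in %s -- check the Cassandra logs for CsvReporter "
          "ERRORs and the permissions on the directory." % OUTDIR)
else:
    now = time.time()
    stale = [f for f in csv_files if now - os.path.getmtime(f) > MAX_AGE_SECONDS]
    print("%d metric files found, %d stale" % (len(csv_files), len(stale)))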
Re: Unexpected high internode network activity
Hmm. From the AWS FAQ: *Q: If I have two instances in different availability zones, how will I be charged for regional data transfer?* Each instance is charged for its data in and data out. Therefore, if data is transferred between these two instances, it is charged out for the first instance and in for the second instance. I really am not seeing this factored into your numbers fully. If data transfer is only twice as much as expected, the above billing would seem to put the numbers in line. Since (I assume) you have one copy in EACH AZ (dc aware but really dc=az) I am not seeing the bandwidth as that much out of line. *...* *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198London (+44) (0) 20 8144 9872* On Thu, Feb 25, 2016 at 11:00 PM, Gianluca Borello wrote: > It is indeed very intriguing and I really hope to learn more from the > experience of this mailing list. To address your points: > > - The theory that full data is coming from replicas during reads is not > enough to explain the situation. In my scenario, over a time window I had > 17.5 GB of intra node activity (port 7000) for 1 GB of writes and 1.5 GB of > reads (measured on port 9042), so even if both reads and writes affected > all replicas, I would have (1 + 1.5) * 3 = 7.5 GB, still leaving 10 GB on > port 7000 unaccounted > > - We are doing regular backups the standard way, using periodic snapshots > and synchronizing them to S3. This traffic is not part of the anomalous > traffic we're seeing above, since this one goes on port 80 and it's clearly > visible with a separate bpf filter, and its magnitude is far lower than > that anyway > > Thanks > > On Thu, Feb 25, 2016 at 9:03 PM, daemeon reiydelle > wrote: > >> Intriguing. It's enough data to look like full data is coming from the >> replicants instead of digests when the read of the copy occurs. Are you >> doing backup/dr? Are directories copied regularly and over the network or ? >> >> >> *...* >> >> >> >> *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198 >> <%28%2B1%29%20415.501.0198>London (+44) (0) 20 8144 9872 >> <%28%2B44%29%20%280%29%2020%208144%209872>* >> >> On Thu, Feb 25, 2016 at 8:12 PM, Gianluca Borello >> wrote: >> >>> Thank you for your reply. >>> >>> To answer your points: >>> >>> - I fully agree on the write volume, in fact my isolated tests confirm >>> your estimation >>> >>> - About the read, I agree as well, but the volume of data is still much >>> higher >>> >>> - I am writing to one single keyspace with RF 3, there's just one >>> keyspace >>> >>> - I am not using any indexes, the column families are very simple >>> >>> - I am aware of the double count, in fact, I measured the traffic on >>> port 9042 at the client side (so just counted once) and I divided by two >>> the traffic on port 7000 as measured on each node (35 GB -> 17.5 GB). All >>> the measurements have been done with iftop with proper bpf filters on the >>> port and the total traffic matches what I see in cloudwatch (divided by two) >>> >>> So unfortunately I still don't have any ideas about what's going on and >>> why I'm seeing 17 GB of internode traffic instead of ~ 5-6. >>> >>> On Thursday, February 25, 2016, daemeon reiydelle >>> wrote: >>> If read & write at quorum then you write 3 copies of the data then return to the caller; when reading you read one copy (assume it is not on the coordinator), and 1 digest (because read at quorum is 2, not 3). When you insert, how many keyspaces get written to? (Are you using e.g. inverted indices?) 
That is my guess, that your db has about 1.8 bytes written for every byte inserted. Every byte you write is counted also as a read (system a sends 1gb to system b, so system b receives 1gb). You would not be charged if intra AZ, but inter AZ and inter DC will get that double count. So, my guess is reverse indexes, and you forgot to include receive and transmit. *...* *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198 <%28%2B1%29%20415.501.0198>London (+44) (0) 20 8144 9872 <%28%2B44%29%20%280%29%2020%208144%209872>* On Thu, Feb 25, 2016 at 6:51 PM, Gianluca Borello wrote: > Hello, > > We have a Cassandra 2.1.9 cluster on EC2 for one of our live > applications. There's a total of 21 nodes across 3 AWS availability zones, > c3.2xlarge instances. > > The configuration is pretty standard, we use the default settings that > come with the datastax AMI and the driver in our application is configured > to use lz4 compression. The keyspace where all the activity happens has RF > 3 and we read and write at quorum to get strong consistency. > > While analyzing our monthly bill, we noticed that the amount of > network traffic related to Cassandra was significantly higher than > expected. After breaking it down by port, it seems like over any given > time, the internode