Re: Cassandra compression not working?
I forgot to mention we are running Cassandra 1.1.2. Thanks, -Mike On Sep 24, 2012, at 5:00 PM, Michael Theroux wrote: > Hello, > > We are running into an unusual situation that I'm wondering if anyone has any > insight on. We've been running a Cassandra cluster for some time, with > compression enabled on one column family in which text documents are stored. > We enabled compression on the column family, utilizing the SnappyCompressor > and a 64k chunk length. > > It was recently discovered that Cassandra was reporting a compression ratio > of 0. I took a snapshot of the data and started a Cassandra node in > isolation to investigate. > > Running nodetool scrub or nodetool upgradesstables had little impact on the > amount of data that was being stored. > > I then disabled compression and ran nodetool upgradesstables on the column > family. Again, no impact on the data size stored. > > I then re-enabled compression and ran nodetool upgradesstables on the column > family. This resulted in a 60% reduction in the data size stored, and > Cassandra reporting a compression ratio of about .38. > > Any idea what is going on here? Obviously I can go through this process in > production to enable compression; however, any idea what is currently > happening and why new data does not appear to be compressed? > > Any insights are appreciated, > Thanks, > -Mike
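For anyone who hits the same thing, the sequence that finally got the existing data compressed was essentially the following (cassandra-cli syntax from memory, and the keyspace/column family names here are placeholders -- double-check against `help update column family;` on your version):

    update column family documents with compression_options = {sstable_compression: SnappyCompressor, chunk_length_kb: 64};
    nodetool upgradesstables open documents    # rewrite existing SSTables so they actually get compressed
    # then re-check the compression ratio wherever you saw the 0 reading (JMX / cfstats)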
Row caching + Wide row column family == almost crashed?
Hello, We recently hit an issue within our Cassandra based application. We have a relatively new Column Family with some very wide rows (tens of thousands of columns, or more in some cases). During a periodic activity, we query ranges of columns to retrieve various pieces of information, a segment at a time. We perform these same queries frequently at various stages of the process, and I thought the application could see a performance benefit from row caching. We have a small row cache (100MB per node) already enabled, and I enabled row caching on the new column family. The results were very negative. When performing range queries with a limit of 200 results, for a small minority of the rows in the new column family, performance plummeted. CPU utilization on the Cassandra node went through the roof, and it started chewing up memory. Some queries to this column family hung completely. According to the logs, we started getting frequent GCInspector messages. Cassandra started flushing the largest memtables due to hitting the "flush_largest_memtables_at" threshold of 75%, and scaling back the key/row caches. However, to Cassandra's credit, it did not die with an OutOfMemory error. Its emergency measures to conserve memory worked, and the cluster stayed up and running. No real errors showed in the logs, except for messages getting dropped, which I believe was caused by what was going on with CPU and memory. Disabling row caching on this new column family has resolved the issue for now, but is there something fundamental about row caching that I am missing? We are running Cassandra 1.1.2 with a 6 node cluster, with a replication factor of 3. Thanks, -Mike
Re: Row caching + Wide row column family == almost crashed?
Thanks for all the responses! On 12/3/2012 6:55 PM, Bill de hÓra wrote: A Cassandra JVM will generally not function well with caches and wide rows. Probably the most important thing to understand is Ed's point, that the row cache caches the entire row, not just the slice that was read out. What you've seen is almost exactly the observed behaviour I'd expect with enabling either cache provider over wide rows. - the on-heap cache will result in evictions that crush the JVM trying to manage garbage. This is also the case if the rows have an uneven size distribution (small rows can push out a single large row, large rows push out many small ones, etc.). - the off-heap cache will spend a lot of time serializing and deserializing wide rows, such that it can increase latency relative to just reading from disk and leveraging the filesystem's cache directly. The cache resizing behaviour does exist to preserve the server's memory, but it can also cause a death spiral in the on-heap case, because a relatively smaller cache may result in data being evicted more frequently. I've seen cases where sizing up the cache can stabilise a server's memory. This isn't just a Cassandra thing, it simply happens to be very evident with that system - generally, to get an effective benefit from a cache, the data should be consistently sized and not too large, to allow effective cache 'lining'. Bill On 02/12/12 21:36, Mike wrote: Hello, We recently hit an issue within our Cassandra based application. We have a relatively new Column Family with some very wide rows (tens of thousands of columns, or more in some cases). During a periodic activity, we query ranges of columns to retrieve various pieces of information, a segment at a time. We perform these same queries frequently at various stages of the process, and I thought the application could see a performance benefit from row caching. We have a small row cache (100MB per node) already enabled, and I enabled row caching on the new column family. The results were very negative. When performing range queries with a limit of 200 results, for a small minority of the rows in the new column family, performance plummeted. CPU utilization on the Cassandra node went through the roof, and it started chewing up memory. Some queries to this column family hung completely. According to the logs, we started getting frequent GCInspector messages. Cassandra started flushing the largest memtables due to hitting the "flush_largest_memtables_at" threshold of 75%, and scaling back the key/row caches. However, to Cassandra's credit, it did not die with an OutOfMemory error. Its emergency measures to conserve memory worked, and the cluster stayed up and running. No real errors showed in the logs, except for messages getting dropped, which I believe was caused by what was going on with CPU and memory. Disabling row caching on this new column family has resolved the issue for now, but is there something fundamental about row caching that I am missing? We are running Cassandra 1.1.2 with a 6 node cluster, with a replication factor of 3. Thanks, -Mike
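For reference, scoping the row cache change to just the wide-row column family (rather than the global cache) looks roughly like this on 1.1 -- cassandra-cli syntax from memory, and the CF name is a placeholder:

    update column family wide_rows with caching = 'keys_only';   # or 'none' to drop both caches for this CF
    nodetool info    # global key/row cache sizes and hit rates, to confirm the row cache drains out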
Diagnosing memory issues
StatusLogger.java (line 116) system.LocationInfo 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,705 StatusLogger.java (line 116) system.Versions 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,705 StatusLogger.java (line 116) system.schema_keyspaces 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,705 StatusLogger.java (line 116) system.Migrations 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,706 StatusLogger.java (line 116) system.schema_columnfamilies 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,706 StatusLogger.java (line 116) system.schema_columns 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,706 StatusLogger.java (line 116) system.HintsColumnFamily 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,706 StatusLogger.java (line 116) system.Schema 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,707 StatusLogger.java (line 116) open.comp 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,707 StatusLogger.java (line 116) open.bp 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,707 StatusLogger.java (line 116) open.bn 312832,47184787
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,707 StatusLogger.java (line 116) open.p 711,193201
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,707 StatusLogger.java (line 116) open.bid 273064,46316018
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,708 StatusLogger.java (line 116) open.rel 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,708 StatusLogger.java (line 116) open.images 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,708 StatusLogger.java (line 116) open.users 62287,86665510
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,708 StatusLogger.java (line 116) open.sessions 4710,13153051
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,709 StatusLogger.java (line 116) open.userIndices 4,1960
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,709 StatusLogger.java (line 116) open.caches 50,4813457
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,709 StatusLogger.java (line 116) open.content 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,710 StatusLogger.java (line 116) open.enrich 30,20793
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,744 StatusLogger.java (line 116) open.bt 1133,776831
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,863 StatusLogger.java (line 116) open.alias 253,163933
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,864 StatusLogger.java (line 116) open.bymsgid 249610,73075517
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,864 StatusLogger.java (line 116) open.rank 319956,70898417
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,865 StatusLogger.java (line 116) open.cmap 448,406193
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,865 StatusLogger.java (line 116) open.pmap 659,566220
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,865 StatusLogger.java (line 116) open.pict 50944,58659596
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,878 StatusLogger.java (line 116) open.w 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,879 StatusLogger.java (line 116) open.s 92395,46160381
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,879 StatusLogger.java (line 116) open.bymrel 136607,57780555
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,879 StatusLogger.java (line 116) open.m 26720,51150067
It's appreciated, Thanks, -Mike
Re: Diagnosing memory issues
Thank you for the response. Since the time of this question, we've identified a number of areas that needed improving and have helped things along quite a bit. To answer your question, we were seeing both ParNew and CMS. There were no errors in the log, and all the nodes have been up. However, we are seeing one interesting issue. We are running a 6 node cluster with a Replication Factor of 3. The nodes are pretty evenly balanced. All reads and writes to Cassandra use LOCAL_QUORUM consistency. We are seeing a very interesting problem from the JMX statistics. We discovered we had one column family with an extremely high and unexpected write count. The writes to this column family are done in conjunction with other writes to other column families such that their numbers should be roughly equivalent, but they are off by a factor of 10. We have yet to find anything in our code that could cause this discrepancy in numbers. What is really interesting is that we see this behavior on only 5 of the 6 nodes in our cluster. On 5 of the 6 nodes, we see statistics indicating we are writing too fast and this specific memtable is exceeding its 128MB limit, while this one other node seems to be handling the load OK (memtables stay within their limits). Given our replication factor, I'm not sure how this is possible. Any hints on what might be causing this additional load? Are there other activities in Cassandra that might account for this increased load on a single column family? Any insights would be appreciated, -Mike On 12/4/2012 3:33 PM, aaron morton wrote: For background, a discussion on estimating working set: http://www.mail-archive.com/user@cassandra.apache.org/msg25762.html . You can also just look at the size of tenured heap after a CMS. Are you seeing lots of ParNew or CMS? GC activity is a result of configuration *and* workload. Look in your data model for wide rows, or long lived rows that get a lot of deletes, and look in your code for large reads / writes (e.g. sometimes we read 100,000 columns from a row). The number that really jumps out at me below is the number of Pending requests for the Messaging Service. 24,000+ pending requests. INFO [ScheduledTasks:1] 2012-12-04 09:00:37,702 StatusLogger.java (line 89) MessagingService n/a 24,229 Technically speaking that ain't right. The whole server looks unhappy. Are there any errors in the logs? Are all the nodes up? A very blunt approach is to reduce the in_memory_compaction_limit and the concurrent_compactors or compaction_throughput_mb_per_sec. This reduces the impact compaction and repair have on the system and may give you breathing space to look at other causes. Once you have a feel for what's going on you can turn them up. Hope that helps. A - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 5/12/2012, at 7:04 AM, Mike <mthero...@yahoo.com> wrote: Hello, Our Cassandra cluster has, relatively recently, started experiencing memory pressure that I am in the midst of diagnosing. Our system has uneven levels of traffic, relatively light during the day, but extremely heavy during some overnight processing. We have started getting a message: WARN [ScheduledTasks:1] 2012-12-04 09:08:58,579 GCInspector.java (line 145) Heap is 0.7520105072262254 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. 
Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically. I've started implementing some instrumentation to gather stats from JMX to determine what is happening. However, last night, the GCInspector was kind enough to log the information below. A couple of things jumped out at me. The maximum heap for Cassandra is 4GB. We are running Cassandra 1.1.2, on a 6 node cluster, with a replication factor of 3. All our queries use LOCAL_QUORUM consistency. Adding up the caches plus the memtable "data" in the trace below comes to under 600MB. The number that really jumps out at me below is the number of Pending requests for the Messaging Service: 24,000+ pending requests. Does this number represent the number of outstanding client requests that this node is processing? If so, does this mean we potentially have 24,000 responses being pulled into memory, thereby causing this memory issue? What else should I look at? INFO [ScheduledTasks:1] 2012-12-04 09:00:37,585 StatusLogger.java (line 57) Pool Name Active Pending Blocked INFO [ScheduledTasks:1] 2012-12-04 09:00:37,695 StatusLogger.java (line 72) ReadStage3266 0 INFO [ScheduledTasks:1] 2012-12-04 09:00:37,696 StatusLogger.java (line 72) RequestResponseStage 0
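(For reference, the same numbers can also be pulled on demand with nodetool rather than waiting for the StatusLogger; these subcommands exist in 1.1, while the MessagingService pending counts come from its own JMX MBean under org.apache.cassandra.net rather than tpstats:)

    nodetool -h localhost tpstats     # per-stage Active / Pending / Blocked counts
    nodetool -h localhost cfstats     # per-CF memtable ops/data and read/write latencies
    nodetool -h localhost info        # heap used/total plus key and row cache sizes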
Re: Read operations resulting in a write?
Thank you Aaron, this was very helpful. Could it be an issue that this optimization does not really take effect until the memtable with the hoisted data is flushed? In my simple example below, the same row is updated, and multiple selects of the same row will result in multiple writes to the memtable. It seems it may be possible (although unlikely) that, if you go from a write-mostly to a read-mostly scenario, you could get into a state where you are stuck rewriting to the same memtable, and the memtable is not flushed because it absorbs the over-writes. I can foresee this especially if you are reading the same rows repeatedly. I also noticed from the code paths that if row caching is enabled, this optimization will not occur. We made some changes this weekend to make this column family more suitable to row-caching and enabled row-caching with a small cache. Our initial results are that it seems to have corrected the write counts, and has increased performance quite a bit. However, are there any hidden gotchas there because this optimization is not occurring? https://issues.apache.org/jira/browse/CASSANDRA-2503 mentions a "compaction is behind" problem. Any history on that? I couldn't find too much information on it. Thanks, -Mike On 12/16/2012 8:41 PM, aaron morton wrote: 1) Am I reading things correctly? Yes. If you do a read/slice by name and more than min-compaction-threshold SSTables were read, the data is re-written so that the next read uses fewer SSTables. 2) What is really happening here? Essentially minor compactions can occur between 4 and 32 memtable flushes. Looking through the code, this seems to only affect a couple of types of select statements (selecting a specific column on a specific key being one of them). During the time between these two values, every "select" statement will perform a write. Yup, only for reading a row where the column names are specified. Remember minor compaction when using Size Tiered Compaction (the default) works on buckets of the same size. Imagine a row that had been around for a while and had fragments in more than Min Compaction Threshold sstables. Say it is in 3 SSTables in the 2nd tier and 2 sstables in the 1st. So it takes (potentially) 5 SSTable reads. If this row is read it will get hoisted back up. But if the row is in only 1 SSTable in the 2nd tier and 2 in the 1st tier, it will not be hoisted. There are a few short circuits in the SliceByName read path. One of them is to end the search when we know that no other SSTables contain columns that should be considered. So if the 4 columns you read frequently are hoisted into the 1st bucket, your reads will get handled by that one bucket. It's not every select, just those that touched more than min compaction sstables. 3) Is this desired behavior? Is there something else I should be looking at that could be causing this behavior? Yes. https://issues.apache.org/jira/browse/CASSANDRA-2503 Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 15/12/2012, at 12:58 PM, Michael Theroux <mthero...@yahoo.com> wrote: Hello, We have an unusual situation that I believe I've reproduced, at least temporarily, in a test environment. I also think I see where this issue is occurring in the code. We have a specific column family that is under heavy read and write load on a nightly basis. For the purposes of this description, I'll refer to this column family as "Bob". 
During this nightly processing, sometimes Bob is under very heavy write load, other times it is under very heavy read load. The application is such that when something is written to Bob, a write is made to one of two other tables. We've witnessed a situation where the write count on Bob far outstrips the write count on either of the other tables, by a factor of 3 to 10. This is based on the WriteCount available on the column family JMX MBean. We have not been able to find where in our code this is happening, and we have gone as far as tracing our CQL calls to determine that the relationship between Bob and the other tables is what we expect. I brought up a test node to experiment, and I see a situation where, when a "select" statement is executed, a write will occur. In my test, I perform the following (switching between nodetool and cqlsh):
update bob set 'about'='coworker' where key='';
nodetool flush
update bob set 'about'='coworker' where key='';
nodetool flush
update bob set 'about'='coworker' where key='';
nodetool flush
update bob set 'about'='coworker' where key='';
nodetool flush
update bob set 'about'='coworker' where key='';
nodetool flush
Then, for a period of
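One way to watch the hoisting writes without touching the application is to compare the Write Count before and after a pure read (cfstats field names as in 1.1; the CF name "bob" and the grep offsets are just what worked for our layout):

    nodetool cfstats | grep -A 15 "Column Family: bob"   # note the Write Count
    # run the same select again in cqlsh, with no application writes in flight,
    # then re-run the command; if Write Count climbs on a pure read, it's the
    # hoisting path described above rather than application traffic
    nodetool cfstats | grep -A 15 "Column Family: bob"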
Column Family migration/tombstones
Hello, We are undergoing a change to our internal data model that will result in the eventual deletion of over a hundred million rows from a Cassandra column family. From what I understand, this will result in the generation of tombstones, which will be cleaned up during compaction, after the gc_grace_period time (default: 10 days). A couple of questions: 1) As one can imagine, the index and bloom filter for this column family are large. Am I correct to assume that bloom filter and index space will not be reduced until after gc_grace_period? 2) If I manually run repair across the cluster, is there a process I can use to safely remove these tombstones before gc_grace period to free this memory sooner? 3) Any words of warning when undergoing this? We are running Cassandra 1.1.2 on a 6 node cluster and a Replication Factor of 3. We use LOCAL_QUORUM consistency for all operations. Thanks! -Mike
Re: Column Family migration/tombstones
A couple more questions. When these rows are deleted, tombstones will be created and stored in more recent sstables. Upon compaction of sstables, and after gc_grace_period, I presume Cassandra will have removed all traces of that row from disk. However, after deleting such a large amount of information, there is no guarantee that Cassandra will compact these two tables together, causing the data to be deleted (right?). Therefore, even after gc_grace_period, a large amount of space may still be used. Is there a way, other than a major compaction, to clean up all this old data? I assume a nodetool scrub will clean up old tombstones only if that row is not in another sstable? Do tombstones take up bloom filter space after gc_grace_period? -Mike On 1/2/2013 6:41 PM, aaron morton wrote: 1) As one can imagine, the index and bloom filter for this column family are large. Am I correct to assume that bloom filter and index space will not be reduced until after gc_grace_period? Yes. 2) If I manually run repair across the cluster, is there a process I can use to safely remove these tombstones before gc_grace period to free this memory sooner? There is nothing to specifically purge tombstones. You can temporarily reduce the gc_grace_seconds and then trigger compaction, either by reducing the min_compaction_threshold to 2 and doing a flush, or by kicking off a user defined compaction using the JMX interface. 3) Any words of warning when undergoing this? Make sure you have a good breakfast. (It's more general advice than Cassandra specific.) Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 30/12/2012, at 8:51 AM, Mike wrote: Hello, We are undergoing a change to our internal data model that will result in the eventual deletion of over a hundred million rows from a Cassandra column family. From what I understand, this will result in the generation of tombstones, which will be cleaned up during compaction, after the gc_grace_period time (default: 10 days). A couple of questions: 1) As one can imagine, the index and bloom filter for this column family are large. Am I correct to assume that bloom filter and index space will not be reduced until after gc_grace_period? 2) If I manually run repair across the cluster, is there a process I can use to safely remove these tombstones before gc_grace period to free this memory sooner? 3) Any words of warning when undergoing this? We are running Cassandra 1.1.2 on a 6 node cluster and a Replication Factor of 3. We use LOCAL_QUORUM consistency for all operations. Thanks! -Mike
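For what it's worth, a rough sketch of both approaches Aaron describes (cassandra-cli/nodetool syntax from memory; the keyspace, CF, and SSTable file names are placeholders, and the JMX operation signature varies between versions, so confirm it in jconsole first):

    update column family msgs with gc_grace = 3600 and min_compaction_threshold = 2;
    nodetool flush open msgs
    # wait for the resulting minor compactions, then restore the original values

    # or, a user defined compaction over specific SSTables via JMX (jmxterm is a third-party CLI):
    java -jar jmxterm.jar -l localhost:7199
    bean org.apache.cassandra.db:type=CompactionManager
    run forceUserDefinedCompaction open msgs-hf-1234-Data.db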
Re: Column Family migration/tombstones
Thanks Aaron, I appreciate it. It is my understanding that major compactions are not recommended because they will essentially create one massive SSTable that will not compact with any new SSTables for some time. I can see how this might be a performance concern in the general case, because any read operation would always require multiple disk reads across multiple SSTables. In addition, information in the new table will not be purged by subsequent tombstones until that table can be compacted. This might then require regular major compactions to be able to clear that data. Are there other performance considerations that I need to keep in mind? However, this might not be as much of an issue in our usecase. It just so happens that the data in this column family is changed very infrequently, except for deletes (as of recently, and these will now occur over time). In this case, I don't believe having data spread across the SSTables will be an issue, as either the data will have a tombstone (which causes Cassandra to stop looking at other SSTables), or that data will be in one SSTable. So I do not believe I/O will end up being an issue here. What may be an issue is cleaning out old data in the SSTable that will exist after a major compaction. However, this might not require major compactions to happen nearly as frequently as I've seen recommended (once every gc_grace period), or at all. With the new design, data will be deleted from this table after a number of days. Deletes against the remaining data after a major compaction might not get processed until the next major compaction, but any deletes against new data should be handled normally through minor compactions. In addition, the remaining data after we complete the migration should be fairly small (about 500,000 skinny rows per node, including replicas). Any other thoughts on this? -Mike On 1/6/2013 3:49 PM, aaron morton wrote: When these rows are deleted, tombstones will be created and stored in more recent sstables. Upon compaction of sstables, and after gc_grace_period, I presume cassandra will have removed all traces of that row from disk. Yes. When using Size Tiered compaction (the default) tombstones are purged when all fragments of a row are included in a compaction. So if you have rows which are written to for A Very Long Time(™) it can take a while for everything to get purged. In the normal case though it's not a concern. However, after deleting such a large amount of information, there is no guarantee that Cassandra will compact these two tables together, causing the data to be deleted (right?). Therefore, even after gc_grace_period, a large amount of space may still be used. In the normal case this is not really an issue. In your case things sound a little non normal. If you will have only a few hundred MB's, or a few GB's, of data left in the CF I would consider running a major compaction on it. Major compaction will work on all SSTables and create one big SSTable; this will ensure all deleted data is purged. We normally caution against this as the one new file is often very big and will not get compacted for a while. However if you are deleting lots-o-data it may work. (There is also an anti compaction script around that may be of use.) Another alternative is to compact some of the older sstables with newer ones via User Defined Compaction with JMX. Is there a way, other than a major compaction, to clean up all this old data? I assume a nodetool scrub will clean up old tombstones only if that row is not in another sstable? 
I don't think scrub (or upgradesstables) remove tombstones. Do tombstones take up bloomfilter space after gc_grace_period? Any row, regardless of the liveness of the columns, takes up bloom filter space (in -Filter.db). Once the row is removed it will no longer take up space. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 6/01/2013, at 6:44 AM, Mike wrote: A couple more questions. When these rows are deleted, tombstones will be created and stored in more recent sstables. Upon compaction of sstables, and after gc_grace_period, I presume cassandra will have removed all traces of that row from disk. However, after deleting such a large amount of information, there is no guarantee that Cassandra will compact these two tables together, causing the data to be deleted (right?). Therefore, even after gc_grace_period, a large amount of space may still be used. Is there a way, other than a major compaction, to clean up all this old data? I assume a nodetool scrub will cleanup old tombstones only if that row is not in another sstable? Do tombstones take up bloomfilter space after gc_grace_period? -Mike On 1/2/2013 6:41 PM, aaron morton wrote: 1) As one can imagine, the index and bloom filter for this column family is large. Am
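For what it's worth, the per-CF major compaction I'm considering is just the following (keyspace/CF names are placeholders for ours):

    nodetool compact open msgs
    nodetool cfstats | grep -A 3 "Column Family: msgs"   # SSTable count and live space afterwards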
Re: Column Family migration/tombstones
Thanks, Another related question. In the situation described below, where we have a row and a tombstone across more than one SSTable, and it would take a very long time for these SSTables to be compacted, are there two rows being tracked by bloomfilters (since there is a bloom filter per SSTable), or does Cassandra possibly do something more efficient? To extend the example, if I delete a 1,000,000 rows, and that SSTable containing 1,000,000 tombstones is not compacted with the other SSTables containing those rows, are bloomfilters accounting for 2,000,000 rows, or 1,000,000? This is more related to the current activities of deletion, as opposed to a major compaction (although the question is applicable to both). As we delete rows, will our bloomfilters grow? -Mike On 1/6/2013 3:49 PM, aaron morton wrote: When these rows are deleted, tombstones will be created and stored in more recent sstables. Upon compaction of sstables, and after gc_grace_period, I presume cassandra will have removed all traces of that row from disk. Yes. When using Size Tiered compaction (the default) tombstones are purged when all fragments of a row are included in a compaction. So if you have rows which are written to for A Very Long Time(™) it can take a while for everything to get purged. In the normal case though it's not a concern. However, after deleting such a large amount of information, there is no guarantee that Cassandra will compact these two tables together, causing the data to be deleted (right?). Therefore, even after gc_grace_period, a large amount of space may still be used. In the normal case this is not really an issue. In your case things sound a little non normal. If you will have only a few hundred MB's, or a few GB's, of data level in the CF I would consider running a major compaction on it. Major compaction will work on all SSTables and create one big SSTable, this will ensure all deleted data is deleted. We normally caution agains this as the one new file is often very big and will not get compacted for a while. However if you are deleting lots-o-data it may work. (There is also an anti compaction script around that may be of use.) Another alternative is to compact some of the older sstables with newer ones via User Defined Compaction with JMX. Is there a way, other than a major compaction, to clean up all this old data? I assume a nodetool scrub will cleanup old tombstones only if that row is not in another sstable? I don't think scrub (or upgradesstables) remove tombstones. Do tombstones take up bloomfilter space after gc_grace_period? Any row, regardless of the liveness of the columns, takes up bloom filter space (in -Filter.db). Once the row is removed it will no longer take up space. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 6/01/2013, at 6:44 AM, Mike wrote: A couple more questions. When these rows are deleted, tombstones will be created and stored in more recent sstables. Upon compaction of sstables, and after gc_grace_period, I presume cassandra will have removed all traces of that row from disk. However, after deleting such a large amount of information, there is no guarantee that Cassandra will compact these two tables together, causing the data to be deleted (right?). Therefore, even after gc_grace_period, a large amount of space may still be used. Is there a way, other than a major compaction, to clean up all this old data? 
I assume a nodetool scrub will cleanup old tombstones only if that row is not in another sstable? Do tombstones take up bloomfilter space after gc_grace_period? -Mike On 1/2/2013 6:41 PM, aaron morton wrote: 1) As one can imagine, the index and bloom filter for this column family is large. Am I correct to assume that bloom filter and index space will not be reduced until after gc_grace_period? Yes. 2) If I would manually run repair across a cluster, is there a process I can use to safely remove these tombstones before gc_grace period to free this memory sooner? There is nothing to specifically purge tombstones. You can temporarily reduce the gc_grace_seconds and then trigger compaction. Either by reducing the min_compaction_threshold to 2 and doing a flush. Or by kicking of a user defined compaction using the JMX interface. 3) Any words of warning when undergoing this? Make sure you have a good breakfast. (It's more general advice than Cassandra specific.) Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 30/12/2012, at 8:51 AM, Mike wrote: Hello, We are undergoing a change to our internal datamodel that will result in the eventual deletion of over a hundred million rows from a Cassandra column family. From what I understand, this will result in the generation of tombstones, which
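In the meantime I'm planning to just watch the bloom filter footprint directly as the deletes flow through (cfstats field names as in 1.1; the path assumes the default data layout and our keyspace/CF names):

    nodetool cfstats | grep -A 20 "Column Family: msgs" | grep "Bloom Filter"
    ls -lh /var/lib/cassandra/data/open/msgs/*-Filter.db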
Cassandra 1.1.2 -> 1.1.8 upgrade
Hello, We are looking to upgrade our Cassandra cluster from 1.1.2 -> 1.1.8 (or possibly 1.1.9 depending on timing). It is my understanding that rolling upgrades of Cassandra are supported, so as we upgrade our cluster, we can do so one node at a time without experiencing downtime. Has anyone hit any gotchas recently that I should be aware of before performing this upgrade? In order to upgrade, are the JAR files the only thing that needs to change? Can everything else remain as-is? Thanks, -Mike
Re: Cassandra 1.1.2 -> 1.1.8 upgrade
Thanks for pointing that out. Given upgradesstables can only be run on a live node, does anyone know if there is a danger in having this node in the cluster while this is being performed? Also, can anyone confirm this only needs to be done on counter column families, or all column families (the former makes sense, I'm just making sure). -Mike On 1/16/2013 11:08 AM, Jason Wee wrote: Always check NEWS.txt. For instance, for Cassandra 1.1.3 you need to run nodetool upgradesstables if your CF has counters. On Wed, Jan 16, 2013 at 11:58 PM, Mike <mthero...@yahoo.com> wrote: Hello, We are looking to upgrade our Cassandra cluster from 1.1.2 -> 1.1.8 (or possibly 1.1.9 depending on timing). It is my understanding that rolling upgrades of Cassandra are supported, so as we upgrade our cluster, we can do so one node at a time without experiencing downtime. Has anyone hit any gotchas recently that I should be aware of before performing this upgrade? In order to upgrade, are the JAR files the only thing that needs to change? Can everything else remain as-is? Thanks, -Mike
Cassandra flush spin?
Hello, We just hit a very odd issue in our Cassandra cluster. We are running Cassandra 1.1.2 in a 6 node cluster. We use a replication factor of 3, and all operations utilize LOCAL_QUORUM consistency. We noticed a large performance hit in our application's maintenance activities and I've been investigating. I discovered a node in the cluster that was flushing a memtable like crazy. It was flushing every 2-3 minutes, and had apparently been doing this for days. Typically, during this time of day, a flush would happen every 30 minutes or so.
alldb.sh "cat /var/log/cassandra/system.log | grep \"flushing high-traffic column family CFS(Keyspace='open', ColumnFamily='msgs')\" | grep 02-08 | wc -l"
[1] 18:41:04 [SUCCESS] db-1c-1 59
[2] 18:41:05 [SUCCESS] db-1c-2 48
[3] 18:41:05 [SUCCESS] db-1a-1 1206
[4] 18:41:05 [SUCCESS] db-1d-2 54
[5] 18:41:05 [SUCCESS] db-1a-2 56
[6] 18:41:05 [SUCCESS] db-1d-1 52
I restarted the database node, and, at least for now, the problem appears to have stopped. There are a number of things that don't make sense here. We use a replication factor of 3, so if this was being caused by our application, I would have expected 3 nodes in the cluster to have issues. Also, I would have expected the issue to continue once the node restarted. Another point of interest, and I'm wondering if it's exposed a bug, is that this node was recently converted to use ephemeral storage on EC2, and was restored from a snapshot. After the restore, a nodetool repair was run. However, the repair was going to run into some heavy activity for our application, and we canceled that validation compaction (2 of the 3 anti-entropy sessions had completed). The spin appears to have started at the start of the second session. Any hints? -Mike
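A few of the other things I checked while the spin was happening, in case anyone wants to compare (the grep pattern matches the 1.1 log format; the keyspace/CF are ours):

    grep "flushing high-traffic column family" /var/log/cassandra/system.log | tail
    nodetool cfstats | grep -A 8 "Column Family: msgs"   # memtable ops/data at that moment
    nodetool compactionstats                             # backlog created by the constant flushes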
Re: Cassandra 1.1.2 -> 1.1.8 upgrade
Thank you, Another question on this topic. Upgrading from 1.1.2->1.1.9 requires running upgradesstables, which will take many hours on our dataset (about 12 hours). For this upgrade, is it recommended that I: 1) Upgrade all the DB nodes to 1.1.9 first, then go around the ring and run a staggered upgrade of the sstables over a number of days. 2) Upgrade one node at a time, running the cluster in a mixed 1.1.2->1.1.9 configuration for a number of days. I would prefer #1, as with #2, streaming will not work until all the nodes are upgraded. I appreciate your thoughts, -Mike On 1/16/2013 11:08 AM, Jason Wee wrote: Always check NEWS.txt. For instance, for Cassandra 1.1.3 you need to run nodetool upgradesstables if your CF has counters. On Wed, Jan 16, 2013 at 11:58 PM, Mike <mthero...@yahoo.com> wrote: Hello, We are looking to upgrade our Cassandra cluster from 1.1.2 -> 1.1.8 (or possibly 1.1.9 depending on timing). It is my understanding that rolling upgrades of Cassandra are supported, so as we upgrade our cluster, we can do so one node at a time without experiencing downtime. Has anyone hit any gotchas recently that I should be aware of before performing this upgrade? In order to upgrade, are the JAR files the only thing that needs to change? Can everything else remain as-is? Thanks, -Mike
Re: Cassandra 1.1.2 -> 1.1.8 upgrade
So upgradesstables is recommended as part of the upgrade to 1.1.3 if you are using counter columns. Also, there was a general recommendation (in another response to my question) to run upgradesstables because of: "upgradesstables always needs to be done between majors. While 1.1.2 -> 1.1.8 is not a major, due to an unforeseen bug in the conversion to microseconds you'll need to run upgradesstables." Is this referring to: https://issues.apache.org/jira/browse/CASSANDRA-4432 Does anyone know the impact of not running upgradesstables? Or possibly of not running it for several days? Thanks, -Mike On 2/10/2013 3:27 PM, aaron morton wrote: I would do #1. You can play with nodetool setcompactionthroughput to speed things up, but beware nothing comes for free. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 10/02/2013, at 6:40 AM, Mike <mthero...@yahoo.com> wrote: Thank you, Another question on this topic. Upgrading from 1.1.2->1.1.9 requires running upgradesstables, which will take many hours on our dataset (about 12 hours). For this upgrade, is it recommended that I: 1) Upgrade all the DB nodes to 1.1.9 first, then go around the ring and run a staggered upgrade of the sstables over a number of days. 2) Upgrade one node at a time, running the cluster in a mixed 1.1.2->1.1.9 configuration for a number of days. I would prefer #1, as with #2, streaming will not work until all the nodes are upgraded. I appreciate your thoughts, -Mike On 1/16/2013 11:08 AM, Jason Wee wrote: Always check NEWS.txt. For instance, for Cassandra 1.1.3 you need to run nodetool upgradesstables if your CF has counters. On Wed, Jan 16, 2013 at 11:58 PM, Mike <mthero...@yahoo.com> wrote: Hello, We are looking to upgrade our Cassandra cluster from 1.1.2 -> 1.1.8 (or possibly 1.1.9 depending on timing). It is my understanding that rolling upgrades of Cassandra are supported, so as we upgrade our cluster, we can do so one node at a time without experiencing downtime. Has anyone hit any gotchas recently that I should be aware of before performing this upgrade? In order to upgrade, are the JAR files the only thing that needs to change? Can everything else remain as-is? Thanks, -Mike
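For reference, the per-node sequence I'm planning for option #1 (the package/service names are placeholders for however you install Cassandra):

    nodetool drain                     # flush memtables and stop accepting writes
    sudo service cassandra stop
    # install the 1.1.9 jars/packages; cassandra.yaml stays as-is
    sudo service cassandra start
    nodetool version                   # confirm the node came back on 1.1.9
    # later, staggered around the ring one node at a time:
    nodetool upgradesstables open      # many hours per node on our data set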
Size Tiered -> Leveled Compaction
Hello, I'm investigating the transition of some of our column families from Size Tiered -> Leveled Compaction. I believe we have some high-read-load column families that would benefit tremendously. I've stood up a test DB node to investigate the transition. I successfully altered the column family, and I immediately noticed a large number (1000+) of pending compaction tasks appear, but no compactions get executed. I tried running "nodetool upgradesstables" on the column family, and the compaction tasks don't move. I also noticed no changes to the size and distribution of the existing SSTables. I then ran a major compaction on the column family. All pending compaction tasks got run, and the SSTables have a distribution that I would expect from LeveledCompaction (lots and lots of 10MB files). Couple of questions: 1) Is a major compaction required to transition from size-tiered to leveled compaction? 2) Are major compactions as much of a concern for LeveledCompaction as they are for Size Tiered? All the documentation I found concerning transitioning from Size Tiered to Leveled compaction discusses the ALTER TABLE CQL command, but I haven't found too much on what else needs to be done after the schema change. I did these tests with Cassandra 1.1.9. Thanks, -Mike
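For reference, the change amounts to something like the following (shown here in cassandra-cli syntax from memory rather than the CQL ALTER TABLE form; the CF name is a placeholder -- check `help update column family;` against 1.1.9):

    update column family msgs with compaction_strategy = 'LeveledCompactionStrategy'
        and compaction_strategy_options = {sstable_size_in_mb: 10};
    nodetool compactionstats   # this is where I watched the 1000+ pending tasks sit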
Unbalanced ring after upgrade!
Hello, We just upgraded from 1.1.2->1.1.9. We utilize the byte ordered partitioner (we generate our own hashes). We have not yet upgraded sstables. Before the upgrade, we had a balanced ring. After the upgrade, we see:
10.0.4.22 us-east 1a Up Normal 77.66 GB 0.04% Token(bytes[0001])
10.0.10.23 us-east 1d Up Normal 82.74 GB 0.04% Token(bytes[1555])
10.0.8.20 us-east 1c Up Normal 81.79 GB 0.04% Token(bytes[2aaa])
10.0.4.23 us-east 1a Up Normal 82.66 GB 33.84% Token(bytes[4000])
10.0.10.20 us-east 1d Up Normal 80.21 GB 67.51% Token(bytes[5554])
10.0.8.23 us-east 1c Up Normal 77.12 GB 99.89% Token(bytes[6aac])
10.0.4.21 us-east 1a Up Normal 81.38 GB 66.09% Token(bytes[8000])
10.0.10.24 us-east 1d Up Normal 83.43 GB 32.41% Token(bytes[9558])
10.0.8.21 us-east 1c Up Normal 84.42 GB 0.04% Token(bytes[aaa8])
10.0.4.25 us-east 1a Up Normal 80.06 GB 0.04% Token(bytes[c000])
10.0.10.21 us-east 1d Up Normal 83.57 GB 0.04% Token(bytes[d558])
10.0.8.24 us-east 1c Up Normal 90.74 GB 0.04% Token(bytes[eaa8])
Restarting a node essentially changes who owns 99% of the ring. Given we use an RF of 3, and LOCAL_QUORUM consistency for everything, and we are not seeing errors, something seems to be working correctly. Any idea what is going on above? Should I be alarmed? -Mike
Re: Unbalanced ring after upgrade!
Actually, doing a nodetool ring is always showing the current node as owning 99% of the ring. From db-1a-1:
Address DC Rack Status State Load Effective-Ownership Token
Token(bytes[eaa8])
10.0.4.22 us-east 1a Up Normal 77.72 GB 99.89% Token(bytes[0001])
10.0.10.23 us-east 1d Up Normal 82.74 GB 64.13% Token(bytes[1555])
10.0.8.20 us-east 1c Up Normal 81.79 GB 30.55% Token(bytes[2aaa])
10.0.4.23 us-east 1a Up Normal 82.66 GB 0.04% Token(bytes[4000])
10.0.10.20 us-east 1d Up Normal 80.21 GB 0.04% Token(bytes[5554])
10.0.8.23 us-east 1c Up Normal 77.07 GB 0.04% Token(bytes[6aac])
10.0.4.21 us-east 1a Up Normal 81.38 GB 0.04% Token(bytes[8000])
10.0.10.24 us-east 1d Up Normal 83.43 GB 0.04% Token(bytes[9558])
10.0.8.21 us-east 1c Up Normal 84.42 GB 0.04% Token(bytes[aaa8])
10.0.4.25 us-east 1a Up Normal 80.06 GB 0.04% Token(bytes[c000])
10.0.10.21 us-east 1d Up Normal 83.49 GB 35.80% Token(bytes[d558])
10.0.8.24 us-east 1c Up Normal 90.72 GB 69.37% Token(bytes[eaa8])
From db-1c-3:
Address DC Rack Status State Load Effective-Ownership Token
Token(bytes[eaa8])
10.0.4.22 us-east 1a Up Normal 77.72 GB 0.04% Token(bytes[0001])
10.0.10.23 us-east 1d Up Normal 82.78 GB 0.04% Token(bytes[1555])
10.0.8.20 us-east 1c Up Normal 81.79 GB 0.04% Token(bytes[2aaa])
10.0.4.23 us-east 1a Up Normal 82.66 GB 33.84% Token(bytes[4000])
10.0.10.20 us-east 1d Up Normal 80.21 GB 67.51% Token(bytes[5554])
10.0.8.23 us-east 1c Up Normal 77.07 GB 99.89% Token(bytes[6aac])
10.0.4.21 us-east 1a Up Normal 81.38 GB 66.09% Token(bytes[8000])
10.0.10.24 us-east 1d Up Normal 83.43 GB 32.41% Token(bytes[9558])
10.0.8.21 us-east 1c Up Normal 84.42 GB 0.04% Token(bytes[aaa8])
10.0.4.25 us-east 1a Up Normal 80.06 GB 0.04% Token(bytes[c000])
10.0.10.21 us-east 1d Up Normal 83.49 GB 0.04% Token(bytes[d558])
10.0.8.24 us-east 1c Up Normal 90.72 GB 0.04% Token(bytes[eaa8])
Any help would be appreciated, as if something is going drastically wrong we need to go back to backups and revert back to 1.1.2. Thanks, -Mike On 2/14/2013 8:32 AM, Mike wrote: Hello, We just upgraded from 1.1.2->1.1.9. We utilize the byte ordered partitioner (we generate our own hashes). We have not yet upgraded sstables. Before the upgrade, we had a balanced ring. After the upgrade, we see:
10.0.4.22 us-east 1a Up Normal 77.66 GB 0.04% Token(bytes[0001])
10.0.10.23 us-east 1d Up Normal 82.74 GB 0.04% Token(bytes[1555])
10.0.8.20 us-east 1c Up Normal 81.79 GB 0.04% Token(bytes[2aaa])
10.0.4.23 us-east 1a Up Normal 82.66 GB 33.84% Token(bytes[4000])
10.0.10.20 us-east 1d Up Normal 80.21 GB 67.51% Token(bytes[5554])
10.0.8.23 us-east 1c Up Normal 77.12 GB 99.89% Token(bytes[6aac])
10.0.4.21 us-east 1a Up Normal 81.38 GB 66.09% Token(bytes[8000])
10.0.10.24 us-east 1d Up Normal 83.43 GB 32.41% Token(bytes[9558])
10.0.8.21 us-east 1c Up Normal 84.42 GB 0.04% Token(bytes[aaa8])
10.0.4.25 us-e
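For completeness, these are the checks I'm running from each node while comparing views (all standard 1.1 nodetool commands; substitute your own host names):

    nodetool -h <node> version      # make sure every node really is on 1.1.9
    nodetool -h <node> ring         # the per-node view shown above
    nodetool -h <node> gossipinfo   # what gossip thinks each endpoint's state and tokens are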
Re: Deletion consistency
If you increase the number of nodes to 3, with an RF of 3, then you should be able to read/delete utilizing a quorum consistency level, which I believe will help here. Also, make sure the clocks on your servers are in sync, utilizing NTP, as drifting time between your client and server could cause updates to be mistakenly dropped for being old. Also, make sure you are running with a gc_grace period that is high enough. The default is 10 days. Hope this helps, -Mike On 2/15/2013 1:13 PM, Víctor Hugo Oliveira Molinar wrote: Hello everyone! I have a column family filled with event objects which need to be processed by query threads. Once each thread queries for those objects (spread among columns below a row), it performs a delete operation for each object in Cassandra. This is done in order to ensure that these events won't be processed again. Some tests have shown me that it works, but sometimes I'm not getting those events deleted. I checked it through cassandra-cli, etc. So, reading http://wiki.apache.org/cassandra/DistributedDeletes, I came to the conclusion that I may be reading old data. My cluster is currently configured as: 2 nodes, RF 1, CL 1. In that case, what should I do? - Increase the consistency level for the write operations (in this case, the deletions), in order to ensure that those deletions are stored on all nodes. or - Increase the consistency level for the read operations, in order to ensure that I'm not reading back events that were already processed (deleted)? Thanks in advance
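For reference, the arithmetic behind the quorum suggestion:

    RF = 3, QUORUM = floor(RF/2) + 1 = 2
    deletes at QUORUM (W = 2) + reads at QUORUM (R = 2)  =>  R + W = 4 > RF = 3
    so every quorum read overlaps at least one replica that has already applied the delete.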
Re: Size Tiered -> Leveled Compaction
Another piece of information that would be useful is advice on how to properly set the SSTable size for your usecase. I understand the default is 5MB, a lot of examples show the use of 10MB, and I've seen cases where people have set is as high as 200MB. Any information is appreciated, -Mike On 2/14/2013 4:10 PM, Michael Theroux wrote: BTW, when I say "major compaction", I mean running the "nodetool compact" command (which does a major compaction for Sized Tiered Compaction). I didn't see the distribution of SSTables I expected until I ran that command, in the steps I described below. -Mike On Feb 14, 2013, at 3:51 PM, Wei Zhu wrote: I haven't tried to switch compaction strategy. We started with LCS. For us, after massive data imports (5000 w/seconds for 6 days), the first repair is painful since there is quite some data inconsistency. For 150G nodes, repair brought in about 30 G and created thousands of pending compactions. It took almost a day to clear those. Just be prepared LCS is really slow in 1.1.X. System performance degrades during that time since reads could go to more SSTable, we see 20 SSTable lookup for one read.. (We tried everything we can and couldn't speed it up. I think it's single threaded and it's not recommended to turn on multithread compaction. We even tried that, it didn't help )There is parallel LCS in 1.2 which is supposed to alleviate the pain. Haven't upgraded yet, hope it works:) http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2 Since our cluster is not write intensive, only 100 w/seconds. I don't see any pending compactions during regular operation. One thing worth mentioning is the size of the SSTable, default is 5M which is kind of small for 200G (all in one CF) data set, and we are on SSD. It more than 150K files in one directory. (200G/5M = 40K SSTable and each SSTable creates 4 files on disk) You might want to watch that and decide the SSTable size. By the way, there is no concept of Major compaction for LCS. Just for fun, you can look at a file called $CFName.json in your data directory and it tells you the SSTable distribution among different levels. -Wei *From:* Charles Brophy mailto:cbro...@zulily.com>> *To:* user@cassandra.apache.org <mailto:user@cassandra.apache.org> *Sent:* Thursday, February 14, 2013 8:29 AM *Subject:* Re: Size Tiered -> Leveled Compaction I second these questions: we've been looking into changing some of our CFs to use leveled compaction as well. If anybody here has the wisdom to answer them it would be of wonderful help. Thanks Charles On Wed, Feb 13, 2013 at 7:50 AM, Mike <mailto:mthero...@yahoo.com>> wrote: Hello, I'm investigating the transition of some of our column families from Size Tiered -> Leveled Compaction. I believe we have some high-read-load column families that would benefit tremendously. I've stood up a test DB Node to investigate the transition. I successfully alter the column family, and I immediately noticed a large number (1000+) pending compaction tasks become available, but no compaction get executed. I tried running "nodetool sstableupgrade" on the column family, and the compaction tasks don't move. I also notice no changes to the size and distribution of the existing SSTables. I then run a major compaction on the column family. All pending compaction tasks get run, and the SSTables have a distribution that I would expect from LeveledCompaction (lots and lots of 10MB files). Couple of questions: 1) Is a major compaction required to transition from size-tiered to leveled compaction? 
2) Are major compactions as much of a concern for LeveledCompaction as their are for Size Tiered? All the documentation I found concerning transitioning from Size Tiered to Level compaction discuss the alter table cql command, but I haven't found too much on what else needs to be done after the schema change. I did these tests with Cassandra 1.1.9. Thanks, -Mike
Re: Size Tiered -> Leveled Compaction
Hello Wei, First thanks for this response. Out of curiosity, what SSTable size did you choose for your usecase, and what made you decide on that number? Thanks, -Mike On 2/14/2013 3:51 PM, Wei Zhu wrote: I haven't tried to switch compaction strategy. We started with LCS. For us, after massive data imports (5000 w/seconds for 6 days), the first repair is painful since there is quite some data inconsistency. For 150G nodes, repair brought in about 30 G and created thousands of pending compactions. It took almost a day to clear those. Just be prepared LCS is really slow in 1.1.X. System performance degrades during that time since reads could go to more SSTable, we see 20 SSTable lookup for one read.. (We tried everything we can and couldn't speed it up. I think it's single threaded and it's not recommended to turn on multithread compaction. We even tried that, it didn't help )There is parallel LCS in 1.2 which is supposed to alleviate the pain. Haven't upgraded yet, hope it works:) http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2 Since our cluster is not write intensive, only 100 w/seconds. I don't see any pending compactions during regular operation. One thing worth mentioning is the size of the SSTable, default is 5M which is kind of small for 200G (all in one CF) data set, and we are on SSD. It more than 150K files in one directory. (200G/5M = 40K SSTable and each SSTable creates 4 files on disk) You might want to watch that and decide the SSTable size. By the way, there is no concept of Major compaction for LCS. Just for fun, you can look at a file called $CFName.json in your data directory and it tells you the SSTable distribution among different levels. -Wei *From:* Charles Brophy *To:* user@cassandra.apache.org *Sent:* Thursday, February 14, 2013 8:29 AM *Subject:* Re: Size Tiered -> Leveled Compaction I second these questions: we've been looking into changing some of our CFs to use leveled compaction as well. If anybody here has the wisdom to answer them it would be of wonderful help. Thanks Charles On Wed, Feb 13, 2013 at 7:50 AM, Mike <mailto:mthero...@yahoo.com>> wrote: Hello, I'm investigating the transition of some of our column families from Size Tiered -> Leveled Compaction. I believe we have some high-read-load column families that would benefit tremendously. I've stood up a test DB Node to investigate the transition. I successfully alter the column family, and I immediately noticed a large number (1000+) pending compaction tasks become available, but no compaction get executed. I tried running "nodetool sstableupgrade" on the column family, and the compaction tasks don't move. I also notice no changes to the size and distribution of the existing SSTables. I then run a major compaction on the column family. All pending compaction tasks get run, and the SSTables have a distribution that I would expect from LeveledCompaction (lots and lots of 10MB files). Couple of questions: 1) Is a major compaction required to transition from size-tiered to leveled compaction? 2) Are major compactions as much of a concern for LeveledCompaction as their are for Size Tiered? All the documentation I found concerning transitioning from Size Tiered to Level compaction discuss the alter table cql command, but I haven't found too much on what else needs to be done after the schema change. I did these tests with Cassandra 1.1.9. Thanks, -Mike
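In case it's useful to others, this is how I've been peeking at the level distribution Wei mentioned (the path assumes the default 1.1 data layout and our keyspace/CF names; adjust for your install):

    cat /var/lib/cassandra/data/open/msgs/msgs.json | python -mjson.tool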
Re: Size Tiered -> Leveled Compaction
Hello, Still doing research before we potentially move one of our column families from Size Tiered->Leveled compaction this weekend. I was doing some research around some of the bugs that were filed against leveled compaction in Cassandra and I found this: https://issues.apache.org/jira/browse/CASSANDRA-4644 The bug mentions: "You need to run the offline scrub (bin/sstablescrub) to fix the sstable overlapping problem from early 1.1 releases. (Running with -m to just check for overlaps between sstables should be fine, since you already scrubbed online which will catch out-of-order within an sstable.)" We recently upgraded from 1.1.2 to 1.1.9. Does anyone know if an offline scrub is recommended to be performed when switching from STCS->LCS after upgrading from 1.1.2? Any insight would be appreciated, Thanks, -Mike On 2/17/2013 8:57 PM, Wei Zhu wrote: We doubled the SStable size to 10M. It still generates a lot of SSTable and we don't see much difference of the read latency. We are able to finish the compactions after repair within serveral hours. We will increase the SSTable size again if we feel the number of SSTable hurts the performance. - Original Message - From: "Mike" To: user@cassandra.apache.org Sent: Sunday, February 17, 2013 4:50:40 AM Subject: Re: Size Tiered -> Leveled Compaction Hello Wei, First thanks for this response. Out of curiosity, what SSTable size did you choose for your usecase, and what made you decide on that number? Thanks, -Mike On 2/14/2013 3:51 PM, Wei Zhu wrote: I haven't tried to switch compaction strategy. We started with LCS. For us, after massive data imports (5000 w/seconds for 6 days), the first repair is painful since there is quite some data inconsistency. For 150G nodes, repair brought in about 30 G and created thousands of pending compactions. It took almost a day to clear those. Just be prepared LCS is really slow in 1.1.X. System performance degrades during that time since reads could go to more SSTable, we see 20 SSTable lookup for one read.. (We tried everything we can and couldn't speed it up. I think it's single threaded and it's not recommended to turn on multithread compaction. We even tried that, it didn't help )There is parallel LCS in 1.2 which is supposed to alleviate the pain. Haven't upgraded yet, hope it works:) http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2 Since our cluster is not write intensive, only 100 w/seconds. I don't see any pending compactions during regular operation. One thing worth mentioning is the size of the SSTable, default is 5M which is kind of small for 200G (all in one CF) data set, and we are on SSD. It more than 150K files in one directory. (200G/5M = 40K SSTable and each SSTable creates 4 files on disk) You might want to watch that and decide the SSTable size. By the way, there is no concept of Major compaction for LCS. Just for fun, you can look at a file called $CFName.json in your data directory and it tells you the SSTable distribution among different levels. -Wei From: Charles Brophy To: user@cassandra.apache.org Sent: Thursday, February 14, 2013 8:29 AM Subject: Re: Size Tiered -> Leveled Compaction I second these questions: we've been looking into changing some of our CFs to use leveled compaction as well. If anybody here has the wisdom to answer them it would be of wonderful help. Thanks Charles On Wed, Feb 13, 2013 at 7:50 AM, Mike < mthero...@yahoo.com > wrote: Hello, I'm investigating the transition of some of our column families from Size Tiered -> Leveled Compaction. 
I believe we have some high-read-load column families that would benefit tremendously. I've stood up a test DB node to investigate the transition. I successfully altered the column family, and I immediately noticed a large number (1000+) of pending compaction tasks appear, but no compactions get executed. I tried running "nodetool upgradesstables" on the column family, and the compaction tasks don't move. I also noticed no changes to the size and distribution of the existing SSTables. I then ran a major compaction on the column family. All pending compaction tasks get run, and the SSTables have a distribution that I would expect from LeveledCompaction (lots and lots of 10MB files). A couple of questions: 1) Is a major compaction required to transition from size-tiered to leveled compaction? 2) Are major compactions as much of a concern for LeveledCompaction as they are for Size Tiered? All the documentation I found concerning transitioning from Size Tiered to Leveled compaction discusses the ALTER TABLE CQL command, but I haven't found too much on what else needs to be done after the schema change. I did these tests with Cassandra 1.1.9. Thanks, -Mike
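For reference, a minimal sketch of the change under discussion; the keyspace/CF names are placeholders. The overlap check from CASSANDRA-4644 is the offline tool quoted above, run with the node stopped (`bin/sstablescrub -m MyKeyspace MyCF`). The ALTER statement below uses the CQL3 map form from 1.2+; on 1.1.x the equivalent is expressed through the older compaction_strategy_class / compaction_strategy_options settings.

```
-- switch the CF to leveled compaction with a 10MB sstable target
-- (keyspace/table names are placeholders)
ALTER TABLE MyKeyspace.MyCF WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'sstable_size_in_mb': '10'
};
```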
data type is object when metric instrument using Gauge?
Dear All, We are trying to monitor Cassandra using JMX. The monitoring tool we are using works fine for meters. However, if the metrics are collected using a gauge, the data type is Object, so our tool treats it as a string instead of a double. For example, for org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Capacity the type of the attribute (Value) is java.lang.Object. Is it possible to implement the data type of gauges as a numeric type instead of Object, or to work around it some other way, for example using a metrics reporter, etc.? Thanks a lot for any suggestions! Best Regards, Mike
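For what it's worth, a client can coerce the gauge value itself even though the attribute is declared as java.lang.Object. A minimal, hedged Java sketch follows; the host, port, and the assumption that this particular gauge's runtime value is numeric are mine, not taken from the thread.

```
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class KeyCacheCapacityProbe {
    public static void main(String[] args) throws Exception {
        // 7199 is Cassandra's usual JMX port; adjust host/port for your cluster.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName gauge = new ObjectName(
                    "org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Capacity");
            // The attribute is declared as java.lang.Object; assuming the runtime
            // value is numeric, cast to Number and read it as a double.
            Object raw = mbs.getAttribute(gauge, "Value");
            double capacity = ((Number) raw).doubleValue();
            System.out.println("KeyCache capacity: " + capacity);
        } finally {
            connector.close();
        }
    }
}
```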
Re: Issue with leveled compaction and data migration
Thanks for the response Rob, And yes, the relevel helped the bloom filter issue quite a bit, although it took a couple of days for the relevel to complete on a single node (so if anyone tries this, be prepared). -Mike Sent from my iPhone On Sep 23, 2013, at 6:34 PM, Robert Coli wrote: > On Fri, Sep 13, 2013 at 4:27 AM, Michael Theroux wrote: >> Another question on [the topic of row fragmentation when old rows get a >> large append to their "end" resulting in larger-than-expected bloom filters]. >> >> Would forcing the table to relevel help this situation? I believe the >> process to do this on 1.1.X would be to stop cassandra, remove the .json file, >> and restart cassandra. Is this true? > > I believe forcing a re-level would help, because each row would appear in > fewer sstables and therefore fewer bloom filters. > > Yes, that is the process to re-level on Cassandra 1.1.x. > > =Rob
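For anyone finding this in the archives, a rough sketch of the 1.1.x re-level procedure described above; paths and keyspace/CF names are placeholders, and taking a snapshot first is assumed.

```
# drain and stop the node
nodetool drain
sudo service cassandra stop

# remove the LCS manifest for the column family; on restart Cassandra treats
# the sstables as level 0 and re-levels them (this can take days, as noted above)
rm /var/lib/cassandra/data/MyKeyspace/MyCF/MyCF.json

sudo service cassandra start
```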
high latency on one node after replacement
Hi There - I have noticed an issue where I consistently see high p999 read latency on a node for a few hours after replacing the node. Before replacing the node, the p999 read latency is ~30ms, but after it increases to 1-5s. I am running C* 3.11.2 in EC2. I am testing out using EBS snapshots of the /data disk as a backup, so that I can replace nodes without having to fully bootstrap the replacement. This seems to work ok, except for the latency issue. Some things I have noticed: - `nodetool netstats` doesn't show any 'Completed' Large Messages, only 'Dropped', while this is going on. There are only a few of these. - the logs show warnings like this: WARN [PERIODIC-COMMIT-LOG-SYNCER] 2018-03-27 18:57:15,655 NoSpamLogger.java:94 - Out of 84 commit log syncs over the past 297.28s with average duration of 235.88ms, 86 have exceeded the configured commit interval by an average of 113.66ms and I can see some slow queries in debug.log, but I can't figure out what is causing it - gc seems normal Could this have something to do with starting the node with the EBS snapshot of the /data directory? My first thought was that this is related to the EBS volumes, but it seems too consistent to be actually caused by that. The problem is consistent across multiple replacements, and multiple EC2 regions. I appreciate any suggestions! - Mike
Re: high latency on one node after replacement
thanks for pointing that out, i just found it too :) i overlooked this On Tue, Mar 27, 2018 at 3:44 PM, Voytek Jarnot wrote: > Have you ruled out EBS snapshot initialization issues ( > https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-initialize.html)? > > On Tue, Mar 27, 2018 at 2:24 PM, Mike Torra wrote: > >> Hi There - >> >> I have noticed an issue where I consistently see high p999 read latency >> on a node for a few hours after replacing the node. Before replacing the >> node, the p999 read latency is ~30ms, but after it increases to 1-5s. I am >> running C* 3.11.2 in EC2. >> >> I am testing out using EBS snapshots of the /data disk as a backup, so >> that I can replace nodes without having to fully bootstrap the replacement. >> This seems to work ok, except for the latency issue. Some things I have >> noticed: >> >> - `nodetool netstats` doesn't show any 'Completed' Large Messages, only >> 'Dropped', while this is going on. There are only a few of these. >> - the logs show warnings like this: >> >> WARN [PERIODIC-COMMIT-LOG-SYNCER] 2018-03-27 18:57:15,655 >> NoSpamLogger.java:94 - Out of 84 commit log syncs over the past 297.28s >> with average duration of 235.88ms, 86 have exceeded the configured commit >> interval by an average of 113.66ms >> and I can see some slow queries in debug.log, but I can't figure out >> what is causing it >> - gc seems normal >> >> Could this have something to do with starting the node with the EBS >> snapshot of the /data directory? My first thought was that this is related >> to the EBS volumes, but it seems too consistent to be actually caused by >> that. The problem is consistent across multiple replacements, and multiple >> EC2 regions. >> >> I appreciate any suggestions! >> >> - Mike >> > >
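For reference, the initialization step from that AWS page can be scripted before starting Cassandra on the restored volume; a hedged sketch, with the device name as a placeholder.

```
# read every block of the snapshot-restored EBS volume once, so the
# first-touch penalty is paid up front rather than during live reads
sudo fio --filename=/dev/xvdf --rw=read --bs=1M --iodepth=32 \
    --ioengine=libaio --direct=1 --name=volume-initialize
```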
nodejs client can't connect to two nodes with different private ip addresses in different dcs
Hi Guys - I recently ran into a problem (for the 2nd time) where my nodejs app for some reason refuses to connect to one node in my C* cluster. I noticed that in both cases, the node that was not receiving any client connections had the same private ip as another node in the cluster, but in a different datacenter. That prompted me to poke around the client code a bit, and I think I found the problem: https://github.com/datastax/nodejs-driver/blob/master/lib/control-connection.js#L647 Since `endpoint` is the `rpc_address` of the node, if I'm reading this right, the client will silently ignore other nodes that happen to have the same private ip. The first time I had this problem, I simply removed the node from the cluster and added a new one, with a different private ip. Now that I suspect I have found the problem, I'm wondering if there is a simpler solution. I realize this is specific to the nodejs client, but I thought I'd see if anyone else here has run into this. It would be great if I could get the nodejs client to ignore nodes in the remote data centers. I've already tried adding this to the client config, but it doesn't resolve the problem: ``` pooling: { coreConnectionsPerHost: { [distance.local]: 2, [distance.remote]: 0 } } ``` Any suggestions? - Mike
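One possible server-side workaround, rather than patching the driver: advertise a unique client-facing address per node so that peer entries keyed on rpc_address no longer collide across DCs. A hedged cassandra.yaml sketch (the addresses are placeholders, and whether a public or elastic IP is reachable from your clients is an assumption):

```
# private address used for intra-cluster traffic (may be duplicated across DCs)
listen_address: 172.31.0.10
# bind client ports on all interfaces...
rpc_address: 0.0.0.0
# ...but advertise a globally unique address to drivers; this setting is
# required whenever rpc_address is 0.0.0.0
broadcast_rpc_address: 203.0.113.10
```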
TWCS sstables not dropping even though all data is expired
Hello - I have a 48 node C* cluster spread across 4 AWS regions with RF=3. A few months ago I started noticing disk usage on some nodes increasing consistently. At first I solved the problem by destroying the nodes and rebuilding them, but the problem returns. I did some more investigation recently, and this is what I found: - I narrowed the problem down to a CF that uses TWCS, by simply looking at disk space usage - in each region, 3 nodes have this problem of growing disk space (matches replication factor) - on each node, I tracked down the problem to a particular SSTable using `sstableexpiredblockers` - in the SSTable, using `sstabledump`, I found a row that does not have a ttl like the other rows, and appears to be from someone else on the team testing something and forgetting to include a ttl - all other rows show "expired: true" except this one, hence my suspicion - when I query for that particular partition key, I get no results - I tried deleting the row anyways, but that didn't seem to change anything - I also tried `nodetool scrub`, but that didn't help either Would this rogue row without a ttl explain the problem? If so, why? If not, does anyone have any other ideas? Why does the row show in `sstabledump` but not when I query for it? I appreciate any help or suggestions! - Mike
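For anyone who wants to retrace the investigation, a hedged sketch of the tools mentioned above; keyspace/table names and the sstable path are placeholders.

```
# find which sstable(s) are blocking fully expired sstables from being dropped
sstableexpiredblockers my_ks my_cf

# dump the suspect sstable and look for rows without a ttl / "expired" marker
sstabledump /var/lib/cassandra/data/my_ks/my_cf-<table_id>/mc-1234-big-Data.db > dump.json
```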
Re: TWCS sstables not dropping even though all data is expired
I'm pretty stumped by this, so here is some more detail if it helps. Here is what the suspicious partition looks like in the `sstabledump` output (some pii etc redacted): ``` { "partition" : { "key" : [ "some_user_id_value", "user_id", "demo-test" ], "position" : 210 }, "rows" : [ { "type" : "row", "position" : 1132, "clustering" : [ "2019-01-22 15:27:45.000Z" ], "liveness_info" : { "tstamp" : "2019-01-22T15:31:12.415081Z" }, "cells" : [ { "some": "data" } ] } ] } ``` And here is what every other partition looks like: ``` { "partition" : { "key" : [ "some_other_user_id", "user_id", "some_site_id" ], "position" : 1133 }, "rows" : [ { "type" : "row", "position" : 1234, "clustering" : [ "2019-01-22 17:59:35.547Z" ], "liveness_info" : { "tstamp" : "2019-01-22T17:59:35.708Z", "ttl" : 86400, "expires_at" : "2019-01-23T17:59:35Z", "expired" : true }, "cells" : [ { "name" : "activity_data", "deletion_info" : { "local_delete_time" : "2019-01-22T17:59:35Z" } } ] } ] } ``` As expected, almost all of the data except this one suspicious partition has a ttl and is already expired. But if a partition isn't expired and I see it in the sstable, why wouldn't I see it executing a CQL query against the CF? Why would this sstable be preventing so many other sstable's from getting cleaned up? On Tue, Apr 30, 2019 at 12:34 PM Mike Torra wrote: > Hello - > > I have a 48 node C* cluster spread across 4 AWS regions with RF=3. A few > months ago I started noticing disk usage on some nodes increasing > consistently. At first I solved the problem by destroying the nodes and > rebuilding them, but the problem returns. > > I did some more investigation recently, and this is what I found: > - I narrowed the problem down to a CF that uses TWCS, by simply looking at > disk space usage > - in each region, 3 nodes have this problem of growing disk space (matches > replication factor) > - on each node, I tracked down the problem to a particular SSTable using > `sstableexpiredblockers` > - in the SSTable, using `sstabledump`, I found a row that does not have a > ttl like the other rows, and appears to be from someone else on the team > testing something and forgetting to include a ttl > - all other rows show "expired: true" except this one, hence my suspicion > - when I query for that particular partition key, I get no results > - I tried deleting the row anyways, but that didn't seem to change anything > - I also tried `nodetool scrub`, but that didn't help either > > Would this rogue row without a ttl explain the problem? If so, why? If > not, does anyone have any other ideas? Why does the row show in > `sstabledump` but not when I query for it? > > I appreciate any help or suggestions! > > - Mike >
Re: TWCS sstables not dropping even though all data is expired
This does indeed seem to be a problem of overlapping sstables, but I don't understand why the data (and number of sstables) just continues to grow indefinitely. I also don't understand why this problem is only appearing on some nodes. Is it just a coincidence that the one rogue test row without a ttl is at the 'root' sstable causing the problem (ie, from the output of `sstableexpiredblockers`)? Running a full compaction via `nodetool compact` reclaims the disk space, but I'd like to figure out why this happened and prevent it. Understanding why this problem would be isolated the way it is (ie only one CF even though I have a few others that share a very similar schema, and only some nodes) seems like it will help me prevent it. On Thu, May 2, 2019 at 1:00 PM Paul Chandler wrote: > Hi Mike, > > It sounds like that record may have been deleted, if that is the case then > it would still be shown in this sstable, but the deleted tombstone record > would be in a later sstable. You can use nodetool getsstables to work out > which sstables contain the data. > > I recommend reading The Last Pickle post on this: > http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html the sections > towards the bottom of this post may well explain why the sstable is not > being deleted. > > Thanks > > Paul > www.redshots.com > > On 2 May 2019, at 16:08, Mike Torra wrote: > > I'm pretty stumped by this, so here is some more detail if it helps. > > Here is what the suspicious partition looks like in the `sstabledump` > output (some pii etc redacted): > ``` > { > "partition" : { > "key" : [ "some_user_id_value", "user_id", "demo-test" ], > "position" : 210 > }, > "rows" : [ > { > "type" : "row", > "position" : 1132, > "clustering" : [ "2019-01-22 15:27:45.000Z" ], > "liveness_info" : { "tstamp" : "2019-01-22T15:31:12.415081Z" }, > "cells" : [ > { "some": "data" } > ] > } > ] > } > ``` > > And here is what every other partition looks like: > ``` > { > "partition" : { > "key" : [ "some_other_user_id", "user_id", "some_site_id" ], > "position" : 1133 > }, > "rows" : [ > { > "type" : "row", > "position" : 1234, > "clustering" : [ "2019-01-22 17:59:35.547Z" ], > "liveness_info" : { "tstamp" : "2019-01-22T17:59:35.708Z", "ttl" : > 86400, "expires_at" : "2019-01-23T17:59:35Z", "expired" : true }, > "cells" : [ > { "name" : "activity_data", "deletion_info" : { > "local_delete_time" : "2019-01-22T17:59:35Z" } > } > ] > } > ] > } > ``` > > As expected, almost all of the data except this one suspicious partition > has a ttl and is already expired. But if a partition isn't expired and I > see it in the sstable, why wouldn't I see it executing a CQL query against > the CF? Why would this sstable be preventing so many other sstable's from > getting cleaned up? > > On Tue, Apr 30, 2019 at 12:34 PM Mike Torra wrote: > >> Hello - >> >> I have a 48 node C* cluster spread across 4 AWS regions with RF=3. A few >> months ago I started noticing disk usage on some nodes increasing >> consistently. At first I solved the problem by destroying the nodes and >> rebuilding them, but the problem returns. 
>> >> I did some more investigation recently, and this is what I found: >> - I narrowed the problem down to a CF that uses TWCS, by simply looking >> at disk space usage >> - in each region, 3 nodes have this problem of growing disk space >> (matches replication factor) >> - on each node, I tracked down the problem to a particular SSTable using >> `sstableexpiredblockers` >> - in the SSTable, using `sstabledump`, I found a row that does not have a >> ttl like the other rows, and appears to be from someone else on the team >> testing something and forgetting to include a ttl >> - all other rows show "expired: true" except this one, hence my suspicion >> - when I query for that particular partition key, I get no results >> - I tried deleting the row anyways, but that didn't seem to change >> anything >> - I also tried `nodetool scrub`, but that didn't help either >> >> Would this rogue row without a ttl explain the problem? If so, why? If >> not, does anyone have any other ideas? Why does the row show in >> `sstabledump` but not when I query for it? >> >> I appreciate any help or suggestions! >> >> - Mike >> > >
Re: TWCS sstables not dropping even though all data is expired
Thx for the help Paul - there are definitely some details here I still don't fully understand, but this helped me resolve the problem and know what to look for in the future :) On Fri, May 3, 2019 at 12:44 PM Paul Chandler wrote: > Hi Mike, > > For TWCS the sstable can only be deleted when all the data has expired in > that sstable, but you had a record without a ttl in it, so that sstable > could never be deleted. > > That bit is straight forward, the next bit I remember reading somewhere > but can’t find it at the moment to confirm my thinking. > > An sstable can only be deleted if it is the earliest sstable. I think this > is due to the fact that deleting later sstables may expose old versions of > the data stored in the stuck sstable which had been superseded. For > example, if there was a tombstone in a later sstable for the non TTLed > record causing the problem in this instance. Then deleting that sstable > would cause that deleted data to reappear. (Someone please correct me if I > have this wrong) > > Because sstables in different time buckets are never compacted together, > this problem only goes away when you did the major compaction. > > This would happen on all replicas of the data, hence the reason you this > problem on 3 nodes. > > Thanks > > Paul > www.redshots.com > > On 3 May 2019, at 15:35, Mike Torra wrote: > > This does indeed seem to be a problem of overlapping sstables, but I don't > understand why the data (and number of sstables) just continues to grow > indefinitely. I also don't understand why this problem is only appearing on > some nodes. Is it just a coincidence that the one rogue test row without a > ttl is at the 'root' sstable causing the problem (ie, from the output of > `sstableexpiredblockers`)? > > Running a full compaction via `nodetool compact` reclaims the disk space, > but I'd like to figure out why this happened and prevent it. Understanding > why this problem would be isolated the way it is (ie only one CF even > though I have a few others that share a very similar schema, and only some > nodes) seems like it will help me prevent it. > > > On Thu, May 2, 2019 at 1:00 PM Paul Chandler wrote: > >> Hi Mike, >> >> It sounds like that record may have been deleted, if that is the case >> then it would still be shown in this sstable, but the deleted tombstone >> record would be in a later sstable. You can use nodetool getsstables to >> work out which sstables contain the data. >> >> I recommend reading The Last Pickle post on this: >> http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html the sections >> towards the bottom of this post may well explain why the sstable is not >> being deleted. >> >> Thanks >> >> Paul >> www.redshots.com >> >> On 2 May 2019, at 16:08, Mike Torra >> wrote: >> >> I'm pretty stumped by this, so here is some more detail if it helps. 
>> >> Here is what the suspicious partition looks like in the `sstabledump` >> output (some pii etc redacted): >> ``` >> { >> "partition" : { >> "key" : [ "some_user_id_value", "user_id", "demo-test" ], >> "position" : 210 >> }, >> "rows" : [ >> { >> "type" : "row", >> "position" : 1132, >> "clustering" : [ "2019-01-22 15:27:45.000Z" ], >> "liveness_info" : { "tstamp" : "2019-01-22T15:31:12.415081Z" }, >> "cells" : [ >> { "some": "data" } >> ] >> } >> ] >> } >> ``` >> >> And here is what every other partition looks like: >> ``` >> { >> "partition" : { >> "key" : [ "some_other_user_id", "user_id", "some_site_id" ], >> "position" : 1133 >> }, >> "rows" : [ >> { >> "type" : "row", >> "position" : 1234, >> "clustering" : [ "2019-01-22 17:59:35.547Z" ], >> "liveness_info" : { "tstamp" : "2019-01-22T17:59:35.708Z", "ttl" >> : 86400, "expires_at" : "2019-01-23T17:59:35Z", "expired" : true }, >> "cells" : [ >> { "name" : "activity_data", "deletion_info" : { >> "local_delete_time" : "2019-01
Re: TWCS sstables not dropping even though all data is expired
Compaction settings: ``` compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_size': '6', 'compaction_window_unit': 'HOURS', 'max_threshold': '32', 'min_threshold': '4'} ``` read_repair_chance is 0, and I don't do any repairs because (normally) everything has a ttl. It does seem like Jeff is right that a manual insert/update without a ttl is what caused this, so I know how to resolve it and prevent it from happening again. Thx again for all the help guys, I appreciate it! On Fri, May 3, 2019 at 11:21 PM Jeff Jirsa wrote: > Repairs work fine with TWCS, but having a non-expiring row will prevent > tombstones in newer sstables from being purged > > I suspect someone did a manual insert/update without a ttl and that > effectively blocks all other expiring cells from being purged. > > -- > Jeff Jirsa > > > On May 3, 2019, at 7:57 PM, Nick Hatfield > wrote: > > Hi Mike, > > > > If you will, share your compaction settings. More than likely, your issue > is from 1 of 2 reasons: > 1. You have read repair chance set to anything other than 0 > > 2. You’re running repairs on the TWCS CF > > > > Or both…. > > > > *From:* Mike Torra [mailto:mto...@salesforce.com.INVALID > ] > *Sent:* Friday, May 03, 2019 3:00 PM > *To:* user@cassandra.apache.org > *Subject:* Re: TWCS sstables not dropping even though all data is expired > > > > Thx for the help Paul - there are definitely some details here I still > don't fully understand, but this helped me resolve the problem and know > what to look for in the future :) > > > > On Fri, May 3, 2019 at 12:44 PM Paul Chandler wrote: > > Hi Mike, > > > > For TWCS the sstable can only be deleted when all the data has expired in > that sstable, but you had a record without a ttl in it, so that sstable > could never be deleted. > > > > That bit is straight forward, the next bit I remember reading somewhere > but can’t find it at the moment to confirm my thinking. > > > > An sstable can only be deleted if it is the earliest sstable. I think this > is due to the fact that deleting later sstables may expose old versions of > the data stored in the stuck sstable which had been superseded. For > example, if there was a tombstone in a later sstable for the non TTLed > record causing the problem in this instance. Then deleting that sstable > would cause that deleted data to reappear. (Someone please correct me if I > have this wrong) > > > > Because sstables in different time buckets are never compacted together, > this problem only goes away when you did the major compaction. > > > > This would happen on all replicas of the data, hence the reason you this > problem on 3 nodes. > > > > Thanks > > > > Paul > > www.redshots.com > > > > On 3 May 2019, at 15:35, Mike Torra wrote: > > > > This does indeed seem to be a problem of overlapping sstables, but I don't > understand why the data (and number of sstables) just continues to grow > indefinitely. I also don't understand why this problem is only appearing on > some nodes. Is it just a coincidence that the one rogue test row without a > ttl is at the 'root' sstable causing the problem (ie, from the output of > `sstableexpiredblockers`)? > > > > Running a full compaction via `nodetool compact` reclaims the disk space, > but I'd like to figure out why this happened and prevent it. 
Understanding > why this problem would be isolated the way it is (ie only one CF even > though I have a few others that share a very similar schema, and only some > nodes) seems like it will help me prevent it. > > > > > > On Thu, May 2, 2019 at 1:00 PM Paul Chandler wrote: > > Hi Mike, > > > > It sounds like that record may have been deleted, if that is the case then > it would still be shown in this sstable, but the deleted tombstone record > would be in a later sstable. You can use nodetool getsstables to work out > which sstables contain the data. > > > > I recommend reading The Last Pickle post on this: > http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html the sections > towards the bottom of this post may well explain why the sstable is not > being deleted. > > > > Thanks > > > > Paul > > www.redshots.com > > > > On 2 May 2019, at 16:08, Mike Torra wrote: > > > > I'm pretty stumped by this, so here is some more detail if it helps. > > > > Here is what the suspicious partition looks like in the `sstabledump` > ou
Re: TWCS sstables not dropping even though all data is expired
Thx for the tips Jeff, I'm definitely going to start using table level TTLs (not sure why I didn't before), and I'll take a look at the tombstone compaction subproperties On Mon, May 6, 2019 at 10:43 AM Jeff Jirsa wrote: > Fwiw if you enable the tombstone compaction subproperties, you’ll compact > away most of the other data in those old sstables (but not the partition > that’s been manually updated) > > Also table level TTLs help catch this type of manual manipulation - > consider adding it if appropriate. > > -- > Jeff Jirsa > > > On May 6, 2019, at 7:29 AM, Mike Torra > wrote: > > Compaction settings: > ``` > compaction = {'class': > 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', > 'compaction_window_size': '6', 'compaction_window_unit': 'HOURS', > 'max_threshold': '32', 'min_threshold': '4'} > ``` > read_repair_chance is 0, and I don't do any repairs because (normally) > everything has a ttl. It does seem like Jeff is right that a manual > insert/update without a ttl is what caused this, so I know how to resolve > it and prevent it from happening again. > > Thx again for all the help guys, I appreciate it! > > > On Fri, May 3, 2019 at 11:21 PM Jeff Jirsa wrote: > >> Repairs work fine with TWCS, but having a non-expiring row will prevent >> tombstones in newer sstables from being purged >> >> I suspect someone did a manual insert/update without a ttl and that >> effectively blocks all other expiring cells from being purged. >> >> -- >> Jeff Jirsa >> >> >> On May 3, 2019, at 7:57 PM, Nick Hatfield >> wrote: >> >> Hi Mike, >> >> >> >> If you will, share your compaction settings. More than likely, your issue >> is from 1 of 2 reasons: >> 1. You have read repair chance set to anything other than 0 >> >> 2. You’re running repairs on the TWCS CF >> >> >> >> Or both…. >> >> >> >> *From:* Mike Torra [mailto:mto...@salesforce.com.INVALID >> ] >> *Sent:* Friday, May 03, 2019 3:00 PM >> *To:* user@cassandra.apache.org >> *Subject:* Re: TWCS sstables not dropping even though all data is expired >> >> >> >> Thx for the help Paul - there are definitely some details here I still >> don't fully understand, but this helped me resolve the problem and know >> what to look for in the future :) >> >> >> >> On Fri, May 3, 2019 at 12:44 PM Paul Chandler wrote: >> >> Hi Mike, >> >> >> >> For TWCS the sstable can only be deleted when all the data has expired in >> that sstable, but you had a record without a ttl in it, so that sstable >> could never be deleted. >> >> >> >> That bit is straight forward, the next bit I remember reading somewhere >> but can’t find it at the moment to confirm my thinking. >> >> >> >> An sstable can only be deleted if it is the earliest sstable. I think >> this is due to the fact that deleting later sstables may expose old >> versions of the data stored in the stuck sstable which had been superseded. >> For example, if there was a tombstone in a later sstable for the non TTLed >> record causing the problem in this instance. Then deleting that sstable >> would cause that deleted data to reappear. (Someone please correct me if I >> have this wrong) >> >> >> >> Because sstables in different time buckets are never compacted together, >> this problem only goes away when you did the major compaction. >> >> >> >> This would happen on all replicas of the data, hence the reason you this >> problem on 3 nodes. 
>> >> >> >> Thanks >> >> >> >> Paul >> >> www.redshots.com >> >> >> >> On 3 May 2019, at 15:35, Mike Torra >> wrote: >> >> >> >> This does indeed seem to be a problem of overlapping sstables, but I >> don't understand why the data (and number of sstables) just continues to >> grow indefinitely. I also don't understand why this problem is only >> appearing on some nodes. Is it just a coincidence that the one rogue test >> row without a ttl is at the 'root' sstable causing the problem (ie, from >> the output of `sstableexpiredblockers`)? >> >> >> >> Running a full compaction via `nodetool compact` reclaims the disk space, >> but I'd like to figure out why this happened and prevent it. Understanding >> why
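For reference, a minimal CQL sketch of the two mitigations Jeff mentions (a table-level default TTL plus the tombstone compaction subproperties); the keyspace/table names and threshold values are illustrative, not taken from this thread.

```
ALTER TABLE my_ks.my_cf WITH default_time_to_live = 86400;

ALTER TABLE my_ks.my_cf WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_size': '6',
  'compaction_window_unit': 'HOURS',
  'unchecked_tombstone_compaction': 'true',
  'tombstone_threshold': '0.2'
};
```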
Recovery for deleted SSTables files for one column family.
Hi all, I would like to know: is there any way to rebuild a particular column family when all the SSTable files for that column family are missing? Say we do not have any backup of it. Thank you. Regards, Mike Yeap
Re: Recovery for deleted SSTables files for one column family.
Hi Ben, the scenario that I was trying to test was all sstables deleted from one node. So I did what you suggested (rebuild the sstables from the other replicas in the cluster) and it rebuilt the sstables successfully. I think the reason I didn't see the sstables rebuilt earlier on was because I didn't use the -full option of "nodetool repair". Thanks! Regards, Mike Yeap On Thu, May 19, 2016 at 4:03 PM, Ben Slater wrote: > Use nodetool listsnapshots to check if you have a snapshot - in default > configuration, Cassandra takes snapshots for operations like truncate. > > Failing that, is it all sstables from all nodes? In this case, your data > has gone I'm afraid. If it's just all sstables from one node then running > repair will rebuild the sstables from the other replicas in the cluster. > > Cheers > Ben > > On Thu, 19 May 2016 at 17:57 Mike Yeap wrote: > >> Hi all, I would like to know: is there any way to rebuild a particular >> column family when all the SSTable files for that column family are >> missing? Say we do not have any backup of it. >> >> Thank you. >> >> Regards, >> Mike Yeap >> > -- > > Ben Slater > Chief Product Officer, Instaclustr > +61 437 929 798 >
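For the archives, the recovery path that worked here, as a hedged sketch (keyspace/table names are placeholders):

```
# check whether an auto-snapshot of the column family still exists
nodetool listsnapshots

# with the sstables missing from this node only, stream the data back from
# the other replicas using a full (non-incremental) repair
nodetool repair -full my_keyspace my_cf
```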
Cassandra and Kubernetes and scaling
I saw a thread from April 2016 talking about Cassandra and Kubernetes, and have a few follow up questions. It seems that, especially after v1.2 of Kubernetes and the upcoming 1.3 features, this would be a very viable platform to run Cassandra on. My questions pertain to HostIds and scaling up/down, and are related: 1. If a container's host dies and is then brought up on another host, can you start up with the same PersistentVolume as the original container had? Which begs the question: would the new container get a new HostId, implying it would need to bootstrap into the environment? If it's a bootstrap, does the old one get deco'd/assassinated? 2. Scaling up/down. Scaling up would be relatively easy, as it should just kick off bootstrapping the node into the cluster, but what if you need to scale down? Would the container get deco'd by the scaling-down process, or just terminated, leaving you with potentially missing replicas? 3. Scaling up and increasing the RF of a particular keyspace: would there be a clean way to do this with the Kubernetes tooling? In the end I'm wondering how much of Kubernetes + Cassandra involves nodetool, and how much is just a Docker image where you need to manage all of that yourself (painfully). -- --mike
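As one concrete example of where nodetool still comes in, a hedged Kubernetes sketch (a fragment of a pod template, not from any official chart; image and names are placeholders): a preStop hook so that scaling down decommissions the node instead of just killing the container.

```
containers:
  - name: cassandra
    image: cassandra   # placeholder image/tag
    lifecycle:
      preStop:
        exec:
          # stream this node's data to the remaining replicas before termination
          command: ["/bin/sh", "-c", "nodetool decommission"]
```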
Re: Increasing replication factor and repair doesn't seem to work
Hi Luke, I've encountered similar problem before, could you please advise on following? 1) when you add 10.128.0.20, what are the seeds defined in cassandra.yaml? 2) when you add 10.128.0.20, were the data and cache directories in 10.128.0.20 empty? - /var/lib/cassandra/data - /var/lib/cassandra/saved_caches 3) if you do a compact in 10.128.0.3, what is the size shown in "Load" column in "nodetool status "? 4) when you do the full repair, did you use "nodetool repair" or "nodetool repair -full"? I'm asking this because Incremental Repair is the default for Cassandra 2.2 and later. Regards, Mike Yeap On Wed, May 25, 2016 at 8:01 AM, Bryan Cheng wrote: > Hi Luke, > > I've never found nodetool status' load to be useful beyond a general > indicator. > > You should expect some small skew, as this will depend on your current > compaction status, tombstones, etc. IIRC repair will not provide > consistency of intermediate states nor will it remove tombstones, it only > guarantees consistency in the final state. This means, in the case of > dropped hints or mutations, you will see differences in intermediate > states, and therefore storage footrpint, even in fully repaired nodes. This > includes intermediate UPDATE operations as well. > > Your one node with sub 1GB sticks out like a sore thumb, though. Where did > you originate the nodetool repair from? Remember that repair will only > ensure consistency for ranges held by the node you're running it on. While > I am not sure if missing ranges are included in this, if you ran nodetool > repair only on a machine with partial ownership, you will need to complete > repairs across the ring before data will return to full consistency. > > I would query some older data using consistency = ONE on the affected > machine to determine if you are actually missing data. There are a few > outstanding bugs in the 2.1.x and older release families that may result > in tombstone creation even without deletes, for example CASSANDRA-10547, > which impacts updates on collections in pre-2.1.13 Cassandra. > > You can also try examining the output of nodetool ring, which will give > you a breakdown of tokens and their associations within your cluster. > > --Bryan > > On Tue, May 24, 2016 at 3:49 PM, kurt Greaves > wrote: > >> Not necessarily considering RF is 2 so both nodes should have all >> partitions. Luke, are you sure the repair is succeeding? You don't have >> other keyspaces/duplicate data/extra data in your cassandra data directory? >> Also, you could try querying on the node with less data to confirm if it >> has the same dataset. >> >> On 24 May 2016 at 22:03, Bhuvan Rawal wrote: >> >>> For the other DC, it can be acceptable because partition reside on one >>> node, so say if you have a large partition, it may skew things a bit. >>> On May 25, 2016 2:41 AM, "Luke Jolly" wrote: >>> >>>> So I guess the problem may have been with the initial addition of the >>>> 10.128.0.20 node because when I added it in it never synced data I >>>> guess? It was at around 50 MB when it first came up and transitioned to >>>> "UN". After it was in I did the 1->2 replication change and tried repair >>>> but it didn't fix it. From what I can tell all the data on it is stuff >>>> that has been written since it came up. We never delete data ever so we >>>> should have zero tombstones. >>>> >>>> If I am not mistaken, only two of my nodes actually have all the data, >>>> 10.128.0.3 and 10.142.0.14 since they agree on the data amount. 
10.142.0.13 >>>> is almost a GB lower and then of course 10.128.0.20 which is missing >>>> over 5 GB of data. I tried running nodetool -local on both DCs and it >>>> didn't fix either one. >>>> >>>> Am I running into a bug of some kind? >>>> >>>> On Tue, May 24, 2016 at 4:06 PM Bhuvan Rawal >>>> wrote: >>>> >>>>> Hi Luke, >>>>> >>>>> You mentioned that replication factor was increased from 1 to 2. In >>>>> that case was the node bearing ip 10.128.0.20 carried around 3GB data >>>>> earlier? >>>>> >>>>> You can run nodetool repair with option -local to initiate repair >>>>> local datacenter for gce-us-central1. >>>>> >>>>> Also you may suspect that if a lot of data was deleted while the node >>>>> was down it may be having a lot of tombstones which is not needed to be >
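For reference, a hedged sketch of the RF change plus full-repair sequence being discussed; the keyspace name and DC names are placeholders and must match your own schema and snitch.

```
ALTER KEYSPACE my_ks WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'gce-us-central1': 2,
  'gce-us-east1': 2
};

-- then, on each node (full repair, since incremental is the default on 2.2+):
-- nodetool repair -full my_ks
```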
Re: Error while rebuilding a node: Stream failed
Hi George, are you using NetworkTopologyStrategy as the replication strategy for your keyspace? If yes, can you check the cassandra-rackdc.properties of this new node? https://issues.apache.org/jira/browse/CASSANDRA-8279 Regards, Mike Yeap On Wed, May 25, 2016 at 2:31 PM, George Sigletos wrote: > I am getting this error repeatedly while I am trying to add a new DC > consisting of one node in AWS to my existing cluster. I have tried 5 times > already. Running Cassandra 2.1.13 > > I have also set: > streaming_socket_timeout_in_ms: 360 > in all of my nodes > > Does anybody have any idea how this can be fixed? Thanks in advance > > Kind regards, > George > > P.S. > The complete stack trace: > -- StackTrace -- > java.lang.RuntimeException: Error while rebuilding node: Stream failed > at > org.apache.cassandra.service.StorageService.rebuild(StorageService.java:1076) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) > at java.lang.reflect.Method.invoke(Unknown Source) > at sun.reflect.misc.Trampoline.invoke(Unknown Source) > at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) > at java.lang.reflect.Method.invoke(Unknown Source) > at sun.reflect.misc.MethodUtil.invoke(Unknown Source) > at > com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(Unknown Source) > at > com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(Unknown Source) > at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(Unknown > Source) > at com.sun.jmx.mbeanserver.PerInterface.invoke(Unknown Source) > at com.sun.jmx.mbeanserver.MBeanSupport.invoke(Unknown Source) > at > com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(Unknown Source) > at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(Unknown Source) > at > javax.management.remote.rmi.RMIConnectionImpl.doOperation(Unknown Source) > at > javax.management.remote.rmi.RMIConnectionImpl.access$300(Unknown Source) > at > javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(Unknown > Source) > at > javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(Unknown > Source) > at javax.management.remote.rmi.RMIConnectionImpl.invoke(Unknown > Source) > at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) > at java.lang.reflect.Method.invoke(Unknown Source) > at sun.rmi.server.UnicastServerRef.dispatch(Unknown Source) > at sun.rmi.transport.Transport$2.run(Unknown Source) > at sun.rmi.transport.Transport$2.run(Unknown Source) > at java.security.AccessController.doPrivileged(Native Method) > at sun.rmi.transport.Transport.serviceCall(Unknown Source) > at sun.rmi.transport.tcp.TCPTransport.handleMessages(Unknown > Source) > at > sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source) > at > sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.access$400(Unknown > Source) > at > sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(Unknown Source) > at > sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(Unknown Source) > at java.security.AccessController.doPrivileged(Native Method) > at > sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source) > at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown > Source) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown > Source) > at 
java.lang.Thread.run(Unknown Source) >
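For reference, the file Mike mentions is conf/cassandra-rackdc.properties on the new node; a hedged sketch follows, where the DC and rack names are placeholders and must match the DC names referenced by the keyspace's NetworkTopologyStrategy options.

```
# cassandra-rackdc.properties on the new AWS node
dc=aws-us-east
rack=rack1
```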
Re: Out of memory issues
Hi Paolo, a) was there any large insertion done? b) are the a lot of files in the saved_caches directory? c) would you consider to increase the HEAP_NEWSIZE to, say, 1200M? Regards, Mike Yeap On Fri, May 27, 2016 at 12:39 AM, Paolo Crosato < paolo.cros...@targaubiest.com> wrote: > Hi, > > we are running a cluster of 4 nodes, each one has the same sizing: 2 > cores, 16G ram and 1TB of disk space. > > On every node we are running cassandra 2.0.17, oracle java version > "1.7.0_45", centos 6 with this kernel version 2.6.32-431.17.1.el6.x86_64 > > Two nodes are running just fine, the other two have started to go OOM at > every start. > > This is the error we get: > > INFO [ScheduledTasks:1] 2016-05-26 18:15:58,460 StatusLogger.java (line > 70) ReadRepairStage 0 0116 > 0 0 > INFO [ScheduledTasks:1] 2016-05-26 18:15:58,462 StatusLogger.java (line > 70) MutationStage31 1369 20526 > 0 0 > INFO [ScheduledTasks:1] 2016-05-26 18:15:58,590 StatusLogger.java (line > 70) ReplicateOnWriteStage 0 0 0 > 0 0 > INFO [ScheduledTasks:1] 2016-05-26 18:15:58,591 StatusLogger.java (line > 70) GossipStage 0 0335 > 0 0 > INFO [ScheduledTasks:1] 2016-05-26 18:16:04,195 StatusLogger.java (line > 70) CacheCleanupExecutor 0 0 0 > 0 0 > INFO [ScheduledTasks:1] 2016-05-26 18:16:06,526 StatusLogger.java (line > 70) MigrationStage0 0 0 > 0 0 > INFO [ScheduledTasks:1] 2016-05-26 18:16:06,527 StatusLogger.java (line > 70) MemoryMeter 1 4 26 > 0 0 > INFO [ScheduledTasks:1] 2016-05-26 18:16:06,527 StatusLogger.java (line > 70) ValidationExecutor0 0 0 > 0 0 > DEBUG [MessagingService-Outgoing-/10.255.235.19] 2016-05-26 18:16:06,518 > OutboundTcpConnection.java (line 290) attempting to connect to / > 10.255.235.19 > INFO [GossipTasks:1] 2016-05-26 18:16:22,912 Gossiper.java (line 992) > InetAddress /10.255.235.28 is now DOWN > INFO [ScheduledTasks:1] 2016-05-26 18:16:22,952 StatusLogger.java (line > 70) FlushWriter 1 5 47 > 025 > INFO [ScheduledTasks:1] 2016-05-26 18:16:22,953 StatusLogger.java (line > 70) InternalResponseStage 0 0 0 > 0 0 > ERROR [ReadStage:27] 2016-05-26 18:16:29,250 CassandraDaemon.java (line > 258) Exception in thread Thread[ReadStage:27,5,main] > java.lang.OutOfMemoryError: Java heap space > at > org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:347) > at > org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:392) > at > org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:355) > at > org.apache.cassandra.db.ColumnSerializer.deserializeColumnBody(ColumnSerializer.java:124) > at > org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:85) > at org.apache.cassandra.db.Column$1.computeNext(Column.java:75) > at org.apache.cassandra.db.Column$1.computeNext(Column.java:64) > at > com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) > at > com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) > at > com.google.common.collect.AbstractIterator.next(AbstractIterator.java:153) > at > org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.getNextBlock(IndexedSliceReader.java:434) > at > org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.fetchMoreData(IndexedSliceReader.java:387) > at > org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:145) > at > org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:45) > at > 
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) > at > com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) > at > org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:82) > at > org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:157) > at > org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:140) > at > org.apache.cassandra.utils.MergeIterator$Candidate.advance(Mer
Re: Node Stuck while restarting
Hi Bhuvan, how big are the current commit logs on the failed node, and what are the values of MAX_HEAP_SIZE and HEAP_NEWSIZE? Also, what are the values of the following properties in cassandra.yaml? memtable_allocation_type memtable_cleanup_threshold memtable_flush_writers memtable_heap_space_in_mb memtable_offheap_space_in_mb Regards, Mike Yeap On Sun, May 29, 2016 at 6:18 PM, Bhuvan Rawal wrote: > Hi, > > We are running a 6 node cluster in 2 DCs on DSC 3.0.3, with 3 nodes each. > One of the nodes was showing UNREACHABLE on the other nodes in nodetool > describecluster, and on that node all the others were showing UNREACHABLE, so > as a measure we restarted the node. > > But on doing that it is stuck, possibly at commit log replay, with these > messages in system.log: > > DEBUG [SlabPoolCleaner] 2016-05-29 14:07:28,156 ColumnFamilyStore.java:829 > - Enqueuing flush of batches: 226784704 (11%) on-heap, 0 (0%) off-heap > DEBUG [main] 2016-05-29 14:07:28,576 CommitLogReplayer.java:415 - > Replaying /commitlog/data/CommitLog-6-1464508993391.log (CL version 6, > messaging version 10, compression null) > DEBUG [main] 2016-05-29 14:07:28,781 ColumnFamilyStore.java:829 - > Enqueuing flush of batches: 207333510 (10%) on-heap, 0 (0%) off-heap > > It is stuck at the MemtablePostFlush / MemtableFlushWriter stages with > pending messages. This has been their status as per *nodetool tpstats* for a > long time: > MemtablePostFlush Active - 1 pending - 52 > completed - 16 > MemtableFlushWriter Active - 2 pending - 13 > completed - 15 > > > We restarted the node with the log level set to TRACE, but in vain. What > could be a possible contingency plan in such a scenario? > > Best Regards, > Bhuvan > >
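For context, the properties being asked about live in cassandra.yaml; a hedged sketch with illustrative values only (the heap/offheap space settings default to a quarter of the heap when left unset):

```
memtable_allocation_type: heap_buffers
memtable_flush_writers: 2
# memtable_cleanup_threshold defaults to 1 / (memtable_flush_writers + 1)
# memtable_cleanup_threshold: 0.33
# memtable_heap_space_in_mb: 2048
# memtable_offheap_space_in_mb: 2048
```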
Re: [Marketing Mail] Cassandra 2.1: Snapshot data changing while transferring
Hi Paul, what is the value of the snapshot_before_compaction property in your cassandra.yaml? Say another snapshot is being taken (because compaction kicked in and the snapshot_before_compaction property is set to true) at the moment you're tarring the snapshot folders... Maybe take a look at the records in system.compaction_history: select * from system.compaction_history; Regards, Mike Yeap On Tue, May 31, 2016 at 5:21 PM, Paul Dunkler wrote: > And - as an addition: > > Shouldn't it be documented that even snapshot files can change? > > I guess this might come from the incremental repairs... > > The repair time is stored in the sstable (RepairedAt timestamp metadata). > > > OK, that sounds interesting. > Could that also happen to incremental backup files as well? I had another > case where incremental backup files were totally deleted automagically. > > And - what is the suggested way to solve that problem? Should I try > tar-ing the snapshot again until nothing changes in between? > Or is there a way to "pause" the incremental repairs? > > > Cheers, > Reynald > > On 31/05/2016 11:03, Paul Dunkler wrote: > > Hi there, > > I am sometimes running into very strange errors while backing up snapshots > from a Cassandra cluster. > > Cassandra version: > 2.1.11 > > What I basically do: > 1. nodetool snapshot > 2. tar all snapshot folders into one file > 3. transfer them to another server > > What happens is that tar sometimes gives the error message "file > changed as we read it" while it's adding a .db file from the folder of the > previously created snapshot. > If I understand everything correctly, this SHOULD never happen. Snapshots > should be totally immutable, right? > > Am I maybe hitting a bug, or is there some rare case with running repair > operations or whatnot which can change snapshotted data? > I already searched through the Cassandra JIRA but couldn't find a bug which > looks related to this behaviour. > > Would love to get some help on this. > > — > Paul Dunkler > > > > — > Paul Dunkler > > ** * * UPLEX - Nils Goroll Systemoptimierung > > Scheffelstraße 32 > 22301 Hamburg > > tel +49 40 288 057 31 > mob +49 151 252 228 42 > fax +49 40 429 497 53 > > xmpp://pauldunk...@jabber.ccc.de > > http://uplex.de/ > > > — > Paul Dunkler > > ** * * UPLEX - Nils Goroll Systemoptimierung > > Scheffelstraße 32 > 22301 Hamburg > > tel +49 40 288 057 31 > mob +49 151 252 228 42 > fax +49 40 429 497 53 > > xmpp://pauldunk...@jabber.ccc.de > > http://uplex.de/ > >
Ring connection timeouts with 2.2.6
Hi, We have a 12 node 2.2.6 ring running in AWS, single DC with RF=3, that is sitting at <25% CPU, doing mostly writes, and not showing any particularly long GC times/pauses. By all observed metrics the ring is healthy and performing well. However, we are noticing a pretty consistent number of connection timeouts coming from the messaging service between various pairs of nodes in the ring. The "Connection.TotalTimeouts" meter metric shows 100k's of timeouts per minute, usually between two pairs of nodes. It seems to occur for several hours at a time, then may stop or move to other pairs of nodes in the ring. The metric "Connection.SmallMessageDroppedTasks." will also grow for one pair of the nodes in the TotalTimeouts metric. Looking at the debug log typically shows a large number of messages like the following on one of the nodes: StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0) We have cross node timeouts enabled, but ntp is running on all nodes and no node appears to have time drift. The network appears to be fine between nodes, with iperf tests showing that we have a lot of headroom. Any thoughts on what to look for? Can we increase thread count/pool sizes for the messaging service? Thanks, Mike -- Mike Heffner Librato, Inc.
Re: Ring connection timeouts with 2.2.6
One thing to add, if we do a rolling restart of the ring the timeouts disappear entirely for several hours and performance returns to normal. It's as if something is leaking over time, but we haven't seen any noticeable change in heap. On Thu, Jun 23, 2016 at 10:38 AM, Mike Heffner wrote: > Hi, > > We have a 12 node 2.2.6 ring running in AWS, single DC with RF=3, that is > sitting at <25% CPU, doing mostly writes, and not showing any particular > long GC times/pauses. By all observed metrics the ring is healthy and > performing well. > > However, we are noticing a pretty consistent number of connection timeouts > coming from the messaging service between various pairs of nodes in the > ring. The "Connection.TotalTimeouts" meter metric show 100k's of timeouts > per minute, usually between two pairs of nodes for several hours at a time. > It seems to occur for several hours at a time, then may stop or move to > other pairs of nodes in the ring. The metric > "Connection.SmallMessageDroppedTasks." will also grow for one pair of > the nodes in the TotalTimeouts metric. > > Looking at the debug log typically shows a large number of messages like > the following on one of the nodes: > > StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0) > > We have cross node timeouts enabled, but ntp is running on all nodes and > no node appears to have time drift. > > The network appears to be fine between nodes, with iperf tests showing > that we have a lot of headroom. > > Any thoughts on what to look for? Can we increase thread count/pool sizes > for the messaging service? > > Thanks, > > Mike > > -- > > Mike Heffner > Librato, Inc. > > -- Mike Heffner Librato, Inc.
Re: Ring connection timeouts with 2.2.6
Jens, We haven't noticed any particular large GC operations or even persistently high GC times. Mike On Thu, Jun 30, 2016 at 3:20 AM, Jens Rantil wrote: > Hi, > > Could it be garbage collection occurring on nodes that are more heavily > loaded? > > Cheers, > Jens > > Den sön 26 juni 2016 05:22Mike Heffner skrev: > >> One thing to add, if we do a rolling restart of the ring the timeouts >> disappear entirely for several hours and performance returns to normal. >> It's as if something is leaking over time, but we haven't seen any >> noticeable change in heap. >> >> On Thu, Jun 23, 2016 at 10:38 AM, Mike Heffner wrote: >> >>> Hi, >>> >>> We have a 12 node 2.2.6 ring running in AWS, single DC with RF=3, that >>> is sitting at <25% CPU, doing mostly writes, and not showing any particular >>> long GC times/pauses. By all observed metrics the ring is healthy and >>> performing well. >>> >>> However, we are noticing a pretty consistent number of connection >>> timeouts coming from the messaging service between various pairs of nodes >>> in the ring. The "Connection.TotalTimeouts" meter metric show 100k's of >>> timeouts per minute, usually between two pairs of nodes for several hours >>> at a time. It seems to occur for several hours at a time, then may stop or >>> move to other pairs of nodes in the ring. The metric >>> "Connection.SmallMessageDroppedTasks." will also grow for one pair of >>> the nodes in the TotalTimeouts metric. >>> >>> Looking at the debug log typically shows a large number of messages like >>> the following on one of the nodes: >>> >>> StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0) >>> >>> We have cross node timeouts enabled, but ntp is running on all nodes and >>> no node appears to have time drift. >>> >>> The network appears to be fine between nodes, with iperf tests showing >>> that we have a lot of headroom. >>> >>> Any thoughts on what to look for? Can we increase thread count/pool >>> sizes for the messaging service? >>> >>> Thanks, >>> >>> Mike >>> >>> -- >>> >>> Mike Heffner >>> Librato, Inc. >>> >>> >> >> >> -- >> >> Mike Heffner >> Librato, Inc. >> >> -- > > Jens Rantil > Backend Developer @ Tink > > Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden > For urgent matters you can reach me at +46-708-84 18 32. > -- Mike Heffner Librato, Inc.
Re: Ring connection timeouts with 2.2.6
Jeff, Thanks, yeah we updated to the 2.16.4 driver version from source. I don't believe we've hit the bugs mentioned in earlier driver versions. Mike On Mon, Jul 4, 2016 at 11:16 PM, Jeff Jirsa wrote: > AWS ubuntu 14.04 AMI ships with buggy enhanced networking driver – > depending on your instance types / hypervisor choice, you may want to > ensure you’re not seeing that bug. > > > > *From: *Mike Heffner > *Reply-To: *"user@cassandra.apache.org" > *Date: *Friday, July 1, 2016 at 1:10 PM > *To: *"user@cassandra.apache.org" > *Cc: *Peter Norton > *Subject: *Re: Ring connection timeouts with 2.2.6 > > > > Jens, > > > > We haven't noticed any particular large GC operations or even persistently > high GC times. > > > > Mike > > > > On Thu, Jun 30, 2016 at 3:20 AM, Jens Rantil wrote: > > Hi, > > Could it be garbage collection occurring on nodes that are more heavily > loaded? > > Cheers, > Jens > > > > Den sön 26 juni 2016 05:22Mike Heffner skrev: > > One thing to add, if we do a rolling restart of the ring the timeouts > disappear entirely for several hours and performance returns to normal. > It's as if something is leaking over time, but we haven't seen any > noticeable change in heap. > > > > On Thu, Jun 23, 2016 at 10:38 AM, Mike Heffner wrote: > > Hi, > > > > We have a 12 node 2.2.6 ring running in AWS, single DC with RF=3, that is > sitting at <25% CPU, doing mostly writes, and not showing any particular > long GC times/pauses. By all observed metrics the ring is healthy and > performing well. > > > > However, we are noticing a pretty consistent number of connection timeouts > coming from the messaging service between various pairs of nodes in the > ring. The "Connection.TotalTimeouts" meter metric show 100k's of timeouts > per minute, usually between two pairs of nodes for several hours at a time. > It seems to occur for several hours at a time, then may stop or move to > other pairs of nodes in the ring. The metric > "Connection.SmallMessageDroppedTasks." will also grow for one pair of > the nodes in the TotalTimeouts metric. > > > > Looking at the debug log typically shows a large number of messages like > the following on one of the nodes: > > > > StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 > <https://urldefense.proofpoint.com/v2/url?u=http-3A__172.26.33.177&d=CwMFaQ&c=08AGY6txKsvMOP6lYkHQpPMRA1U6kqhAwGa8-0QCg3M&r=yfYEBHVkX6l0zImlOIBID0gmhluYPD5Jje-3CtaT3ow&m=KlMh_-rpcOH2Mdf3i2XGCQhtU4ZuD0Y37WpHKGlKtnQ&s=ihxNa3DwQPrfqEURi_UIncjESJC_XexR_AjY81coG8U&e=> > (ttl 0) > > We have cross node timeouts enabled, but ntp is running on all nodes and > no node appears to have time drift. > > > > The network appears to be fine between nodes, with iperf tests showing > that we have a lot of headroom. > > > > Any thoughts on what to look for? Can we increase thread count/pool sizes > for the messaging service? > > > > Thanks, > > > > Mike > > > > -- > > > Mike Heffner > > Librato, Inc. > > > > > > > > -- > > > Mike Heffner > > Librato, Inc. > > > > -- > > Jens Rantil > Backend Developer @ Tink > > Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden > For urgent matters you can reach me at +46-708-84 18 32. > > > > > > -- > > > Mike Heffner > > Librato, Inc. > > > -- Mike Heffner Librato, Inc.
Re: Ring connection timeouts with 2.2.6
Just to followup on this post with a couple of more data points: 1) We upgraded to 2.2.7 and did not see any change in behavior. 2) However, what *has* fixed this issue for us was disabling msg coalescing by setting: otc_coalescing_strategy: DISABLED We were using the default setting before (time horizon I believe). We see periodic timeouts on the ring (once every few hours), but they are brief and don't impact latency. With msg coalescing turned on we would see these timeouts persist consistently after an initial spike. My guess is that something in the coalescing logic is disturbed by the initial timeout spike which leads to dropping all / high-percentage of all subsequent traffic. We are planning to continue production use with msg coaleasing disabled for now and may run tests in our staging environments to identify where the coalescing is breaking this. Mike On Tue, Jul 5, 2016 at 12:14 PM, Mike Heffner wrote: > Jeff, > > Thanks, yeah we updated to the 2.16.4 driver version from source. I don't > believe we've hit the bugs mentioned in earlier driver versions. > > Mike > > On Mon, Jul 4, 2016 at 11:16 PM, Jeff Jirsa > wrote: > >> AWS ubuntu 14.04 AMI ships with buggy enhanced networking driver – >> depending on your instance types / hypervisor choice, you may want to >> ensure you’re not seeing that bug. >> >> >> >> *From: *Mike Heffner >> *Reply-To: *"user@cassandra.apache.org" >> *Date: *Friday, July 1, 2016 at 1:10 PM >> *To: *"user@cassandra.apache.org" >> *Cc: *Peter Norton >> *Subject: *Re: Ring connection timeouts with 2.2.6 >> >> >> >> Jens, >> >> >> >> We haven't noticed any particular large GC operations or even >> persistently high GC times. >> >> >> >> Mike >> >> >> >> On Thu, Jun 30, 2016 at 3:20 AM, Jens Rantil wrote: >> >> Hi, >> >> Could it be garbage collection occurring on nodes that are more heavily >> loaded? >> >> Cheers, >> Jens >> >> >> >> Den sön 26 juni 2016 05:22Mike Heffner skrev: >> >> One thing to add, if we do a rolling restart of the ring the timeouts >> disappear entirely for several hours and performance returns to normal. >> It's as if something is leaking over time, but we haven't seen any >> noticeable change in heap. >> >> >> >> On Thu, Jun 23, 2016 at 10:38 AM, Mike Heffner wrote: >> >> Hi, >> >> >> >> We have a 12 node 2.2.6 ring running in AWS, single DC with RF=3, that is >> sitting at <25% CPU, doing mostly writes, and not showing any particular >> long GC times/pauses. By all observed metrics the ring is healthy and >> performing well. >> >> >> >> However, we are noticing a pretty consistent number of connection >> timeouts coming from the messaging service between various pairs of nodes >> in the ring. The "Connection.TotalTimeouts" meter metric show 100k's of >> timeouts per minute, usually between two pairs of nodes for several hours >> at a time. It seems to occur for several hours at a time, then may stop or >> move to other pairs of nodes in the ring. The metric >> "Connection.SmallMessageDroppedTasks." will also grow for one pair of >> the nodes in the TotalTimeouts metric. 
>> >> >> >> Looking at the debug log typically shows a large number of messages like >> the following on one of the nodes: >> >> >> >> StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 >> <https://urldefense.proofpoint.com/v2/url?u=http-3A__172.26.33.177&d=CwMFaQ&c=08AGY6txKsvMOP6lYkHQpPMRA1U6kqhAwGa8-0QCg3M&r=yfYEBHVkX6l0zImlOIBID0gmhluYPD5Jje-3CtaT3ow&m=KlMh_-rpcOH2Mdf3i2XGCQhtU4ZuD0Y37WpHKGlKtnQ&s=ihxNa3DwQPrfqEURi_UIncjESJC_XexR_AjY81coG8U&e=> >> (ttl 0) >> >> We have cross node timeouts enabled, but ntp is running on all nodes and >> no node appears to have time drift. >> >> >> >> The network appears to be fine between nodes, with iperf tests showing >> that we have a lot of headroom. >> >> >> >> Any thoughts on what to look for? Can we increase thread count/pool sizes >> for the messaging service? >> >> >> >> Thanks, >> >> >> >> Mike >> >> >> >> -- >> >> >> Mike Heffner >> >> Librato, Inc. >> >> >> >> >> >> >> >> -- >> >> >> Mike Heffner >> >> Librato, Inc. >> >> >> >> -- >> >> Jens Rantil >> Backend Developer @ Tink >> >> Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden >> For urgent matters you can reach me at +46-708-84 18 32. >> >> >> >> >> >> -- >> >> >> Mike Heffner >> >> Librato, Inc. >> >> >> > > > > -- > > Mike Heffner > Librato, Inc. > > -- Mike Heffner Librato, Inc.
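For reference, the change described above is a single key in cassandra.yaml; a minimal sketch, assuming an otherwise stock 2.2 configuration (other valid values for this key are FIXED, MOVINGAVERAGE and TIMEHORIZON):

    # cassandra.yaml -- disable outbound TCP message coalescing (requires a node restart)
    otc_coalescing_strategy: DISABLED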
Re: Ring connection timeouts with 2.2.6
Garo, No, we didn't notice any change in system load, just the expected spike in packet counts. Mike On Wed, Jul 20, 2016 at 3:49 PM, Juho Mäkinen wrote: > Just to pick this up: Did you see any system load spikes? I'm tracing a > problem on 2.2.7 where my cluster sees load spikes up to 20-30, when the > normal average load is around 3-4. So far I haven't found any good reason, > but I'm going to try otc_coalescing_strategy: disabled tomorrow. > > - Garo > > On Fri, Jul 15, 2016 at 6:16 PM, Mike Heffner wrote: > >> Just to followup on this post with a couple of more data points: >> >> 1) >> >> We upgraded to 2.2.7 and did not see any change in behavior. >> >> 2) >> >> However, what *has* fixed this issue for us was disabling msg coalescing >> by setting: >> >> otc_coalescing_strategy: DISABLED >> >> We were using the default setting before (time horizon I believe). >> >> We see periodic timeouts on the ring (once every few hours), but they are >> brief and don't impact latency. With msg coalescing turned on we would see >> these timeouts persist consistently after an initial spike. My guess is >> that something in the coalescing logic is disturbed by the initial timeout >> spike which leads to dropping all / high-percentage of all subsequent >> traffic. >> >> We are planning to continue production use with msg coaleasing disabled >> for now and may run tests in our staging environments to identify where the >> coalescing is breaking this. >> >> Mike >> >> On Tue, Jul 5, 2016 at 12:14 PM, Mike Heffner wrote: >> >>> Jeff, >>> >>> Thanks, yeah we updated to the 2.16.4 driver version from source. I >>> don't believe we've hit the bugs mentioned in earlier driver versions. >>> >>> Mike >>> >>> On Mon, Jul 4, 2016 at 11:16 PM, Jeff Jirsa >>> wrote: >>> >>>> AWS ubuntu 14.04 AMI ships with buggy enhanced networking driver – >>>> depending on your instance types / hypervisor choice, you may want to >>>> ensure you’re not seeing that bug. >>>> >>>> >>>> >>>> *From: *Mike Heffner >>>> *Reply-To: *"user@cassandra.apache.org" >>>> *Date: *Friday, July 1, 2016 at 1:10 PM >>>> *To: *"user@cassandra.apache.org" >>>> *Cc: *Peter Norton >>>> *Subject: *Re: Ring connection timeouts with 2.2.6 >>>> >>>> >>>> >>>> Jens, >>>> >>>> >>>> >>>> We haven't noticed any particular large GC operations or even >>>> persistently high GC times. >>>> >>>> >>>> >>>> Mike >>>> >>>> >>>> >>>> On Thu, Jun 30, 2016 at 3:20 AM, Jens Rantil >>>> wrote: >>>> >>>> Hi, >>>> >>>> Could it be garbage collection occurring on nodes that are more heavily >>>> loaded? >>>> >>>> Cheers, >>>> Jens >>>> >>>> >>>> >>>> Den sön 26 juni 2016 05:22Mike Heffner skrev: >>>> >>>> One thing to add, if we do a rolling restart of the ring the timeouts >>>> disappear entirely for several hours and performance returns to normal. >>>> It's as if something is leaking over time, but we haven't seen any >>>> noticeable change in heap. >>>> >>>> >>>> >>>> On Thu, Jun 23, 2016 at 10:38 AM, Mike Heffner >>>> wrote: >>>> >>>> Hi, >>>> >>>> >>>> >>>> We have a 12 node 2.2.6 ring running in AWS, single DC with RF=3, that >>>> is sitting at <25% CPU, doing mostly writes, and not showing any particular >>>> long GC times/pauses. By all observed metrics the ring is healthy and >>>> performing well. >>>> >>>> >>>> >>>> However, we are noticing a pretty consistent number of connection >>>> timeouts coming from the messaging service between various pairs of nodes >>>> in the ring. 
The "Connection.TotalTimeouts" meter metric show 100k's of >>>> timeouts per minute, usually between two pairs of nodes for several hours >>>> at a time. It seems to occur for several hours at a time, then may stop or >>>> move to other pairs of nodes in the ring. The metric >
failing bootstraps with OOM
Hi All - I am trying to bootstrap a replacement node in a cluster, but it consistently fails to bootstrap because of OOM exceptions. For almost a week I've been going through cycles of bootstrapping, finding errors, then restarting / resuming bootstrap, and I am struggling to move forward. Sometimes the bootstrapping node itself fails, which usually manifests first as very high GC times (sometimes 30s+!), then nodetool commands start to fail with timeouts, then the node will crash with an OOM exception. Other times, a node streaming data to this bootstrapping node will have a similar failure. In either case, when it happens I need to restart the crashed node, then resume the bootstrap.

On top of these issues, when I do need to restart a node it takes a long time (http://stackoverflow.com/questions/40141739/why-does-cassandra-sometimes-take-a-hours-to-start). This exacerbates the problem because it takes so long to find out if a change to the cluster helps or if it still fails. I am in the process of upgrading all nodes in the cluster from m4.xlarge to c4.4xlarge, and I am running Cassandra DDC 3.5 on all nodes. The cluster has 26 nodes spread across 4 regions in EC2. Here is some other relevant cluster info (also in the Stack Overflow post):

Cluster Info
* Cassandra DDC 3.5
* EC2MultiRegionSnitch
* m4.xlarge, moving to c4.4xlarge

Schema Info
* 3 CF's, all 'write once' (i.e. no updates), 1 week ttl, STCS (default)
* no secondary indexes

I am unsure what to try next. The node that is currently having this bootstrap problem is a pretty beefy box, with 16 cores, 30G of RAM, and a 3.2T EBS volume. The slow startup time might be because of the issues with a high number of SSTables that Jeff Jirsa mentioned in a comment on the SO post, but I am at a loss for the OOM issues. I've tried:
* Changing from CMS to G1 GC, which seemed to have helped a bit
* Upgrading from 3.5 to 3.9, which did not seem to help
* Upgrading instance types from m4.xlarge to c4.4xlarge, which seems to help, but I'm still having issues

I'd appreciate any suggestions on what else I can try to track down the cause of these OOM exceptions.

- Mike
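The restart/resume cycle described above boils down to a few nodetool commands; a minimal sketch, assuming a 2.2+/3.x node that is joining the ring:

    # resume a bootstrap that died part-way instead of re-streaming from scratch
    nodetool bootstrap resume
    # see which peers are still streaming to the joining node and how much is left
    nodetool netstats
    # check whether compaction on the joining node is falling behind
    nodetool compactionstats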
Re: failing bootstraps with OOM
Hi Alex - I do monitor sstable counts and pending compactions, but probably not closely enough. In 3/4 regions the cluster is running in, both counts are very high - ~30-40k sstables for one particular CF, and on many nodes >1k pending compactions. I had noticed this before, but I didn't have a good sense of what a "high" number for these values was. It makes sense to me why this would cause the issues I've seen. After increasing concurrent_compactors and compaction_throughput_mb_per_sec (to 8 and 64mb, respectively), I'm starting to see those counts go down steadily. Hopefully that will resolve the OOM issues, but it looks like it will take a while for compactions to catch up. Thanks for the suggestions, Alex From: Oleksandr Shulgin mailto:oleksandr.shul...@zalando.de>> Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" mailto:user@cassandra.apache.org>> Date: Wednesday, November 2, 2016 at 1:07 PM To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" mailto:user@cassandra.apache.org>> Subject: Re: failing bootstraps with OOM On Wed, Nov 2, 2016 at 3:35 PM, Mike Torra mailto:mto...@demandware.com>> wrote: > > Hi All - > > I am trying to bootstrap a replacement node in a cluster, but it consistently > fails to bootstrap because of OOM exceptions. For almost a week I've been > going through cycles of bootstrapping, finding errors, then restarting / > resuming bootstrap, and I am struggling to move forward. Sometimes the > bootstrapping node itself fails, which usually manifests first as very high > GC times (sometimes 30s+!), then nodetool commands start to fail with > timeouts, then the node will crash with an OOM exception. Other times, a node > streaming data to this bootstrapping node will have a similar failure. In > either case, when it happens I need to restart the crashed node, then resume > the bootstrap. > > On top of these issues, when I do need to restart a node it takes a lng > time > (http://stackoverflow.com/questions/40141739/why-does-cassandra-sometimes-take-a-hours-to-start). > This exasperates the problem because it takes so long to find out if a > change to the cluster helps or if it still fails. I am in the process of > upgrading all nodes in the cluster from m4.xlarge to c4.4xlarge, and I am > running Cassandra DDC 3.5 on all nodes. The cluster has 26 nodes spread > across 4 regions in EC2. Here is some other relevant cluster info (also in > stack overflow post): > > Cluster Info > > Cassandra DDC 3.5 > EC2MultiRegionSnitch > m4.xlarge, moving to c4.4xlarge > > Schema Info > > 3 CF's, all 'write once' (ie no updates), 1 week ttl, STCS (default) > no secondary indexes > > I am unsure what to try next. The node that is currently having this > bootstrap problem is a pretty beefy box, with 16 cores, 30G of ram, and a > 3.2T EBS volume. The slow startup time might be because of the issues with a > high number of SSTables that Jeff Jirsa mentioned in a comment on the SO > post, but I am at a loss for the OOM issues. I've tried: > > Changing from CMS to G1 GC, which seemed to have helped a bit > Upgrading from 3.5 to 3.9, which did not seem to help > Upgrading instance types from m4.xlarge to c4.4xlarge, which seems to help, > but I'm still having issues > > I'd appreciate any suggestions on what else I can try to track down the cause > of these OOM exceptions. Hi, Do you monitor pending compactions and actual number of SSTable files? 
On startup Cassandra needs to touch most of the data files and also seems to keep some metadata about every relevant file in memory. We once got into a situation where we ended up with hundreds of thousands of files per node, which resulted in OOMs on every other node of the ring, and startup time was over half an hour (this was on version 2.1). If you have many more files than you expect, then you should check and adjust your concurrent_compactors and compaction_throughput_mb_per_sec settings. Increase concurrent_compactors if you're behind (the pending compactions metric is a hint) and consider un-throttling compaction until your situation is back to normal. Cheers, -- Alex
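A rough sketch of the checks and knobs described above; the yaml values shown are the ones Mike mentions in his reply (8 compactors, 64 MB/s), not general recommendations, and the keyspace name is a placeholder:

    # pending compactions and per-table SSTable counts
    nodetool compactionstats
    nodetool cfstats my_keyspace | grep -e "Table:" -e "SSTable count"
    # un-throttle compaction at runtime while catching up (0 = unlimited)
    nodetool setcompactionthroughput 0
    # cassandra.yaml (requires a restart), e.g.:
    concurrent_compactors: 8
    compaction_throughput_mb_per_sec: 64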
weird jvm metrics
Hi There - I recently upgraded from cassandra 3.5 to 3.9 (DDC), and I noticed that the "new" jvm metrics are reporting with an extra '.' character in them. Here is a snippet of what I see from one of my nodes:

ubuntu@ip-10-0-2-163:~$ sudo tcpdump -i eth0 -v dst port 2003 -A | grep 'jvm'
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
.Je..l>.pi.cassandra.us-east-1.cassy-node1.jvm.buffers..direct.capacity 762371494 1482960946
pi.cassandra.us-east-1.cassy-node1.jvm.buffers..direct.count 3054 1482960946
pi.cassandra.us-east-1.cassy-node1.jvm.buffers..direct.used 762371496 1482960946
pi.cassandra.us-east-1.cassy-node1.jvm.buffers..mapped.capacity 515226631134 1482960946
pi.cassandra.us-east-1.cassy-node1.jvm.buffers..mapped.count 45572 1482960946
pi.cassandra.us-east-1.cassy-node1.jvm.buffers..mapped.used 515319762610 1482960946
pi.cassandra.us-east-1.cassy-node1.jvm.fd.usage 0.00 1482960946

My metrics.yaml looks like this:

graphite:
  - period: 60
    timeunit: 'SECONDS'
    prefix: 'pi.cassandra.us-east-1.cassy-node1'
    hosts:
      - host: '#RELAY_HOST#'
        port: 2003
    predicate:
      color: "white"
      useQualifiedName: true
      patterns:
        - "^org.+"
        - "^jvm.+"
        - "^java.lang.+"

All the org.* metrics come through fine, and the jvm.fd.usage metric strangely comes through fine, too. The rest of the jvm.* metrics have this extra '.' character that causes them to not show up in graphite. Am I missing something silly here? Appreciate any help or suggestions.

- Mike
Re: weird jvm metrics
Just bumping - has anyone seen this before? http://stackoverflow.com/questions/41446352/cassandra-3-9-jvm-metrics-have-bad-name From: Mike Torra mailto:mto...@demandware.com>> Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" mailto:user@cassandra.apache.org>> Date: Wednesday, December 28, 2016 at 4:49 PM To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" mailto:user@cassandra.apache.org>> Subject: weird jvm metrics Hi There - I recently upgraded from cassandra 3.5 to 3.9 (DDC), and I noticed that the "new" jvm metrics are reporting with an extra '.' character in them. Here is a snippet of what I see from one of my nodes: ubuntu@ip-10-0-2-163:~$ sudo tcpdump -i eth0 -v dst port 2003 -A | grep 'jvm' tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes .Je..l>.pi.cassandra.us-east-1.cassy-node1.jvm.buffers..direct.capacity 762371494 1482960946 pi.cassandra.us-east-1.cassy-node1.jvm.buffers..direct.count 3054 1482960946 pi.cassandra.us-east-1.cassy-node1.jvm.buffers..direct.used 762371496 1482960946 pi.cassandra.us-east-1.cassy-node1.jvm.buffers..mapped.capacity 515226631134 1482960946 pi.cassandra.us-east-1.cassy-node1.jvm.buffers..mapped.count 45572 1482960946 pi.cassandra.us-east-1.cassy-node1.jvm.buffers..mapped.used 515319762610 1482960946 pi.cassandra.us-east-1.cassy-node1.jvm.fd.usage 0.00 1482960946 My metrics.yaml looks like this: graphite: - period: 60 timeunit: 'SECONDS' prefix: 'pi.cassandra.us-east-1.cassy-node1' hosts: - host: '#RELAY_HOST#' port: 2003 predicate: color: "white" useQualifiedName: true patterns: - "^org.+" - "^jvm.+" - "^java.lang.+" All the org.* metrics come through fine, and the jvm.fd.usage metric strangely comes through fine, too. The rest of the jvm.* metrics have this extra '.' character that causes them to not show up in graphite. Am I missing something silly here? Appreciate any help or suggestions. - Mike
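The doubled '.' looks like an empty component in the generated metric name. Until that is fixed in Cassandra itself, one possible stopgap, assuming the metrics pass through a carbon-aggregator with rewrite rules enabled, is to collapse the doubled dot on the Graphite side; an untested sketch:

    # rewrite-rules.conf: collapse 'jvm.buffers..direct' style names before they are stored
    [pre]
    \.\. = .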
implementing a 'sorted set' on top of cassandra
We currently use redis to store sorted sets that we increment many, many times more than we read. For example, only about 5% of these sets are ever read. We are getting to the point where redis is becoming difficult to scale (currently at >20 nodes). We've started using cassandra for other things, and now we are experimenting to see if having a similar 'sorted set' data structure is feasible in cassandra. My approach so far is:

1. Use a counter CF to store the values I want to sort by
2. Periodically read in all key/values in the counter CF and sort in the client application (~every five minutes or so)
3. Write back to a different CF with the ordered keys I care about

Does this seem crazy? Is there a simpler way to do this in cassandra?
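A minimal CQL sketch of that approach, with made-up table and column names: counters accumulate the scores, and a second table holds the periodically recomputed ranking so reads become a single partition slice.

    -- step 1: accumulate the values to sort by
    CREATE TABLE scores (
        set_id  text,
        item_id text,
        score   counter,
        PRIMARY KEY (set_id, item_id)
    );
    UPDATE scores SET score = score + 1 WHERE set_id = 'set42' AND item_id = 'item_1';

    -- steps 2-3: the client reads the partition, sorts, and rewrites the ranked view
    CREATE TABLE ranked_sets (
        set_id  text,
        rank    int,
        item_id text,
        score   bigint,
        PRIMARY KEY (set_id, rank)
    );
    SELECT item_id, score FROM ranked_sets WHERE set_id = 'set42' LIMIT 100;

Rewriting the ranked view in place is also what generates the tombstones discussed in the replies that follow.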
Re: implementing a 'sorted set' on top of cassandra
Thanks for the feedback everyone! Redis `zincryby` and `zrangebyscore` is indeed what we use today. Caching the resulting 'sorted sets' in redis is exactly what I plan to do. There will be tens of thousands of these sorted sets, each generally with <10k items (with maybe a few exceptions going a bit over that). The reason to periodically calculate the set and store it in cassandra is to avoid having the client do that work, when the client only really cares about the top 100 or so items at any given time. Being truly "real time" is not critical for us, but it is a selling point to be as up to date as possible. I'd like to understand the performance issue of frequently updating these sets. I understand that every time I 'regenerate' the sorted set, any rows that change will create a tombstone - for example, if "item_1" is in first place and "item_2" is in second place, then they switch on the next update, that would be two tombstones. Do you think this will be a big enough problem that it is worth doing the sorting work client side, on demand, and just try to eat the performance hit there? My thought was to make a tradeoff by using more cassandra disk space (ie pre calculating all sets), in exchange for faster reads when requests actually come in that need this data. From: Benjamin Roth mailto:benjamin.r...@jaumo.com>> Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" mailto:user@cassandra.apache.org>> Date: Saturday, January 14, 2017 at 1:25 PM To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" mailto:user@cassandra.apache.org>> Subject: Re: implementing a 'sorted set' on top of cassandra Mike mentioned "increment" in his initial post. That let me think of a case with increments and fetching a top list by a counter like https://redis.io/commands/zincrby https://redis.io/commands/zrangebyscore 1. Cassandra is absolutely not made to sort by a counter (or a non-counter numeric incrementing value) but it is made to store counters. In this case a partition could be seen as a set. 2. I thought of CS for persistence and - depending on the app requirements like real-time and set size - still use redis as a read cache 2017-01-14 18:45 GMT+01:00 Jonathan Haddad mailto:j...@jonhaddad.com>>: Sorted sets don't have a requirement of incrementing / decrementing. They're commonly used for thing like leaderboards where the values are arbitrary. In Redis they are implemented with 2 data structures for efficient lookups of either key or value. No getting around that as far as I know. In Cassandra they would require using the score as a clustering column in order to select top N scores (and paginate). That means a tombstone whenever the value for a key in the set changes. In sets with high rates of change that means a lot of tombstones and thus terrible performance. On Sat, Jan 14, 2017 at 9:40 AM DuyHai Doan mailto:doanduy...@gmail.com>> wrote: Sorting on an "incremented" numeric value has always been a nightmare to be done properly in C* Either use Counter type but then no sorting is possible since counter cannot be used as type for clustering column (which allows sort) Or use simple numeric type on clustering column but then to increment the value *concurrently* and *safely* it's prohibitive (SELECT to fetch current value + UPDATE ... IF value = ) + retry On Sat, Jan 14, 2017 at 8:54 AM, Benjamin Roth mailto:benjamin.r...@jaumo.com>> wrote: If your proposed solution is crazy depends on your needs :) It sounds like you can live with not-realtime data. 
So it is ok to cache it. Why preproduce the results if you only need 5% of them? Why not use redis as a cache with expiring sorted sets that are filled on demand from cassandra partitions with counters? So redis has much less to do and can scale much better. And you are not limited on keeping all data in ram as cache data is volatile and can be evicted on demand. If this is effective also depends on the size of your sets. CS wont be able to sort them by score for you, so you will have to load the complete set to redis for caching and / or do sorting in your app on demand. This certainly won't work out well with sets with millions of entries. 2017-01-13 23:14 GMT+01:00 Mike Torra mailto:mto...@demandware.com>>: We currently use redis to store sorted sets that we increment many, many times more than we read. For example, only about 5% of these sets are ever read. We are getting to the point where redis is becoming difficult to scale (currently at >20 nodes). We've started using cassandra for other things, and now we are experimenting to see if having a similar 'sorted set' data structure is feasible in cassandra. My approach so far is: 1. Use a counter CF to store the values I wan
lots of connection timeouts around same time every day
Hi there - Cluster info: C* 3.9, replicated across 4 EC2 regions (us-east-1, us-west-2, eu-west-1, ap-southeast-1), c4.4xlarge Around the same time every day (~7-8am EST), 2 DC's (eu-west-1 and ap-southeast-1) in our cluster start experiencing a high number of timeouts (Connection.TotalTimeouts metric). The issue seems to occur equally on all nodes in the impacted DC. I'm trying to track down exactly what is timing out, and what is causing it to happen. With debug logs, I can see many messages like this: DEBUG [GossipTasks:1] 2017-02-16 15:39:42,274 Gossiper.java:337 - Convicting /xx.xx.xx.xx with status NORMAL - alive false DEBUG [GossipTasks:1] 2017-02-16 15:39:42,274 Gossiper.java:337 - Convicting /xx.xx.xx.xx with status removed - alive false DEBUG [GossipTasks:1] 2017-02-16 15:39:42,274 Gossiper.java:337 - Convicting /xx.xx.xx.xx with status shutdown - alive false The 'status removed' node I `nodetool remove`'d from the cluster, so I'm not sure why that appears. The node mentioned in the 'status NORMAL' line has constant warnings like this: WARN [GossipTasks:1] 2017-02-16 15:40:02,845 Gossiper.java:771 - Gossip stage has 453589 pending tasks; skipping status check (no nodes will be marked down) These lines seem to go away after restarting that node, and on the original node, the 'Convicting' lines go away as well. However, the timeout counts do not seem to change. Why does restarting the node seem to fix gossip falling behind? There are also a lot of debug log messages like this: DEBUG [GossipStage:1] 2017-02-16 15:45:04,849 FailureDetector.java:456 - Ignoring interval time of 2355580769 for /xx.xx.xx.xx Could these be related to the high number of timeouts I see? I've also tried increasing the value of phi_convict_threshold to 12, as suggested here: https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archDataDistributeFailDetect.html. This does not seem to have changed anything on the nodes that I've changed it on. I appreciate any suggestions on what else to try in order to track down these timeouts. - Mike
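For reference, a sketch of where the knobs and checks mentioned above live; the threshold value shown is the one from the message, not a recommendation:

    # cassandra.yaml: raise the failure detector threshold (restart required)
    phi_convict_threshold: 12

    # on a suspect node: is the Gossip stage backing up, and what state does it see?
    nodetool tpstats | grep -i -e "Pool Name" -e Gossip
    nodetool gossipinfo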
Re: lots of connection timeouts around same time every day
I can't say that I have tried that while the issue is going on, but I have done such rolling restarts for sure, and the timeouts still occur every day. What would a rolling restart do to fix the issue? In fact, as I write this, I am restarting each node one by one in the eu-west-1 datacenter, and in us-east-1 I am seeing lots of timeouts - both the metrics 'Connection.TotalTimeouts.m1_rate' and 'ClientRequest.Latency.Read.p999' flatlining at ~6s. Why would restarting in one datacenter impact reads in another? Any suggestions on what to investigate next, or what changes to try in the cluster? Happy to provide any more info as well :) On Fri, Feb 17, 2017 at 6:05 AM, kurt greaves wrote: > have you tried a rolling restart of the entire DC? >
Significant drop in storage load after 2.1.6->2.1.8 upgrade
Hi all, I've been upgrading several of our rings from 2.1.6 to 2.1.8 and I've noticed that after the upgrade our storage load drops significantly (I've seen up to an 80% drop). I believe most of the data that is dropped is tombstoned (via TTL expiration) and I haven't detected any data loss yet. However, can someone point me to what changed between 2.1.6 and 2.1.8 that would lead to such a significant drop in tombstoned data? Looking at the changelog there's nothing that jumps out at me. This is a CF definition from one of the CFs that had a significant drop:

> describe measures_mid_1;

CREATE TABLE "Metrics".measures_mid_1 (
    key blob,
    c1 int,
    c2 blob,
    c3 blob,
    PRIMARY KEY (key, c1, c2)
) WITH COMPACT STORAGE
    AND CLUSTERING ORDER BY (c1 ASC, c2 ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 0
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

Thanks, Mike -- Mike Heffner Librato, Inc.
Re: Significant drop in storage load after 2.1.6->2.1.8 upgrade
Nate, Thanks. I dug through the changes a bit more and I believe my original observation may have been due to: https://github.com/krummas/cassandra/commit/fbc47e3b950949a8aa191bc7e91eb6cb396fe6a8 from: https://issues.apache.org/jira/browse/CASSANDRA-9572 I had originally passed over it because we are not using DTCS, but it matches since the upgrade appeared to only drop fully expired sstables. Mike On Sat, Jul 18, 2015 at 3:40 PM, Nate McCall wrote: > Perhaps https://issues.apache.org/jira/browse/CASSANDRA-9592 got > compactions moving forward for you? This would explain the drop. > > However, the discussion on > https://issues.apache.org/jira/browse/CASSANDRA-9683 seems to be similar > to what you saw and that is currently being investigated. > > On Fri, Jul 17, 2015 at 10:24 AM, Mike Heffner wrote: > >> Hi all, >> >> I've been upgrading several of our rings from 2.1.6 to 2.1.8 and I've >> noticed that after the upgrade our storage load drops significantly (I've >> seen up to an 80% drop). >> >> I believe most of the data that is dropped is tombstoned (via TTL >> expiration) and I haven't detected any data loss yet. However, can someone >> point me to what changed between 2.1.6 and 2.1.8 that would lead to such a >> significant drop in tombstoned data? Looking at the changelog there's >> nothing that jumps out at me. This is a CF definition from one of the CFs >> that had a significant drop: >> >> > describe measures_mid_1; >> >> CREATE TABLE "Metrics".measures_mid_1 ( >> key blob, >> c1 int, >> c2 blob, >> c3 blob, >> PRIMARY KEY (key, c1, c2) >> ) WITH COMPACT STORAGE >> AND CLUSTERING ORDER BY (c1 ASC, c2 ASC) >> AND bloom_filter_fp_chance = 0.01 >> AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}' >> AND comment = '' >> AND compaction = {'class': >> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'} >> AND compression = {'sstable_compression': >> 'org.apache.cassandra.io.compress.LZ4Compressor'} >> AND dclocal_read_repair_chance = 0.1 >> AND default_time_to_live = 0 >> AND gc_grace_seconds = 0 >> AND max_index_interval = 2048 >> AND memtable_flush_period_in_ms = 0 >> AND min_index_interval = 128 >> AND read_repair_chance = 0.0 >> AND speculative_retry = '99.0PERCENTILE'; >> >> Thanks, >> >> Mike >> >> -- >> >> Mike Heffner >> Librato, Inc. >> >> > > > -- > - > Nate McCall > Austin, TX > @zznate > > Co-Founder & Sr. Technical Consultant > Apache Cassandra Consulting > http://www.thelastpickle.com > -- Mike Heffner Librato, Inc.
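For anyone wanting to confirm the same effect on their own ring, a hedged sketch using the sstablemetadata tool that ships with Cassandra (the data paths are illustrative):

    # rough estimate of how much of each sstable is already-expired, droppable data
    sstablemetadata /var/lib/cassandra/data/Metrics/measures_mid_1-*/*-Data.db | grep -i droppable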
Debugging write timeouts on Cassandra 2.2.5
Hi all, We've recently embarked on a project to update our Cassandra infrastructure running on EC2. We are long time users of 2.0.x and are testing out a move to version 2.2.5 running on VPC with EBS. Our test setup is a 3 node, RF=3 cluster supporting a small write load (mirror of our staging load). We are writing at QUORUM and while p95's look good compared to our staging 2.0.x cluster, we are seeing frequent write operations that time out at the max write_request_timeout_in_ms (10 seconds). CPU across the cluster is < 10% and EBS write load is < 100 IOPS. Cassandra is running with the Oracle JDK 8u60 and we're using G1GC and any GC pauses are less than 500ms. We run on c4.2xl instances with GP2 EBS attached storage for data and commitlog directories. The nodes are using EC2 enhanced networking and have the latest Intel network driver module. We are running on HVM instances using Ubuntu 14.04.2. Our schema is 5 tables, all with COMPACT STORAGE. Each table is similar to the definition here: https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a This is our cassandra.yaml: https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml Like I mentioned we use 8u60 with G1GC and have used many of the GC settings in Al Tobey's tuning guide. This is our upstart config with JVM and other CPU settings: https://gist.github.com/mheffner/dc44613620b25c4fa46d We've used several of the sysctl settings from Al's guide as well: https://gist.github.com/mheffner/ea40d58f58a517028152 Our client application is able to write using either Thrift batches using Asytanax driver or CQL async INSERT's using the Datastax Java driver. For testing against Thrift (our legacy infra uses this) we write batches of anywhere from 6 to 1500 rows at a time. Our p99 for batch execution is around 45ms but our maximum (p100) sits less than 150ms except when it periodically spikes to the full 10seconds. Testing the same write path using CQL writes instead demonstrates similar behavior. Low p99s except for periodic full timeouts. We enabled tracing for several operations but were unable to get a trace that completed successfully -- Cassandra started logging many messages as: INFO [ScheduledTasks:1] - MessagingService.java:946 - _TRACE messages were dropped in last 5000 ms: 52499 for internal timeout and 0 for cross node timeout And all the traces contained rows with a "null" source_elapsed row: https://gist.githubusercontent.com/mheffner/1d68a70449bd6688a010/raw/0327d7d3d94c3a93af02b64212e3b7e7d8f2911b/trace.out We've exhausted as many configuration option permutations that we can think of. This cluster does not appear to be under any significant load and latencies seem to largely fall in two bands: low normal or max timeout. This seems to imply that something is getting stuck and timing out at the max write timeout. Any suggestions on what to look for? We had debug enabled for awhile but we didn't see any msg that pointed to something obvious. Happy to provide any more information that may help. We are pretty much at the point of sprinkling debug around the code to track down what could be blocking. Thanks, Mike -- Mike Heffner Librato, Inc.
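A few low-impact, runtime-only checks that can help narrow down this kind of tail latency; a sketch, not a diagnosis:

    # trace a small sample of live traffic instead of tracing individual queries
    nodetool settraceprobability 0.001
    # dropped messages and backed-up stages around the time of a 10s write
    nodetool tpstats
    # coordinator-level latency percentiles as Cassandra itself measures them
    nodetool proxyhistograms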
Re: Debugging write timeouts on Cassandra 2.2.5
Paulo, Thanks for the suggestion, we ran some tests against CMS and saw the same timeouts. On that note though, we are going to try doubling the instance sizes and testing with double the heap (even though current usage is low). Mike On Wed, Feb 10, 2016 at 3:40 PM, Paulo Motta wrote: > Are you using the same GC settings as the staging 2.0 cluster? If not, > could you try using the default GC settings (CMS) and see if that changes > anything? This is just a wild guess, but there were reports before of > G1-caused instabilities with small heap sizes (< 16GB - see CASSANDRA-10403 > for more context). Please ignore if you already tried reverting back to CMS. > > 2016-02-10 16:51 GMT-03:00 Mike Heffner : > >> Hi all, >> >> We've recently embarked on a project to update our Cassandra >> infrastructure running on EC2. We are long time users of 2.0.x and are >> testing out a move to version 2.2.5 running on VPC with EBS. Our test setup >> is a 3 node, RF=3 cluster supporting a small write load (mirror of our >> staging load). >> >> We are writing at QUORUM and while p95's look good compared to our >> staging 2.0.x cluster, we are seeing frequent write operations that time >> out at the max write_request_timeout_in_ms (10 seconds). CPU across the >> cluster is < 10% and EBS write load is < 100 IOPS. Cassandra is running >> with the Oracle JDK 8u60 and we're using G1GC and any GC pauses are less >> than 500ms. >> >> We run on c4.2xl instances with GP2 EBS attached storage for data and >> commitlog directories. The nodes are using EC2 enhanced networking and have >> the latest Intel network driver module. We are running on HVM instances >> using Ubuntu 14.04.2. >> >> Our schema is 5 tables, all with COMPACT STORAGE. Each table is similar >> to the definition here: >> https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a >> >> This is our cassandra.yaml: >> https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml >> >> Like I mentioned we use 8u60 with G1GC and have used many of the GC >> settings in Al Tobey's tuning guide. This is our upstart config with JVM >> and other CPU settings: >> https://gist.github.com/mheffner/dc44613620b25c4fa46d >> >> We've used several of the sysctl settings from Al's guide as well: >> https://gist.github.com/mheffner/ea40d58f58a517028152 >> >> Our client application is able to write using either Thrift batches using >> Asytanax driver or CQL async INSERT's using the Datastax Java driver. >> >> For testing against Thrift (our legacy infra uses this) we write batches >> of anywhere from 6 to 1500 rows at a time. Our p99 for batch execution is >> around 45ms but our maximum (p100) sits less than 150ms except when it >> periodically spikes to the full 10seconds. >> >> Testing the same write path using CQL writes instead demonstrates similar >> behavior. Low p99s except for periodic full timeouts. We enabled tracing >> for several operations but were unable to get a trace that completed >> successfully -- Cassandra started logging many messages as: >> >> INFO [ScheduledTasks:1] - MessagingService.java:946 - _TRACE messages >> were dropped in last 5000 ms: 52499 for internal timeout and 0 for cross >> node timeout >> >> And all the traces contained rows with a "null" source_elapsed row: >> https://gist.githubusercontent.com/mheffner/1d68a70449bd6688a010/raw/0327d7d3d94c3a93af02b64212e3b7e7d8f2911b/trace.out >> >> >> We've exhausted as many configuration option permutations that we can >> think of. 
This cluster does not appear to be under any significant load and >> latencies seem to largely fall in two bands: low normal or max timeout. >> This seems to imply that something is getting stuck and timing out at the >> max write timeout. >> >> Any suggestions on what to look for? We had debug enabled for awhile but >> we didn't see any msg that pointed to something obvious. Happy to provide >> any more information that may help. >> >> We are pretty much at the point of sprinkling debug around the code to >> track down what could be blocking. >> >> >> Thanks, >> >> Mike >> >> -- >> >> Mike Heffner >> Librato, Inc. >> >> > -- Mike Heffner Librato, Inc.
Re: Debugging write timeouts on Cassandra 2.2.5
Jeff, We have both commitlog and data on a 4TB EBS with 10k IOPS. Mike On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa wrote: > What disk size are you using? > > > > From: Mike Heffner > Reply-To: "user@cassandra.apache.org" > Date: Wednesday, February 10, 2016 at 2:24 PM > To: "user@cassandra.apache.org" > Cc: Peter Norton > Subject: Re: Debugging write timeouts on Cassandra 2.2.5 > > Paulo, > > Thanks for the suggestion, we ran some tests against CMS and saw the same > timeouts. On that note though, we are going to try doubling the instance > sizes and testing with double the heap (even though current usage is low). > > Mike > > On Wed, Feb 10, 2016 at 3:40 PM, Paulo Motta > wrote: > >> Are you using the same GC settings as the staging 2.0 cluster? If not, >> could you try using the default GC settings (CMS) and see if that changes >> anything? This is just a wild guess, but there were reports before of >> G1-caused instabilities with small heap sizes (< 16GB - see CASSANDRA-10403 >> for more context). Please ignore if you already tried reverting back to CMS. >> >> 2016-02-10 16:51 GMT-03:00 Mike Heffner : >> >>> Hi all, >>> >>> We've recently embarked on a project to update our Cassandra >>> infrastructure running on EC2. We are long time users of 2.0.x and are >>> testing out a move to version 2.2.5 running on VPC with EBS. Our test setup >>> is a 3 node, RF=3 cluster supporting a small write load (mirror of our >>> staging load). >>> >>> We are writing at QUORUM and while p95's look good compared to our >>> staging 2.0.x cluster, we are seeing frequent write operations that time >>> out at the max write_request_timeout_in_ms (10 seconds). CPU across the >>> cluster is < 10% and EBS write load is < 100 IOPS. Cassandra is running >>> with the Oracle JDK 8u60 and we're using G1GC and any GC pauses are less >>> than 500ms. >>> >>> We run on c4.2xl instances with GP2 EBS attached storage for data and >>> commitlog directories. The nodes are using EC2 enhanced networking and have >>> the latest Intel network driver module. We are running on HVM instances >>> using Ubuntu 14.04.2. >>> >>> Our schema is 5 tables, all with COMPACT STORAGE. Each table is similar >>> to the definition here: >>> https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a >>> >>> This is our cassandra.yaml: >>> https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml >>> >>> Like I mentioned we use 8u60 with G1GC and have used many of the GC >>> settings in Al Tobey's tuning guide. This is our upstart config with JVM >>> and other CPU settings: >>> https://gist.github.com/mheffner/dc44613620b25c4fa46d >>> >>> We've used several of the sysctl settings from Al's guide as well: >>> https://gist.github.com/mheffner/ea40d58f58a517028152 >>> >>> Our client application is able to write using either Thrift batches >>> using Asytanax driver or CQL async INSERT's using the Datastax Java driver. >>> >>> For testing against Thrift (our legacy infra uses this) we write batches >>> of anywhere from 6 to 1500 rows at a time. Our p99 for batch execution is >>> around 45ms but our maximum (p100) sits less than 150ms except when it >>> periodically spikes to the full 10seconds. >>> >>> Testing the same write path using CQL writes instead demonstrates >>> similar behavior. Low p99s except for periodic full timeouts. 
We enabled >>> tracing for several operations but were unable to get a trace that >>> completed successfully -- Cassandra started logging many messages as: >>> >>> INFO [ScheduledTasks:1] - MessagingService.java:946 - _TRACE messages >>> were dropped in last 5000 ms: 52499 for internal timeout and 0 for cross >>> node timeout >>> >>> And all the traces contained rows with a "null" source_elapsed row: >>> https://gist.githubusercontent.com/mheffner/1d68a70449bd6688a010/raw/0327d7d3d94c3a93af02b64212e3b7e7d8f2911b/trace.out >>> >>> >>> We've exhausted as many configuration option permutations that we can >>> think of. This cluster does not appear to be under any significant load and >>> latencies seem to largely fall in two bands: low normal or max timeout. >>> This seems to imply that something is getting stuck and timing out at the >>> max write timeout. >>> >>> Any suggestions on what to look for? We had debug enabled for awhile but >>> we didn't see any msg that pointed to something obvious. Happy to provide >>> any more information that may help. >>> >>> We are pretty much at the point of sprinkling debug around the code to >>> track down what could be blocking. >>> >>> >>> Thanks, >>> >>> Mike >>> >>> -- >>> >>> Mike Heffner >>> Librato, Inc. >>> >>> >> > > > -- > > Mike Heffner > Librato, Inc. > > -- Mike Heffner Librato, Inc.
Re: Debugging write timeouts on Cassandra 2.2.5
Jaydeep, No, we don't use any light weight transactions. Mike On Wed, Feb 17, 2016 at 6:44 PM, Jaydeep Chovatia < chovatia.jayd...@gmail.com> wrote: > Are you guys using light weight transactions in your write path? > > On Thu, Feb 11, 2016 at 12:36 AM, Fabrice Facorat < > fabrice.faco...@gmail.com> wrote: > >> Are your commitlog and data on the same disk ? If yes, you should put >> commitlogs on a separate disk which don't have a lot of IO. >> >> Others IO may have great impact impact on your commitlog writing and >> it may even block. >> >> An example of impact IO may have, even for Async writes: >> >> https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic >> >> 2016-02-11 0:31 GMT+01:00 Mike Heffner : >> > Jeff, >> > >> > We have both commitlog and data on a 4TB EBS with 10k IOPS. >> > >> > Mike >> > >> > On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa > > >> > wrote: >> >> >> >> What disk size are you using? >> >> >> >> >> >> >> >> From: Mike Heffner >> >> Reply-To: "user@cassandra.apache.org" >> >> Date: Wednesday, February 10, 2016 at 2:24 PM >> >> To: "user@cassandra.apache.org" >> >> Cc: Peter Norton >> >> Subject: Re: Debugging write timeouts on Cassandra 2.2.5 >> >> >> >> Paulo, >> >> >> >> Thanks for the suggestion, we ran some tests against CMS and saw the >> same >> >> timeouts. On that note though, we are going to try doubling the >> instance >> >> sizes and testing with double the heap (even though current usage is >> low). >> >> >> >> Mike >> >> >> >> On Wed, Feb 10, 2016 at 3:40 PM, Paulo Motta > > >> >> wrote: >> >>> >> >>> Are you using the same GC settings as the staging 2.0 cluster? If not, >> >>> could you try using the default GC settings (CMS) and see if that >> changes >> >>> anything? This is just a wild guess, but there were reports before of >> >>> G1-caused instabilities with small heap sizes (< 16GB - see >> CASSANDRA-10403 >> >>> for more context). Please ignore if you already tried reverting back >> to CMS. >> >>> >> >>> 2016-02-10 16:51 GMT-03:00 Mike Heffner : >> >>>> >> >>>> Hi all, >> >>>> >> >>>> We've recently embarked on a project to update our Cassandra >> >>>> infrastructure running on EC2. We are long time users of 2.0.x and >> are >> >>>> testing out a move to version 2.2.5 running on VPC with EBS. Our >> test setup >> >>>> is a 3 node, RF=3 cluster supporting a small write load (mirror of >> our >> >>>> staging load). >> >>>> >> >>>> We are writing at QUORUM and while p95's look good compared to our >> >>>> staging 2.0.x cluster, we are seeing frequent write operations that >> time out >> >>>> at the max write_request_timeout_in_ms (10 seconds). CPU across the >> cluster >> >>>> is < 10% and EBS write load is < 100 IOPS. Cassandra is running with >> the >> >>>> Oracle JDK 8u60 and we're using G1GC and any GC pauses are less than >> 500ms. >> >>>> >> >>>> We run on c4.2xl instances with GP2 EBS attached storage for data and >> >>>> commitlog directories. The nodes are using EC2 enhanced networking >> and have >> >>>> the latest Intel network driver module. We are running on HVM >> instances >> >>>> using Ubuntu 14.04.2. >> >>>> >> >>>> Our schema is 5 tables, all with COMPACT STORAGE. 
Each table is >> similar >> >>>> to the definition here: >> >>>> https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a >> >>>> >> >>>> This is our cassandra.yaml: >> >>>> >> https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml >> >>>> >> >>>> Like I mentioned we use 8u60 with G1GC and have used many of the GC >> >>>> settings in Al Tobey's tuning guide. This is our upstart config with >> JVM and >> >>>> other CPU settings: >> https://gist.github.com
Re: Debugging write timeouts on Cassandra 2.2.5
Following up from our earlier post... We have continued to do exhaustive testing and measuring of the numerous hardware and configuration variables here. What we have uncovered is that on identical hardware (including the configuration we run in production), something between versions 2.0.17 and 2.1.13 introduced this write timeout for our workload. We still aren't any closer to identifying the what or why, but it is easily reproduced using our workload when we bump to the 2.1.x release line. At the moment we are going to focus on hardening this new hardware configuration using the 2.0.17 release and roll it out internally to some of our production rings. We also want to bisect the 2.1.x release line to find if there was a particular point release that introduced the timeout. If anyone has suggestions for particular changes to look out for we'd be happy to focus a test on that earlier. Thanks, Mike On Wed, Feb 10, 2016 at 2:51 PM, Mike Heffner wrote: > Hi all, > > We've recently embarked on a project to update our Cassandra > infrastructure running on EC2. We are long time users of 2.0.x and are > testing out a move to version 2.2.5 running on VPC with EBS. Our test setup > is a 3 node, RF=3 cluster supporting a small write load (mirror of our > staging load). > > We are writing at QUORUM and while p95's look good compared to our staging > 2.0.x cluster, we are seeing frequent write operations that time out at the > max write_request_timeout_in_ms (10 seconds). CPU across the cluster is < > 10% and EBS write load is < 100 IOPS. Cassandra is running with the Oracle > JDK 8u60 and we're using G1GC and any GC pauses are less than 500ms. > > We run on c4.2xl instances with GP2 EBS attached storage for data and > commitlog directories. The nodes are using EC2 enhanced networking and have > the latest Intel network driver module. We are running on HVM instances > using Ubuntu 14.04.2. > > Our schema is 5 tables, all with COMPACT STORAGE. Each table is similar to > the definition here: https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a > > This is our cassandra.yaml: > https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml > > Like I mentioned we use 8u60 with G1GC and have used many of the GC > settings in Al Tobey's tuning guide. This is our upstart config with JVM > and other CPU settings: > https://gist.github.com/mheffner/dc44613620b25c4fa46d > > We've used several of the sysctl settings from Al's guide as well: > https://gist.github.com/mheffner/ea40d58f58a517028152 > > Our client application is able to write using either Thrift batches using > Asytanax driver or CQL async INSERT's using the Datastax Java driver. > > For testing against Thrift (our legacy infra uses this) we write batches > of anywhere from 6 to 1500 rows at a time. Our p99 for batch execution is > around 45ms but our maximum (p100) sits less than 150ms except when it > periodically spikes to the full 10seconds. > > Testing the same write path using CQL writes instead demonstrates similar > behavior. Low p99s except for periodic full timeouts. 
We enabled tracing > for several operations but were unable to get a trace that completed > successfully -- Cassandra started logging many messages as: > > INFO [ScheduledTasks:1] - MessagingService.java:946 - _TRACE messages > were dropped in last 5000 ms: 52499 for internal timeout and 0 for cross > node timeout > > And all the traces contained rows with a "null" source_elapsed row: > https://gist.githubusercontent.com/mheffner/1d68a70449bd6688a010/raw/0327d7d3d94c3a93af02b64212e3b7e7d8f2911b/trace.out > > > We've exhausted as many configuration option permutations that we can > think of. This cluster does not appear to be under any significant load and > latencies seem to largely fall in two bands: low normal or max timeout. > This seems to imply that something is getting stuck and timing out at the > max write timeout. > > Any suggestions on what to look for? We had debug enabled for awhile but > we didn't see any msg that pointed to something obvious. Happy to provide > any more information that may help. > > We are pretty much at the point of sprinkling debug around the code to > track down what could be blocking. > > > Thanks, > > Mike > > -- > > Mike Heffner > Librato, Inc. > > -- Mike Heffner Librato, Inc.
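A sketch of how point releases can be pinned on the test nodes for that bisection, assuming they install from a Debian repository that still carries the older 2.1.x builds (the version string is only an example):

    # list the point releases the configured repo offers
    apt-cache madison cassandra
    # install a specific candidate on the test ring, then replay the write workload
    sudo apt-get install cassandra=2.1.7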
Re: Debugging write timeouts on Cassandra 2.2.5
Alain, Thanks for the suggestions. Sure, tpstats are here: https://gist.github.com/mheffner/a979ae1a0304480b052a. Looking at the metrics across the ring, there were no blocked tasks nor dropped messages. Iowait metrics look fine, so it doesn't appear to be blocking on disk. Similarly, there are no long GC pauses. We haven't noticed latency on any particular table higher than others or correlated around the occurrence of a timeout. We have noticed with further testing that running cassandra-stress against the ring, while our workload is writing to the same ring, will incur similar 10 second timeouts. If our workload is not writing to the ring, cassandra stress will run without hitting timeouts. This seems to imply that our workload pattern is causing something to block cluster-wide, since the stress tool writes to a different keyspace then our workload. I mentioned in another reply that we've tracked it to something between 2.0.x and 2.1.x, so we are focusing on narrowing which point release it was introduced in. Cheers, Mike On Thu, Feb 18, 2016 at 3:33 AM, Alain RODRIGUEZ wrote: > Hi Mike, > > What about the output of tpstats ? I imagine you have dropped messages > there. Any blocked threads ? Could you paste this output here ? > > May this be due to some network hiccup to access the disks as they are EBS > ? Can you think of anyway of checking this ? Do you have a lot of GC logs, > how long are the pauses (use something like: grep -i 'GCInspector' > /var/log/cassandra/system.log) ? > > Something else you could check are local_writes stats to see if only one > table if affected or this is keyspace / cluster wide. You can use metrics > exposed by cassandra or if you have no dashboards I believe a: 'nodetool > cfstats | grep -e 'Table:' -e 'Local'' should give you a rough idea > of local latencies. > > Those are just things I would check, I have not a clue on what is > happening here, hope this will help. > > C*heers, > - > Alain Rodriguez > France > > The Last Pickle > http://www.thelastpickle.com > > 2016-02-18 5:13 GMT+01:00 Mike Heffner : > >> Jaydeep, >> >> No, we don't use any light weight transactions. >> >> Mike >> >> On Wed, Feb 17, 2016 at 6:44 PM, Jaydeep Chovatia < >> chovatia.jayd...@gmail.com> wrote: >> >>> Are you guys using light weight transactions in your write path? >>> >>> On Thu, Feb 11, 2016 at 12:36 AM, Fabrice Facorat < >>> fabrice.faco...@gmail.com> wrote: >>> >>>> Are your commitlog and data on the same disk ? If yes, you should put >>>> commitlogs on a separate disk which don't have a lot of IO. >>>> >>>> Others IO may have great impact impact on your commitlog writing and >>>> it may even block. >>>> >>>> An example of impact IO may have, even for Async writes: >>>> >>>> https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic >>>> >>>> 2016-02-11 0:31 GMT+01:00 Mike Heffner : >>>> > Jeff, >>>> > >>>> > We have both commitlog and data on a 4TB EBS with 10k IOPS. >>>> > >>>> > Mike >>>> > >>>> > On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa < >>>> jeff.ji...@crowdstrike.com> >>>> > wrote: >>>> >> >>>> >> What disk size are you using? 
>>>> >> >>>> >> >>>> >> >>>> >> From: Mike Heffner >>>> >> Reply-To: "user@cassandra.apache.org" >>>> >> Date: Wednesday, February 10, 2016 at 2:24 PM >>>> >> To: "user@cassandra.apache.org" >>>> >> Cc: Peter Norton >>>> >> Subject: Re: Debugging write timeouts on Cassandra 2.2.5 >>>> >> >>>> >> Paulo, >>>> >> >>>> >> Thanks for the suggestion, we ran some tests against CMS and saw the >>>> same >>>> >> timeouts. On that note though, we are going to try doubling the >>>> instance >>>> >> sizes and testing with double the heap (even though current usage is >>>> low). >>>> >> >>>> >> Mike >>>> >> >>>> >> On Wed, Feb 10, 2016 at 3:40 PM, Paulo Motta < >>>> pauloricard...@gmail.com> >>>> >> wrote: >>>> >>> >>>> >>> Are you using the same GC settings a
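For anyone trying to reproduce the cross-keyspace interaction, an illustrative cassandra-stress invocation; it writes to the tool's own default keyspace1.standard1 table, and the host and thread count are placeholders:

    cassandra-stress write n=1000000 cl=quorum -rate threads=50 -node 10.0.0.10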
Re: Debugging write timeouts on Cassandra 2.2.5
Anuj, So we originally started testing with Java8 + G1, however we were able to reproduce the same results with the default CMS settings that ship in the cassandra-env.sh from the Deb pkg. We didn't detect any large GC pauses during the runs. Query pattern during our testing was 100% writes, batching (via Thrift mostly) to 5 tables, between 6-1500 rows per batch. Mike On Thu, Feb 18, 2016 at 12:22 PM, Anuj Wadehra wrote: > Whats the GC overhead? Can you your share your GC collector and settings ? > > > Whats your query pattern? Do you use secondary indexes, batches, in clause > etc? > > > Anuj > > > Sent from Yahoo Mail on Android > <https://overview.mail.yahoo.com/mobile/?.src=Android> > > On Thu, 18 Feb, 2016 at 8:45 pm, Mike Heffner > wrote: > Alain, > > Thanks for the suggestions. > > Sure, tpstats are here: > https://gist.github.com/mheffner/a979ae1a0304480b052a. Looking at the > metrics across the ring, there were no blocked tasks nor dropped messages. > > Iowait metrics look fine, so it doesn't appear to be blocking on disk. > Similarly, there are no long GC pauses. > > We haven't noticed latency on any particular table higher than others or > correlated around the occurrence of a timeout. We have noticed with further > testing that running cassandra-stress against the ring, while our workload > is writing to the same ring, will incur similar 10 second timeouts. If our > workload is not writing to the ring, cassandra stress will run without > hitting timeouts. This seems to imply that our workload pattern is causing > something to block cluster-wide, since the stress tool writes to a > different keyspace then our workload. > > I mentioned in another reply that we've tracked it to something between > 2.0.x and 2.1.x, so we are focusing on narrowing which point release it was > introduced in. > > Cheers, > > Mike > > On Thu, Feb 18, 2016 at 3:33 AM, Alain RODRIGUEZ > wrote: > >> Hi Mike, >> >> What about the output of tpstats ? I imagine you have dropped messages >> there. Any blocked threads ? Could you paste this output here ? >> >> May this be due to some network hiccup to access the disks as they are >> EBS ? Can you think of anyway of checking this ? Do you have a lot of GC >> logs, how long are the pauses (use something like: grep -i 'GCInspector' >> /var/log/cassandra/system.log) ? >> >> Something else you could check are local_writes stats to see if only one >> table if affected or this is keyspace / cluster wide. You can use metrics >> exposed by cassandra or if you have no dashboards I believe a: 'nodetool >> cfstats | grep -e 'Table:' -e 'Local'' should give you a rough idea >> of local latencies. >> >> Those are just things I would check, I have not a clue on what is >> happening here, hope this will help. >> >> C*heers, >> - >> Alain Rodriguez >> France >> >> The Last Pickle >> http://www.thelastpickle.com >> >> 2016-02-18 5:13 GMT+01:00 Mike Heffner : >> >>> Jaydeep, >>> >>> No, we don't use any light weight transactions. >>> >>> Mike >>> >>> On Wed, Feb 17, 2016 at 6:44 PM, Jaydeep Chovatia < >>> chovatia.jayd...@gmail.com> wrote: >>> >>>> Are you guys using light weight transactions in your write path? >>>> >>>> On Thu, Feb 11, 2016 at 12:36 AM, Fabrice Facorat < >>>> fabrice.faco...@gmail.com> wrote: >>>> >>>>> Are your commitlog and data on the same disk ? If yes, you should put >>>>> commitlogs on a separate disk which don't have a lot of IO. >>>>> >>>>> Others IO may have great impact impact on your commitlog writing and >>>>> it may even block. 
>>>>> >>>>> An example of impact IO may have, even for Async writes: >>>>> >>>>> https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic >>>>> >>>>> 2016-02-11 0:31 GMT+01:00 Mike Heffner : >>>>> > Jeff, >>>>> > >>>>> > We have both commitlog and data on a 4TB EBS with 10k IOPS. >>>>> > >>>>> > Mike >>>>> > >>>>> > On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa < >>>>> jeff.ji...@crowdstrike.com> >>>>> > wrote: >>>>> >> >>>>> >> What disk size are you using?
Re: Debugging write timeouts on Cassandra 2.2.5
Nate, So we have run several install tests, bisecting the 2.1.x release line, and we believe that the regression was introduced in version 2.1.5. This is the first release that clearly hits the timeout for us. It looks like quite a large release, so our next step will likely be bisecting the major commits to see if we can narrow it down: https://github.com/apache/cassandra/blob/3c0a337ebc90b0d99349d0aa152c92b5b3494d8c/CHANGES.txt. Obviously, any suggestions on potential suspects appreciated. These are the memtable settings we've configured diff from the defaults during our testing: memtable_allocation_type: offheap_objects memtable_flush_writers: 8 Cheers, Mike On Fri, Feb 19, 2016 at 1:46 PM, Nate McCall wrote: > The biggest change which *might* explain your behavior has to do with the > changes in memtable flushing between 2.0 and 2.1: > https://issues.apache.org/jira/browse/CASSANDRA-5549 > > However, the tpstats you posted shows no dropped mutations which would > make me more certain of this as the cause. > > What values do you have right now for each of these (my recommendations > for each on a c4.2xl with stock cassandra-env.sh are in parenthesis): > > - memtable_flush_writers (2) > - memtable_heap_space_in_mb (2048) > - memtable_offheap_space_in_mb (2048) > - memtable_cleanup_threshold (0.11) > - memtable_allocation_type (offheap_objects) > > The biggest win IMO will be moving to offheap_objects. By default, > everything is on heap. Regardless, spending some time tuning these for your > workload will pay off. > > You may also want to be explicit about > > - native_transport_max_concurrent_connections > - native_transport_max_concurrent_connections_per_ip > > Depending on the driver, these may now be allowing 32k streams per > connection(!) as detailed in v3 of the native protocol: > > https://github.com/apache/cassandra/blob/cassandra-2.1/doc/native_protocol_v3.spec#L130-L152 > > > > On Fri, Feb 19, 2016 at 8:48 AM, Mike Heffner wrote: > >> Anuj, >> >> So we originally started testing with Java8 + G1, however we were able to >> reproduce the same results with the default CMS settings that ship in the >> cassandra-env.sh from the Deb pkg. We didn't detect any large GC pauses >> during the runs. >> >> Query pattern during our testing was 100% writes, batching (via Thrift >> mostly) to 5 tables, between 6-1500 rows per batch. >> >> Mike >> >> On Thu, Feb 18, 2016 at 12:22 PM, Anuj Wadehra >> wrote: >> >>> Whats the GC overhead? Can you your share your GC collector and settings >>> ? >>> >>> >>> Whats your query pattern? Do you use secondary indexes, batches, in >>> clause etc? >>> >>> >>> Anuj >>> >>> >>> Sent from Yahoo Mail on Android >>> <https://overview.mail.yahoo.com/mobile/?.src=Android> >>> >>> On Thu, 18 Feb, 2016 at 8:45 pm, Mike Heffner >>> wrote: >>> Alain, >>> >>> Thanks for the suggestions. >>> >>> Sure, tpstats are here: >>> https://gist.github.com/mheffner/a979ae1a0304480b052a. Looking at the >>> metrics across the ring, there were no blocked tasks nor dropped messages. >>> >>> Iowait metrics look fine, so it doesn't appear to be blocking on disk. >>> Similarly, there are no long GC pauses. >>> >>> We haven't noticed latency on any particular table higher than others or >>> correlated around the occurrence of a timeout. We have noticed with further >>> testing that running cassandra-stress against the ring, while our workload >>> is writing to the same ring, will incur similar 10 second timeouts. 
If our >>> workload is not writing to the ring, cassandra stress will run without >>> hitting timeouts. This seems to imply that our workload pattern is causing >>> something to block cluster-wide, since the stress tool writes to a >>> different keyspace then our workload. >>> >>> I mentioned in another reply that we've tracked it to something between >>> 2.0.x and 2.1.x, so we are focusing on narrowing which point release it was >>> introduced in. >>> >>> Cheers, >>> >>> Mike >>> >>> On Thu, Feb 18, 2016 at 3:33 AM, Alain RODRIGUEZ >>> wrote: >>> >>>> Hi Mike, >>>> >>>> What about the output of tpstats ? I imagine you have dropped messages >>>> there. Any blocked threads ? Could you paste this output here ? >>>> >>>> May this be due to some network hiccup to access the dis
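For anyone following this thread, Nate's recommendations above map onto cassandra.yaml roughly as below. These are his suggested starting points for a c4.2xl, not values validated on our cluster, so treat them as a tuning baseline rather than a prescription:

    memtable_flush_writers: 2
    memtable_heap_space_in_mb: 2048
    memtable_offheap_space_in_mb: 2048
    memtable_cleanup_threshold: 0.11
    memtable_allocation_type: offheap_objects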
Re: Consistent read timeouts for bursts of reads
Emils, I realize this may be a big downgrade, but are you timeouts reproducible under Cassandra 2.1.4? Mike On Thu, Feb 25, 2016 at 10:34 AM, Emīls Šolmanis wrote: > Having had a read through the archives, I missed this at first, but this > seems to be *exactly* like what we're experiencing. > > http://www.mail-archive.com/user@cassandra.apache.org/msg46064.html > > Only difference is we're getting this for reads and using CQL, but the > behaviour is identical. > > On Thu, 25 Feb 2016 at 14:55 Emīls Šolmanis > wrote: > >> Hello, >> >> We're having a problem with concurrent requests. It seems that whenever >> we try resolving more >> than ~ 15 queries at the same time, one or two get a read timeout and >> then succeed on a retry. >> >> We're running Cassandra 2.2.4 accessed via the 2.1.9 Datastax driver on >> AWS. >> >> What we've found while investigating: >> >> * this is not db-wide. Trying the same pattern against another table >> everything works fine. >> * it fails 1 or 2 requests regardless of how many are executed in >> parallel, i.e., it's still 1 or 2 when we ramp it up to ~ 120 concurrent >> requests and doesn't seem to scale up. >> * the problem is consistently reproducible. It happens both under >> heavier load and when just firing off a single batch of requests for >> testing. >> * tracing the faulty requests says everything is great. An example >> trace: https://gist.github.com/emilssolmanis/41e1e2ecdfd9a0569b1a >> * the only peculiar thing in the logs is there's no acknowledgement of >> the request being accepted by the server, as seen in >> https://gist.github.com/emilssolmanis/242d9d02a6d8fb91da8a >> * there's nothing funny in the timed out Cassandra node's logs around >> that time as far as I can tell, not even in the debug logs. >> >> Any ideas about what might be causing this, pointers to server config >> options, or how else we might debug this would be much appreciated. >> >> Kind regards, >> Emils >> >> -- Mike Heffner Librato, Inc.
Re: Consistent read timeouts for bursts of reads
Emils, We believe we've tracked it down to the following issue: https://issues.apache.org/jira/browse/CASSANDRA-11302, introduced in 2.1.5. We are running a build of 2.2.5 with that patch and so far have not seen any more timeouts. Mike On Fri, Mar 4, 2016 at 3:14 AM, Emīls Šolmanis wrote: > Mike, > > Is that where you've bisected it to having been introduced? > > I'll see what I can do, but doubt it, since we've long since upgraded prod > to 2.2.4 (and stage before that) and the tests I'm running were for a new > feature. > > > On Fri, 4 Mar 2016 03:54 Mike Heffner, wrote: > >> Emils, >> >> I realize this may be a big downgrade, but are you timeouts reproducible >> under Cassandra 2.1.4? >> >> Mike >> >> On Thu, Feb 25, 2016 at 10:34 AM, Emīls Šolmanis < >> emils.solma...@gmail.com> wrote: >> >>> Having had a read through the archives, I missed this at first, but this >>> seems to be *exactly* like what we're experiencing. >>> >>> http://www.mail-archive.com/user@cassandra.apache.org/msg46064.html >>> >>> Only difference is we're getting this for reads and using CQL, but the >>> behaviour is identical. >>> >>> On Thu, 25 Feb 2016 at 14:55 Emīls Šolmanis >>> wrote: >>> >>>> Hello, >>>> >>>> We're having a problem with concurrent requests. It seems that whenever >>>> we try resolving more >>>> than ~ 15 queries at the same time, one or two get a read timeout and >>>> then succeed on a retry. >>>> >>>> We're running Cassandra 2.2.4 accessed via the 2.1.9 Datastax driver on >>>> AWS. >>>> >>>> What we've found while investigating: >>>> >>>> * this is not db-wide. Trying the same pattern against another table >>>> everything works fine. >>>> * it fails 1 or 2 requests regardless of how many are executed in >>>> parallel, i.e., it's still 1 or 2 when we ramp it up to ~ 120 concurrent >>>> requests and doesn't seem to scale up. >>>> * the problem is consistently reproducible. It happens both under >>>> heavier load and when just firing off a single batch of requests for >>>> testing. >>>> * tracing the faulty requests says everything is great. An example >>>> trace: https://gist.github.com/emilssolmanis/41e1e2ecdfd9a0569b1a >>>> * the only peculiar thing in the logs is there's no acknowledgement of >>>> the request being accepted by the server, as seen in >>>> https://gist.github.com/emilssolmanis/242d9d02a6d8fb91da8a >>>> * there's nothing funny in the timed out Cassandra node's logs around >>>> that time as far as I can tell, not even in the debug logs. >>>> >>>> Any ideas about what might be causing this, pointers to server config >>>> options, or how else we might debug this would be much appreciated. >>>> >>>> Kind regards, >>>> Emils >>>> >>>> >> >> >> -- >> >> Mike Heffner >> Librato, Inc. >> >> -- Mike Heffner Librato, Inc.
Migrating data from a 0.8.8 -> 1.1.2 ring
Hi, We are migrating from a 0.8.8 ring to a 1.1.2 ring and we are noticing missing data post-migration. We use pre-built/configured AMIs so our preferred route is to leave our existing production 0.8.8 untouched and bring up a parallel 1.1.2 ring and migrate data into it. Data is written to the rings via batch processes so we can easily assure that both the existing and new rings will have the same data post migration. The ring we are migrating from is: * 12 nodes * single data-center, 3 AZs * 0.8.8 The ring we are migrating to is the same except 1.1.2. The steps we are taking are: 1. Bring up a 1.1.2 ring in the same AZ/data center configuration with tokens matching the corresponding nodes in the 0.8.8 ring. 2. Create the same keyspace on 1.1.2. 3. Create each CF in the keyspace on 1.1.2. 4. Flush each node of the 0.8.8 ring. 5. Rsync each non-compacted sstable from 0.8.8 to the corresponding node in 1.1.2. 6. Move each 0.8.8 sstable into the 1.1.2 directory structure by renaming the file to the /cassandra/data///-... format. For example, for the keyspace "Metrics" and CF "epochs_60" we get: "cassandra/data/Metrics/epochs_60/Metrics-epochs_60-g-941-Data.db". 7. On each 1.1.2 node run `nodetool -h localhost refresh Metrics ` for each CF in the keyspace. We notice that storage load jumps accordingly. 8. On each 1.1.2 node run `nodetool -h localhost upgradesstables`. This takes awhile but appears to correctly rewrite each sstable in the new 1.1.x format. Storage load drops as sstables are compressed. After these steps we run a script that validates data on the new ring. What we've noticed is that large portions of the data that was on the 0.8.8 is not available on the 1.1.2 ring. We've tried reading at both quorum and ONE, but the resulting data appears missing in both cases. We have fewer than 143 million row keys in the CFs we're testing and none of the *-Filter.db files are > 10MB, so I don't believe this is our problem: https://issues.apache.org/jira/browse/CASSANDRA-3820 Anything else to test verify? Are the steps above correct for this type of upgrade? Is this type of upgrade/migration supported? We have also tried running a repair across the cluster after step #8. While it took a few retries due to https://issues.apache.org/jira/browse/CASSANDRA-4456, we still had missing data afterwards. Any assistance would be appreciated. Thanks! Mike -- Mike Heffner Librato, Inc.
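To make steps 4-8 concrete, this is roughly what we run per node. The paths, file names and generation numbers below are examples only, and every sstable component (-Data.db, -Index.db, -Filter.db, ...) needs the same copy/rename treatment:

    # on the 0.8.8 node, after nodetool flush (steps 4-5)
    rsync -av /var/lib/cassandra/data/Metrics/epochs_60-g-941-* \
        new-node:/raid0/cassandra/data/Metrics/epochs_60/

    # on the matching 1.1.2 node, rename into the 1.1 naming scheme (step 6)
    cd /raid0/cassandra/data/Metrics/epochs_60
    for f in epochs_60-g-*; do mv "$f" "Metrics-$f"; done

    # load the files and rewrite them in the new format (steps 7-8)
    nodetool -h localhost refresh Metrics epochs_60
    nodetool -h localhost upgradesstables Metrics epochs_60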
Re: Migrating data from a 0.8.8 -> 1.1.2 ring
On Mon, Jul 23, 2012 at 1:25 PM, Mike Heffner wrote: > Hi, > > We are migrating from a 0.8.8 ring to a 1.1.2 ring and we are noticing > missing data post-migration. We use pre-built/configured AMIs so our > preferred route is to leave our existing production 0.8.8 untouched and > bring up a parallel 1.1.2 ring and migrate data into it. Data is written to > the rings via batch processes so we can easily assure that both the > existing and new rings will have the same data post migration. > > > The steps we are taking are: > > 1. Bring up a 1.1.2 ring in the same AZ/data center configuration with > tokens matching the corresponding nodes in the 0.8.8 ring. > 2. Create the same keyspace on 1.1.2. > 3. Create each CF in the keyspace on 1.1.2. > 4. Flush each node of the 0.8.8 ring. > 5. Rsync each non-compacted sstable from 0.8.8 to the corresponding node > in 1.1.2. > 6. Move each 0.8.8 sstable into the 1.1.2 directory structure by renaming > the file to the /cassandra/data///-... format. > For example, for the keyspace "Metrics" and CF "epochs_60" we get: > "cassandra/data/Metrics/epochs_60/Metrics-epochs_60-g-941-Data.db". > 7. On each 1.1.2 node run `nodetool -h localhost refresh Metrics ` for > each CF in the keyspace. We notice that storage load jumps accordingly. > 8. On each 1.1.2 node run `nodetool -h localhost upgradesstables`. This > takes awhile but appears to correctly rewrite each sstable in the new 1.1.x > format. Storage load drops as sstables are compressed. > > So, after some further testing we've observed that the `upgradesstables` command is removing data from the sstables, leading to our missing data. We've repeated the steps above with several variations: WORKS refresh -> scrub WORKS refresh -> scrub -> major compaction FAILS refresh -> upgradesstables FAILS refresh -> scrub -> upgradesstables FAILS refresh -> scrub -> major compaction -> upgradesstables So, we are able to migrate our test CFs from a 0.8.8 ring to a 1.1.2 ring when we use scrub. However, whenever we run an upgradesstables command the sstables are shrunk significantly and our tests show missing data: INFO [CompactionExecutor:4] 2012-07-24 04:27:36,837 CompactionTask.java (line 109) Compacting [SSTableReader(path='/raid0/cassandra/data/Metrics/metrics_900/Metrics-metrics_900-hd-51-Data.db')] INFO [CompactionExecutor:4] 2012-07-24 04:27:51,090 CompactionTask.java (line 221) Compacted to [/raid0/cassandra/data/Metrics/metrics_900/Metrics-metrics_900-hd-58-Data.db,]. 60,449,155 to 2,578,102 (~4% of original) bytes for 4,002 keys at 0.172562MB/s. Time: 14,248ms. Is there a scenario where upgradesstables would remove data that a scrub command wouldn't? According the documentation, it would appear that the scrub command is actually more destructive than upgradesstables in terms of removing data. On 1.1.x, upgradesstables is the documented upgrade command over a scrub. 
The keyspace is defined as: Keyspace: Metrics: Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy Durable Writes: true Options: [us-east:3] And the column family above defined as: ColumnFamily: metrics_900 Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type Default column value validator: org.apache.cassandra.db.marshal.BytesType Columns sorted by: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type) GC grace seconds: 0 Compaction min/max thresholds: 4/32 Read repair chance: 0.1 DC Local Read repair chance: 0.0 Replicate on write: true Caching: KEYS_ONLY Bloom Filter FP chance: default Built indexes: [] Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy Compression Options: sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor All rows have a TTL of 30 days, so it's possible that, along with the gc_grace=0, a small number would be removed during a compaction/scrub/upgradesstables step. However, the majority should still be kept as their TTL has not expired yet. We are still experimenting to see under what conditions this happens, but I thought I'd send out some more info in case there is something clearly wrong we're doing here. Thanks, Mike -- Mike Heffner Librato, Inc.
Composite Column Slice query, wildcard first component?
Hello, Given a row like this "key1" => (A:A:C), (A:A:B), (B:A:C), (B:C:D) Is there a way to create a slice query that returns all columns where the _second_ component is A? That is, I would like to get back the following columns by asking for columns where component[0] = * and component[1] = A (A:A:C), (A:A:B), (B:A:C) I could do some iteration and figure this out in more of a brute force manner, I'm just curious if there's anything built in that might be more efficient Thanks! Mike
Re: Hinted Handoff runs every ten minutes
Is there a ticket open for this for 1.1.6? We also noticed this after upgrading from 1.1.3 to 1.1.6. Every node runs a 0 row hinted handoff every 10 minutes. N-1 nodes hint to the same node, while that node hints to another node. On Tue, Oct 30, 2012 at 1:35 PM, Vegard Berget wrote: > Hi, > > I have the exact same problem with 1.1.6. HintsColumnFamily consists of > one row (Rowkey 00, nothing more). The "problem" started after upgrading > from 1.1.4 to 1.1.6. Every ten minutes HintedHandoffManager starts and > finishes after sending "0 rows". > > .vegard, > > > > - Original Message - > From: > user@cassandra.apache.org > > To: > > Cc: > > Sent: > Mon, 29 Oct 2012 23:45:30 +0100 > > Subject: > Re: Hinted Handoff runs every ten minutes > > > On 29.10.2012 23:24, Stephen Pierce wrote: > > I'm running 1.1.5; the bug says it's fixed in 1.0.9/1.1.0. > > > > How can I check to see why it keeps running HintedHandoff? > you have a tombstone in system.HintsColumnFamily; use the list command in > cassandra-cli to check > > -- Mike Heffner Librato, Inc.
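For anyone who wants to run the check suggested at the end of that thread, the cassandra-cli session looks roughly like this (assuming the default Thrift port 9160):

    cassandra-cli -h localhost -p 9160
    [default@unknown] use system;
    [default@system] list HintsColumnFamily;

A lone row with key 00 and nothing else in it matches the symptom Vegard describes.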
Re: Upgrade 1.1.2 -> 1.1.6
Alain, We performed a 1.1.3 -> 1.1.6 upgrade and found that all the logs replayed regardless of the drain. After noticing this on the first node, we did the following: * nodetool flush * nodetool drain * service cassandra stop * mv /path/to/logs/*.log /backup/ * apt-get install cassandra I also agree that starting C* after an upgrade/install seems quite broken if it was already stopped before the install. However annoying, I have found this to be the default for most Ubuntu daemon packages. Mike On Thu, Nov 15, 2012 at 9:21 AM, Alain RODRIGUEZ wrote: > We had an issue with counters over-counting even using the nodetool drain > command before upgrading... > > Here is my bash history > >69 cp /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak >70 cp /etc/cassandra/cassandra-env.sh > /etc/cassandra/cassandra-env.sh.bak >71 sudo apt-get install cassandra >72 nodetool disablethrift >73 nodetool drain >74 service cassandra stop >75 cat /etc/cassandra/cassandra-env.sh > /etc/cassandra/cassandra-env.sh.bak >76 vim /etc/cassandra/cassandra-env.sh >77 cat /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak >78 vim /etc/cassandra/cassandra.yaml >79 service cassandra start > > So I think I followed these steps > http://www.datastax.com/docs/1.1/install/upgrading#upgrade-steps > > I merged my conf files with an external tool so consider I merged my conf > files on steps 76 and 78. > > I saw that the "sudo apt-get install cassandra" stop the server and > restart it automatically. So it updated without draining and restart before > I had the time to reconfigure the conf files. Is this "normal" ? Is there a > way to avoid it ? > > So for the second node I decided to try to stop C*before the upgrade. > > 125 cp /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak > 126 cp /etc/cassandra/cassandra-env.sh > /etc/cassandra/cassandra-env.sh.bak > 127 nodetool disablegossip > 128 nodetool disablethrift > 129 nodetool drain > 130 service cassandra stop > 131 sudo apt-get install cassandra > > //131 : This restarted cassandra > > 132 nodetool disablethrift > 133 nodetool disablegossip > 134 nodetool drain > 135 service cassandra stop > 136 cat /etc/cassandra/cassandra-env.sh > /etc/cassandra/cassandra-env.sh.bak > 137 cim /etc/cassandra/cassandra-env.sh > 138 vim /etc/cassandra/cassandra-env.sh > 139 cat /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak > 140 vim /etc/cassandra/cassandra.yaml > 141 service cassandra start > > After both of these updates I saw my current counters increase without any > reason. > > Did I do anything wrong ? > > Alain > > -- Mike Heffner Librato, Inc.
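If you want to keep the package from auto-starting the daemon during the install, one workaround on Debian/Ubuntu is a temporary policy-rc.d. This is a general dpkg/invoke-rc.d mechanism rather than anything the Cassandra package itself documents, so use it at your own discretion:

    # tell invoke-rc.d to deny service starts while we upgrade
    printf '#!/bin/sh\nexit 101\n' | sudo tee /usr/sbin/policy-rc.d
    sudo chmod +x /usr/sbin/policy-rc.d
    sudo apt-get install cassandra
    # re-enable normal service handling afterwards
    sudo rm /usr/sbin/policy-rc.d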
Re: Upgrade 1.1.2 -> 1.1.6
Alain, My understanding is that drain ensures that all memtables are flushed, so that there is no data in the commitlog that is isn't in an sstable. A marker is saved that indicates the commit logs should not be replayed. Commitlogs are only removed from disk periodically (after commitlog_total_space_in_mb is exceeded?). With 1.1.5/6, all nanotime commitlogs are replayed on startup regardless of whether they've been flushed. So in our case manually removing all the commitlogs after a drain was the only way to prevent their replay. Mike On Tue, Nov 20, 2012 at 5:19 AM, Alain RODRIGUEZ wrote: > @Mike > > I am glad to see I am not the only one with this issue (even if I am sorry > it happened to you of course.). > > Isn't drain supposed to clear the commit logs ? Did removing them worked > properly ? > > I his warning to C* users, Jonathan Ellis told that a drain would avoid > this issue, It seems like it doesn't. > > @Rob > > You understood precisely the 2 issues I met during the upgrade. I am sad > to see none of them is yet resolved and probably wont. > > > 2012/11/20 Mike Heffner > >> Alain, >> >> We performed a 1.1.3 -> 1.1.6 upgrade and found that all the logs >> replayed regardless of the drain. After noticing this on the first node, we >> did the following: >> >> * nodetool flush >> * nodetool drain >> * service cassandra stop >> * mv /path/to/logs/*.log /backup/ >> * apt-get install cassandra >> >> >> I also agree that starting C* after an upgrade/install seems quite broken >> if it was already stopped before the install. However annoying, I have >> found this to be the default for most Ubuntu daemon packages. >> >> Mike >> >> >> On Thu, Nov 15, 2012 at 9:21 AM, Alain RODRIGUEZ wrote: >> >>> We had an issue with counters over-counting even using the nodetool >>> drain command before upgrading... >>> >>> Here is my bash history >>> >>>69 cp /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak >>>70 cp /etc/cassandra/cassandra-env.sh >>> /etc/cassandra/cassandra-env.sh.bak >>>71 sudo apt-get install cassandra >>>72 nodetool disablethrift >>>73 nodetool drain >>>74 service cassandra stop >>>75 cat /etc/cassandra/cassandra-env.sh >>> /etc/cassandra/cassandra-env.sh.bak >>>76 vim /etc/cassandra/cassandra-env.sh >>>77 cat /etc/cassandra/cassandra.yaml >>> /etc/cassandra/cassandra.yaml.bak >>>78 vim /etc/cassandra/cassandra.yaml >>>79 service cassandra start >>> >>> So I think I followed these steps >>> http://www.datastax.com/docs/1.1/install/upgrading#upgrade-steps >>> >>> I merged my conf files with an external tool so consider I merged my >>> conf files on steps 76 and 78. >>> >>> I saw that the "sudo apt-get install cassandra" stop the server and >>> restart it automatically. So it updated without draining and restart before >>> I had the time to reconfigure the conf files. Is this "normal" ? Is there a >>> way to avoid it ? >>> >>> So for the second node I decided to try to stop C*before the upgrade. 
>>> >>> 125 cp /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak >>> 126 cp /etc/cassandra/cassandra-env.sh >>> /etc/cassandra/cassandra-env.sh.bak >>> 127 nodetool disablegossip >>> 128 nodetool disablethrift >>> 129 nodetool drain >>> 130 service cassandra stop >>> 131 sudo apt-get install cassandra >>> >>> //131 : This restarted cassandra >>> >>> 132 nodetool disablethrift >>> 133 nodetool disablegossip >>> 134 nodetool drain >>> 135 service cassandra stop >>> 136 cat /etc/cassandra/cassandra-env.sh >>> /etc/cassandra/cassandra-env.sh.bak >>> 137 cim /etc/cassandra/cassandra-env.sh >>> 138 vim /etc/cassandra/cassandra-env.sh >>> 139 cat /etc/cassandra/cassandra.yaml >>> /etc/cassandra/cassandra.yaml.bak >>> 140 vim /etc/cassandra/cassandra.yaml >>> 141 service cassandra start >>> >>> After both of these updates I saw my current counters increase without >>> any reason. >>> >>> Did I do anything wrong ? >>> >>> Alain >>> >>> >> >> >> -- >> >> Mike Heffner >> Librato, Inc. >> >> >> > -- Mike Heffner Librato, Inc.
Re: Upgrade 1.1.2 -> 1.1.6
On Tue, Nov 20, 2012 at 2:49 PM, Rob Coli wrote: > On Mon, Nov 19, 2012 at 7:18 PM, Mike Heffner wrote: > > We performed a 1.1.3 -> 1.1.6 upgrade and found that all the logs > replayed > > regardless of the drain. > > Your experience and desire for different (expected) behavior is welcomed > on : > > https://issues.apache.org/jira/browse/CASSANDRA-4446 > > "nodetool drain sometimes doesn't mark commitlog fully flushed" > > If every production operator who experiences this issue shares their > experience on this bug, perhaps the project will acknowledge and > address it. > > Well in this case I think our issue was that upgrading from nanotime->epoch seconds, by definition, replays all commit logs. That's not due to any specific problem with nodetool drain not marking commitlog's flushed, but a safety to ensure data is not lost due to buggy nanotime implementations. For us, it was that the upgrade instructions pre-1.1.5->1.1.6 didn't mention that CL's should be removed if successfully drained. On the other hand, we do not use counters so replaying them was merely a much longer MTT-Return after restarting with 1.1.6. Mike -- Mike Heffner Librato, Inc.
Does a scrub remove deleted/expired columns?
I'm using 1.0.12 and I find that large sstables tend to get compacted infrequently. I've got data that gets deleted or expired frequently. Is it possible to use scrub to accelerate the clean up of expired/deleted data? -- Mike Smith Director Development, MailChannels
Re: Does a scrub remove deleted/expired columns?
Thanks for the great explanation. I'd just like some clarification on the last point. Is it the case that if I constantly add new columns to a row, while periodically trimming the row by by deleting the oldest columns, the deleted columns won't get cleaned up until all fragments of the row exist in a single sstable and that sstable undergoes a compaction? If my understanding is correct, do you know if 1.2 will enable cleanup of columns in rows that have scattered fragments? Or, should I take a different approach? On Thu, Dec 13, 2012 at 5:52 PM, aaron morton wrote: > Is it possible to use scrub to accelerate the clean up of expired/deleted > data? > > No. > Scrub, and upgradesstables, are used to re-write each file on disk. Scrub > may remove some rows from a file because of corruption, however > upgradesstables will not. > > If you have long lived rows and a mixed work load of writes and deletes > there are a couple of options. > > You can try levelled compaction > http://www.datastax.com/dev/blog/when-to-use-leveled-compaction > > You can tune the default sized tiered compaction by increasing the > min_compaction_threshold. This will increase the number of files that must > exist in each size tier before it will be compacted. As a result the speed > at which rows move into the higher tiers will slow down. > > Note that having lots of files may have a negative impact on read > performance. You can measure this my looking at the SSTables per read > metric in the cfhistograms. > > Lastly you can run a user defined or major compaction. User defined > compaction is available via JMX and allows you to compact any file you > want. Manual / major compaction is available via node tool. We usually > discourage it's use as it will create one big file that will not get > compacted for a while. > > > For background the tombstones / expired columns for a row are only purged > from the database when all fragments of the row are in the files been > compacted. So if you have an old row that is spread out over many files it > may not get purged. > > Hope that helps. > > > >- > Aaron Morton > Freelance Cassandra Developer > New Zealand > > @aaronmorton > http://www.thelastpickle.com > > On 14/12/2012, at 3:01 AM, Mike Smith wrote: > > I'm using 1.0.12 and I find that large sstables tend to get compacted > infrequently. I've got data that gets deleted or expired frequently. Is it > possible to use scrub to accelerate the clean up of expired/deleted data? > > -- > Mike Smith > Director Development, MailChannels > > > -- Mike Smith Director Development, MailChannels
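To make Aaron's suggestions concrete, this is roughly how I'd check the SSTables-per-read metric and raise the size-tiered threshold on 1.0.x. The keyspace/column family names are placeholders, and I'm going from memory on the cli syntax, so double-check before running it:

    nodetool -h localhost cfhistograms MyKeyspace MyCF
    # the "SSTables" column shows how many sstables each read touched

    # then, in cassandra-cli:
    use MyKeyspace;
    update column family MyCF with min_compaction_threshold = 8;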
CQL3 Blob Value Literal?
Does CQL3 support blob/BytesType literals for INSERT, UPDATE etc commands? I looked at the CQL3 syntax (http://cassandra.apache.org/doc/cql3/CQL.html) and at the DataStax 1.2 docs. As for why I'd want such a thing, I just wanted to initialize some test values for a blob column with cqlsh. Thanks!
Re: Node selection when both partition key and secondary index field constrained?
Thanks Aaron. So basically it's merging the results 2 separate queries: Indexed scan (token-range) intersect foo.flag_index=true where the latter query hits the entire cluster as per the secondary index FAQ entry. Thus the overall query would fail if LOCAL_QUORUM was requested, RF=3 and 2 nodes in a given replication group were down. Darn. Is there any way of efficiently getting around this (ie scope the query to just the nodes in the token range)? On Mon, Jan 28, 2013 at 11:44 AM, aaron morton wrote: > It uses the index... > > cqlsh:dev> tracing on; > Now tracing requests. > cqlsh:dev> > cqlsh:dev> > cqlsh:dev> SELECT id, flag from foo WHERE TOKEN(id) > '-9939393' AND > TOKEN(id) <= '0' AND flag=true; > > Tracing session: 128cab90-6982-11e2-8cd1-51eaa232562e > > activity | timestamp| > source| source_elapsed > > +--+---+ > execute_cql3_query | 08:36:55,244 | > 127.0.0.1 | 0 > Parsing statement | 08:36:55,244 | > 127.0.0.1 |600 > Peparing statement | 08:36:55,245 | > 127.0.0.1 | 1408 > Determining replicas to query | 08:36:55,246 | > 127.0.0.1 | 1924 > Executing indexed scan for (max(-9939393), max(0)] | 08:36:55,247 | > 127.0.0.1 | 2956 > Executing single-partition query on foo.flag_index | 08:36:55,247 | > 127.0.0.1 | 3192 >Acquiring sstable references | 08:36:55,247 | > 127.0.0.1 | 3220 > Merging memtable contents | 08:36:55,247 | > 127.0.0.1 | 3265 >Scanned 0 rows and matched 0 | 08:36:55,247 | > 127.0.0.1 | 3396 >Request complete | 08:36:55,247 | > 127.0.0.1 | 3644 > > > It reads from the secondary index and discards keys that are outside of > the token range. > > Cheers > > > - > Aaron Morton > Freelance Cassandra Developer > New Zealand > > @aaronmorton > http://www.thelastpickle.com > > On 28/01/2013, at 4:24 PM, Mike Sample wrote: > > > Does the following FAQ entry hold even when the partion key is also > constrained in the query (by token())? > > > > http://wiki.apache.org/cassandra/SecondaryIndexes: > > == > >Q: How does choice of Consistency Level affect cluster availability > when using secondary indexes? > > > >A: Because secondary indexes are distributed, you must have CL nodes > available for all token ranges in the cluster in order to complete a query. > For example, with RF = 3, when two out of three consecutive nodes in the > ring are unavailable, all secondary index queries at CL = QUORUM will fail, > however secondary index queries at CL = ONE will succeed. This is true > regardless of cluster size." > > == > > > > For example: > > > > CREATE TABLE foo ( > > id uuid, > > seq_num bigint, > > flag boolean, > > some_other_data blob, > > PRIMARY KEY (id,seq_num) > > ); > > > > CREATE INDEX flag_index ON foo (flag); > > > > SELECT id, flag from foo WHERE TOKEN(id) > '-9939393' AND TOKEN(id) <= > '0' AND flag=true; > > > > Would the above query with LOCAL_QUORUM succeed given the following? IE > is the token range used first trim node selection? > > > > * the cluster has 18 nodes > > * foo is in a keyspace with a replication factor of 3 for that data > center > > * 2 nodes in one of the replication groups are down > > * the token range in the query is not in the range of the down nodes > > > > > > Thanks in advance! > >
CQL3 PreparedStatement - parameterized timestamp
Is there a way to re-use a prepared statement with different "using timestamp" values? BEGIN BATCH USING INSERT INTO Foo (a,b,c) values (?,?,?) ... APPLY BATCH; Once bound or while binding the prepared statement to specific values, I'd like to set the timestamp value. Putting a question mark in for timestamp failed as expected and I don't see a method on the DataStax java driver BoundStatement for setting it. Thanks in advance. /Mike Sample
Re: CQL3 PreparedStatement - parameterized timestamp
Thanks Sylvain. I should have scanned Jira first. Glad to see it's on the todo list. On Wed, Feb 6, 2013 at 12:24 AM, Sylvain Lebresne wrote: > Not yet: https://issues.apache.org/jira/browse/CASSANDRA-4450 > > -- > Sylvain > > > On Wed, Feb 6, 2013 at 9:06 AM, Mike Sample wrote: > >> Is there a way to re-use a prepared statement with different "using >> timestamp" values? >> >> BEGIN BATCH USING >> INSERT INTO Foo (a,b,c) values (?,?,?) >> ... >> APPLY BATCH; >> >> Once bound or while binding the prepared statement to specific values, >> I'd like to set the timestamp value. >> >> Putting a question mark in for timestamp failed as expected and I don't >> see a method on the DataStax java driver BoundStatement for setting it. >> >> Thanks in advance. >> >> /Mike Sample >> > >
backing up and restoring from only 1 replica?
It has been suggested to me that we could save a fair amount of time and money by taking a snapshot of only 1 replica (so every third node for most column families). Assuming that we are okay with not having the absolute latest data, does this have any possibility of working? I feel like it shouldn't but don't really know the argument for why it wouldn't.
Re: backing up and restoring from only 1 replica?
Thanks for the response. Could you elaborate more on the bad things that happen during a restart or message drops that would cause a 1 replica restore to fail? I'm completely on board with not using a restore process that nobody else uses, but I need to convince somebody else who thinks that it will work that it is not a good idea. On 3/4/2013 7:54 AM, aaron morton wrote: That would be OK only if you never had node go down (e.g. a restart) or drop messages. It's not something I would consider trying. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 28/02/2013, at 3:21 PM, Mike Koh wrote: It has been suggested to me that we could save a fair amount of time and money by taking a snapshot of only 1 replica (so every third node for most column families). Assuming that we are okay with not having the absolute latest data, does this have any possibility of working? I feel like it shouldn't but don't really know the argument for why it wouldn't.
changing compaction strategy
I'm trying to change compaction strategy one node at a time. I'm using jmxterm like this: `echo 'set -b org.apache.cassandra.db:type=ColumnFamilies,keyspace=my_ks,columnfamily=my_cf CompactionParametersJson \{"class":"TimeWindowCompactionStrategy","compaction_window_unit":"HOURS","compaction_window_size":"6"\}' | java -jar jmxterm-1.0-alpha-4-uber.jar --url localhost:7199` and I see this in the cassandra logs: INFO [RMI TCP Connection(37)-127.0.0.1] 2017-03-13 20:29:08,251 CompactionStrategyManager.java:841 - Switching local compaction strategy from CompactionParams{class=org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy, options={max_threshold=32, min_threshold=4}} to CompactionParams{class=org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy, options={min_threshold=4, max_threshold=32, compaction_window_unit=HOURS, compaction_window_size=6}}} After doing this, `nodetool compactionstats` shows 1 pending compaction, but none running. Also, cqlsh describe shows the old compaction strategy still. Am I missing a step?
Re: changing compaction strategy
Some more info: - running C* 3.9 - I tried `nodetool flush` on the column family this change applies to, and while it does seem to trigger compactions, there is still one pending that won't seem to run - I tried `nodetool compact` on the column family as well, with a similar affect Is there a way to tell when/if the local node has successfully updated the compaction strategy? Looking at the sstable files, it seems like they are still based on STCS but I don't know how to be sure. Appreciate any tips or suggestions! On Mon, Mar 13, 2017 at 5:30 PM, Mike Torra wrote: > I'm trying to change compaction strategy one node at a time. I'm using > jmxterm like this: > > `echo 'set -b > org.apache.cassandra.db:type=ColumnFamilies,keyspace=my_ks,columnfamily=my_cf > CompactionParametersJson \{"class":"TimeWindowCompactionStrategy", > "compaction_window_unit":"HOURS","compaction_window_size":"6"\}' | java > -jar jmxterm-1.0-alpha-4-uber.jar --url localhost:7199` > > and I see this in the cassandra logs: > > INFO [RMI TCP Connection(37)-127.0.0.1] 2017-03-13 20:29:08,251 > CompactionStrategyManager.java:841 - Switching local compaction strategy > from > CompactionParams{class=org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy, > options={max_threshold=32, min_threshold=4}} to CompactionParams{class=org. > apache.cassandra.db.compaction.TimeWindowCompactionStrategy, > options={min_threshold=4, max_threshold=32, compaction_window_unit=HOURS, > compaction_window_size=6}}} > > After doing this, `nodetool compactionstats` shows 1 pending compaction, > but none running. Also, cqlsh describe shows the old compaction strategy > still. Am I missing a step? >
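In case it helps, one way to see what the node itself thinks its local strategy is: read the same attribute back through jmxterm (same jar and MBean as the set command above):

    echo 'get -b org.apache.cassandra.db:type=ColumnFamilies,keyspace=my_ks,columnfamily=my_cf CompactionParametersJson' | \
        java -jar jmxterm-1.0-alpha-4-uber.jar --url localhost:7199

As far as I understand it, the JMX override is local-only and does not touch the schema, so cqlsh DESCRIBE will keep showing the old strategy either way.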
sstableloader limitations in multi-dc cluster
I'm trying to use sstableloader to bulk load some data to my 4 DC cluster, and I can't quite get it to work. Here is how I'm trying to run it: sstableloader -d 127.0.0.1 -i {csv list of private ips of nodes in cluster} myks/mttest At first this seems to work, with a steady stream of logging like this (eventually getting to 100%): progress: [/10.0.1.225]0:13/13 100% [/10.0.0.134]0:13/13 100% [/10.0.0.119]0:13/13 100% [/10.0.1.26]0:13/13 100% [/10.0.3.188]0:13/13 100% [/10.0.3.189]0:13/13 100% [/10.0.2.95]0:13/13 100% total: 100% 0.000KiB/s (avg: 13.857MiB/s) There will be some errors sprinkled in like this: ERROR 15:35:43 [Stream #707f0920-5760-11e7-8ede-37de75ac1efa] Streaming error occurred on session with peer 10.0.2.9 java.net.NoRouteToHostException: No route to host Then, at the end, there will be one last warning about the failed streams: WARN 15:38:03 [Stream #707f0920-5760-11e7-8ede-37de75ac1efa] Stream failed Streaming to the following hosts failed: [/127.0.0.1, {list of same private ips as above}] I am perplexed about the failures because I am trying to explicitly ignore the nodes in remote DC's via the -i option to sstableloader. Why doesn't this work? I've tried using the public IP's instead just for kicks, but that doesn't change anything. I don't see anything helpful in the cassandra logs (including debug logs). Also, why is localhost in the list of failures? I can query the data locally after the sstableloader command completes. I've also noticed that sstableloader fails completely (even locally) while I am decomissioning or bootstrapping a node in a remote DC. Is this a limitation of sstableloader? I haven't been able to find documentation about this.
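Given the NoRouteToHostException, it may be worth ruling out plain connectivity from the loader host to the failing peers on the storage port (7000 by default, assuming storage_port hasn't been changed), since as I understand it sstableloader streams directly to the owning nodes:

    nc -zv 10.0.2.9 7000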
WG: How to sort result-set? / How to proper model a table?
Hey everyone, I'm new to Cassandra, taking my first steps, and I have a problem/question regarding sorting results and proper data modelling. First of all, I read the article "We Shall Have Order!" by Aaron Ploetz (1) to get a first view of how Cassandra works. I reproduced the example in the article with my own table.

DROP TABLE sensors;
CREATE TABLE sensors (
  timestamp BIGINT,
  name VARCHAR,
  value VARCHAR,
  unit VARCHAR,
  PRIMARY KEY (name, timestamp)
) WITH gc_grace_seconds = 0
  AND CLUSTERING ORDER BY (timestamp DESC);

I'm actually running Cassandra on a single node ([cqlsh 5.0.1 | Cassandra 3.11.0 | CQL spec 3.4.4 | Native protocol v4]).

Now some background information about my project: I want to store all kinds of measuring data from all kinds of sensors. It doesn't matter whether the sensor is measuring a temperature, water flow, or whatever. Sensors always give a single value. Interpretation has to be done afterwards by the user. So in my example, I'm measuring temperatures in my house, which leads me to the following data:

timestamp            name         value  unit
2017-07-24 14-11-00  entrance-a   20     Celsius
2017-07-24 14-11-04  living-room  24     Celsius
2017-07-24 14-11-07  bath-room    22     Celsius
2017-07-24 14-11-15  bed-room     23     Celsius
2017-07-24 14-11-22  entrance-b   20     Celsius

I'm measuring time-triggered every 15 minutes. In order to have some kind of start and end for each process, I decided to measure the entrance twice with differently named sensors (entrance a and b). So the above is one set of measuring data, created by a single process. I'd say this is just another perfect example of what Aaron Ploetz describes in his article. When I query Cassandra, the result set is not sorted by timestamp as long as I don't use the primary key in my WHERE clause.

When I ask myself "What will I query Cassandra for?" I always come up with the same typical thoughts:

* LIST all measurements in a specific timespan ORDERED BY timestamp ASC/DESC
  - Requires ALLOW FILTERING
  - Won't be sorted
* LIST all measurements for a specific sensor ORDERED BY timestamp ASC/DESC
  - Sorted result. OK.
* And stuff the future will bring which I simply don't know now.

So for querying Cassandra for measurements in a specific timespan I can't find a solid solution. My first idea was:

* Add a column "sequence" which can be used to bundle a set of measurements

DROP TABLE sensors;
CREATE TABLE sensors (
  timestamp BIGINT,
  name VARCHAR,
  value VARCHAR,
  unit VARCHAR,
  sequence INT,
  PRIMARY KEY (sequence, timestamp)
) WITH gc_grace_seconds = 0
  AND CLUSTERING ORDER BY (timestamp DESC);

  - I won't need to measure the entrance twice
  - I can query for a timespan as long as the timespan is within a sequence.
  - But when I query a timespan containing more than a single sequence, the result set is not correctly sorted again:

sequence  timestamp            name         value  unit
123       2017-07-24 14-11-22  entrance-b   20     Celsius
123       2017-07-24 14-11-15  bed-room     23     Celsius
123       2017-07-24 14-11-07  bath-room    22     Celsius
123       2017-07-24 14-11-04  living-room  24     Celsius
123       2017-07-24 14-11-00  entrance-a   20     Celsius
124       2017-07-24 15-11-22  entrance-b   22     Celsius
124       2017-07-24 15-11-15  bed-room     25     Celsius
124       2017-07-24 15-11-07  bath-room    24     Celsius
124       2017-07-24 15-11-04  living-room  26     Celsius
124       2017-07-24 15-11-00  entrance-a   22     Celsius

  - Besides: it's not recommended to use a "dummy" column, especially not as a primary or clustering key.

How to solve this problem? I believe I can't be the only one who has this requirement. Imho "sort it on the client side" can't be the solution.
As soon as the data gets bigger we simply can't "just" sort on the client side. So my next idea was to use the table as the overall data store, create another table, and periodically transfer data from the main table to the child table. But I believe I'll get the same problem, because Cassandra simply doesn't sort the way an RDBMS does. So there must be an idea behind Cassandra's philosophy here. Can anyone help me out? Best regards Mike Wenzel (1) https://www.datastax.com/dev/blog/we-shall-have-order
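For what it's worth, with the second table above the kind of query that does come back sorted is one that pins the partition (sequence) and ranges over the clustering column. The values below are placeholders, since the real timestamps are bigints:

    SELECT * FROM sensors
    WHERE sequence = 123
      AND timestamp >= 1500905460
      AND timestamp <= 1500905482;

    -- within a single partition the order can also be flipped explicitly
    SELECT * FROM sensors
    WHERE sequence = 123
    ORDER BY timestamp ASC;

As soon as more than one sequence is involved, rows only come back sorted per partition, which is exactly the problem described above.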
node restart causes application latency
Hi - I am running a 29 node cluster spread over 4 DC's in EC2, using C* 3.11.1 on Ubuntu. Occasionally I have the need to restart nodes in the cluster, but every time I do, I see errors and application (nodejs) timeouts. I restart a node like this: nodetool disablethrift && nodetool disablegossip && nodetool drain sudo service cassandra restart When I do that, I very often get timeouts and errors like this in my nodejs app: Error: Cannot achieve consistency level LOCAL_ONE My queries are all pretty much the same, things like: "select * from history where ts > {current_time}" The errors and timeouts seem to go away on their own after a while, but it is frustrating because I can't track down what I am doing wrong! I've tried waiting between steps of shutting down cassandra, and I've tried stopping, waiting, then starting the node. One thing I've noticed is that even after `nodetool drain`ing the node, there are open connections to other nodes in the cluster (ie looking at the output of netstat) until I stop cassandra. I don't see any errors or warnings in the logs. What can I do to prevent this? Is there something else I should be doing to gracefully restart the cluster? It could be something to do with the nodejs driver, but I can't find anything there to try. I appreciate any suggestions or advice. - Mike
Re: node restart causes application latency
Thanks for the feedback guys. That example data model was indeed abbreviated - the real queries have the partition key in them. I am using RF 3 on the keyspace, so I don't think a node being down would mean the key I'm looking for would be unavailable. The load balancing policy of the driver seems correct ( https://docs.datastax.com/en/developer/nodejs-driver/3.4/features/tuning-policies/#load-balancing-policy, and I am using the default `TokenAware` policy with `DCAwareRoundRobinPolicy` as a child), but I will look more closely at the implementation. It was an oversight of mine to not include `nodetool disablebinary`, but I still experience the same issue with that. One other thing I've noticed is that after restarting a node and seeing application latency, I also see that the node I just restarted sees many other nodes in the same DC as being down (ie status 'DN'). However, checking `nodetool status` on those other nodes shows all nodes as up/normal. To me this could kind of explain the problem - node comes back online, thinks it is healthy but many others are not, so it gets traffic from the client application. But then it gets requests for ranges that belong to a node it thinks is down, so it responds with an error. The latency issue seems to start roughly when the node goes down, but persists long (ie 15-20 mins) after it is back online and accepting connections. It seems to go away once the bounced node shows the other nodes in the same DC as up again. As for speculative retry, my CF is using the default of '99th percentile'. I could try something different there, but nodes being seen as down seems like an issue. On Tue, Feb 6, 2018 at 6:26 PM, Jeff Jirsa wrote: > Unless you abbreviated, your data model is questionable (SELECT without > any equality in the WHERE clause on the partition key will always cause a > range scan, which is super inefficient). Since you're doing LOCAL_ONE and a > range scan, timeouts sorta make sense - the owner of at least one range > would be down for a bit. > > If you actually have a partition key in your where clause, then the next > most likely guess is your clients aren't smart enough to route around the > node as it restarts, or your key cache is getting cold during the bounce. > Double check your driver's load balancing policy. > > It's also likely the case that speculative retry may help other nodes > route around the bouncing instance better - if you're not using it, you > probably should be (though with CL: LOCAL_ONE, it seems like it'd be less > of an issue). > > We need to make bouncing nodes easier (or rather, we need to make drain do > the right thing), but in this case, your data model looks like the biggest > culprit (unless it's an incomplete recreation). > > - Jeff > > > On Tue, Feb 6, 2018 at 10:58 AM, Mike Torra wrote: > >> Hi - >> >> I am running a 29 node cluster spread over 4 DC's in EC2, using C* 3.11.1 >> on Ubuntu. Occasionally I have the need to restart nodes in the cluster, >> but every time I do, I see errors and application (nodejs) timeouts. 
>> >> I restart a node like this: >> >> nodetool disablethrift && nodetool disablegossip && nodetool drain >> sudo service cassandra restart >> >> When I do that, I very often get timeouts and errors like this in my >> nodejs app: >> >> Error: Cannot achieve consistency level LOCAL_ONE >> >> My queries are all pretty much the same, things like: "select * from >> history where ts > {current_time}" >> >> The errors and timeouts seem to go away on their own after a while, but >> it is frustrating because I can't track down what I am doing wrong! >> >> I've tried waiting between steps of shutting down cassandra, and I've >> tried stopping, waiting, then starting the node. One thing I've noticed is >> that even after `nodetool drain`ing the node, there are open connections to >> other nodes in the cluster (ie looking at the output of netstat) until I >> stop cassandra. I don't see any errors or warnings in the logs. >> >> What can I do to prevent this? Is there something else I should be doing >> to gracefully restart the cluster? It could be something to do with the >> nodejs driver, but I can't find anything there to try. >> >> I appreciate any suggestions or advice. >> >> - Mike >> > >
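Regarding speculative retry, if you do want to experiment with it, it is just a table property. The keyspace name and values below are examples, not recommendations:

    ALTER TABLE my_ks.history WITH speculative_retry = 'ALWAYS';
    -- or a fixed latency threshold instead of a percentile:
    ALTER TABLE my_ks.history WITH speculative_retry = '50ms';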
Re: node restart causes application latency
No, I am not On Wed, Feb 7, 2018 at 11:35 AM, Jeff Jirsa wrote: > Are you using internode ssl? > > > -- > Jeff Jirsa > > > On Feb 7, 2018, at 8:24 AM, Mike Torra wrote: > > Thanks for the feedback guys. That example data model was indeed > abbreviated - the real queries have the partition key in them. I am using > RF 3 on the keyspace, so I don't think a node being down would mean the key > I'm looking for would be unavailable. The load balancing policy of the > driver seems correct (https://docs.datastax.com/en/ > developer/nodejs-driver/3.4/features/tuning-policies/# > load-balancing-policy, and I am using the default `TokenAware` policy > with `DCAwareRoundRobinPolicy` as a child), but I will look more closely at > the implementation. > > It was an oversight of mine to not include `nodetool disablebinary`, but I > still experience the same issue with that. > > One other thing I've noticed is that after restarting a node and seeing > application latency, I also see that the node I just restarted sees many > other nodes in the same DC as being down (ie status 'DN'). However, > checking `nodetool status` on those other nodes shows all nodes as > up/normal. To me this could kind of explain the problem - node comes back > online, thinks it is healthy but many others are not, so it gets traffic > from the client application. But then it gets requests for ranges that > belong to a node it thinks is down, so it responds with an error. The > latency issue seems to start roughly when the node goes down, but persists > long (ie 15-20 mins) after it is back online and accepting connections. It > seems to go away once the bounced node shows the other nodes in the same DC > as up again. > > As for speculative retry, my CF is using the default of '99th percentile'. > I could try something different there, but nodes being seen as down seems > like an issue. > > On Tue, Feb 6, 2018 at 6:26 PM, Jeff Jirsa wrote: > >> Unless you abbreviated, your data model is questionable (SELECT without >> any equality in the WHERE clause on the partition key will always cause a >> range scan, which is super inefficient). Since you're doing LOCAL_ONE and a >> range scan, timeouts sorta make sense - the owner of at least one range >> would be down for a bit. >> >> If you actually have a partition key in your where clause, then the next >> most likely guess is your clients aren't smart enough to route around the >> node as it restarts, or your key cache is getting cold during the bounce. >> Double check your driver's load balancing policy. >> >> It's also likely the case that speculative retry may help other nodes >> route around the bouncing instance better - if you're not using it, you >> probably should be (though with CL: LOCAL_ONE, it seems like it'd be less >> of an issue). >> >> We need to make bouncing nodes easier (or rather, we need to make drain >> do the right thing), but in this case, your data model looks like the >> biggest culprit (unless it's an incomplete recreation). >> >> - Jeff >> >> >> On Tue, Feb 6, 2018 at 10:58 AM, Mike Torra >> wrote: >> >>> Hi - >>> >>> I am running a 29 node cluster spread over 4 DC's in EC2, using C* >>> 3.11.1 on Ubuntu. Occasionally I have the need to restart nodes in the >>> cluster, but every time I do, I see errors and application (nodejs) >>> timeouts. 
>>> >>> I restart a node like this: >>> >>> nodetool disablethrift && nodetool disablegossip && nodetool drain >>> sudo service cassandra restart >>> >>> When I do that, I very often get timeouts and errors like this in my >>> nodejs app: >>> >>> Error: Cannot achieve consistency level LOCAL_ONE >>> >>> My queries are all pretty much the same, things like: "select * from >>> history where ts > {current_time}" >>> >>> The errors and timeouts seem to go away on their own after a while, but >>> it is frustrating because I can't track down what I am doing wrong! >>> >>> I've tried waiting between steps of shutting down cassandra, and I've >>> tried stopping, waiting, then starting the node. One thing I've noticed is >>> that even after `nodetool drain`ing the node, there are open connections to >>> other nodes in the cluster (ie looking at the output of netstat) until I >>> stop cassandra. I don't see any errors or warnings in the logs. >>> >>> What can I do to prevent this? Is there something else I should be doing >>> to gracefully restart the cluster? It could be something to do with the >>> nodejs driver, but I can't find anything there to try. >>> >>> I appreciate any suggestions or advice. >>> >>> - Mike >>> >> >> >
Re: node restart causes application latency
Any other ideas? If I simply stop the node, there is no latency problem, but once I start the node the problem appears. This happens consistently for all nodes in the cluster On Wed, Feb 7, 2018 at 11:36 AM, Mike Torra wrote: > No, I am not > > On Wed, Feb 7, 2018 at 11:35 AM, Jeff Jirsa wrote: > >> Are you using internode ssl? >> >> >> -- >> Jeff Jirsa >> >> >> On Feb 7, 2018, at 8:24 AM, Mike Torra wrote: >> >> Thanks for the feedback guys. That example data model was indeed >> abbreviated - the real queries have the partition key in them. I am using >> RF 3 on the keyspace, so I don't think a node being down would mean the key >> I'm looking for would be unavailable. The load balancing policy of the >> driver seems correct (https://docs.datastax.com/en/ >> developer/nodejs-driver/3.4/features/tuning-policies/#load- >> balancing-policy, and I am using the default `TokenAware` policy with >> `DCAwareRoundRobinPolicy` as a child), but I will look more closely at the >> implementation. >> >> It was an oversight of mine to not include `nodetool disablebinary`, but >> I still experience the same issue with that. >> >> One other thing I've noticed is that after restarting a node and seeing >> application latency, I also see that the node I just restarted sees many >> other nodes in the same DC as being down (ie status 'DN'). However, >> checking `nodetool status` on those other nodes shows all nodes as >> up/normal. To me this could kind of explain the problem - node comes back >> online, thinks it is healthy but many others are not, so it gets traffic >> from the client application. But then it gets requests for ranges that >> belong to a node it thinks is down, so it responds with an error. The >> latency issue seems to start roughly when the node goes down, but persists >> long (ie 15-20 mins) after it is back online and accepting connections. It >> seems to go away once the bounced node shows the other nodes in the same DC >> as up again. >> >> As for speculative retry, my CF is using the default of '99th >> percentile'. I could try something different there, but nodes being seen as >> down seems like an issue. >> >> On Tue, Feb 6, 2018 at 6:26 PM, Jeff Jirsa wrote: >> >>> Unless you abbreviated, your data model is questionable (SELECT without >>> any equality in the WHERE clause on the partition key will always cause a >>> range scan, which is super inefficient). Since you're doing LOCAL_ONE and a >>> range scan, timeouts sorta make sense - the owner of at least one range >>> would be down for a bit. >>> >>> If you actually have a partition key in your where clause, then the next >>> most likely guess is your clients aren't smart enough to route around the >>> node as it restarts, or your key cache is getting cold during the bounce. >>> Double check your driver's load balancing policy. >>> >>> It's also likely the case that speculative retry may help other nodes >>> route around the bouncing instance better - if you're not using it, you >>> probably should be (though with CL: LOCAL_ONE, it seems like it'd be less >>> of an issue). >>> >>> We need to make bouncing nodes easier (or rather, we need to make drain >>> do the right thing), but in this case, your data model looks like the >>> biggest culprit (unless it's an incomplete recreation). >>> >>> - Jeff >>> >>> >>> On Tue, Feb 6, 2018 at 10:58 AM, Mike Torra >>> wrote: >>> >>>> Hi - >>>> >>>> I am running a 29 node cluster spread over 4 DC's in EC2, using C* >>>> 3.11.1 on Ubuntu. 
Occasionally I have the need to restart nodes in the >>>> cluster, but every time I do, I see errors and application (nodejs) >>>> timeouts. >>>> >>>> I restart a node like this: >>>> >>>> nodetool disablethrift && nodetool disablegossip && nodetool drain >>>> sudo service cassandra restart >>>> >>>> When I do that, I very often get timeouts and errors like this in my >>>> nodejs app: >>>> >>>> Error: Cannot achieve consistency level LOCAL_ONE >>>> >>>> My queries are all pretty much the same, things like: "select * from >>>> history where ts > {current_time}" >>>> >>>> The errors and timeouts seem to go away on their own after a while, but >>>> it is frustrating because I can't track down what I am doing wrong! >>>> >>>> I've tried waiting between steps of shutting down cassandra, and I've >>>> tried stopping, waiting, then starting the node. One thing I've noticed is >>>> that even after `nodetool drain`ing the node, there are open connections to >>>> other nodes in the cluster (ie looking at the output of netstat) until I >>>> stop cassandra. I don't see any errors or warnings in the logs. >>>> >>>> What can I do to prevent this? Is there something else I should be >>>> doing to gracefully restart the cluster? It could be something to do with >>>> the nodejs driver, but I can't find anything there to try. >>>> >>>> I appreciate any suggestions or advice. >>>> >>>> - Mike >>>> >>> >>> >> >
Re: node restart causes application latency
Interestingly, it seems that changing the order of steps I take during the node restart resolves the problem. Instead of: `nodetool disablebinary && nodetool disablethrift && *nodetool disablegossip* && nodetool drain && sudo service cassandra restart`, if I do: `nodetool disablebinary && nodetool disablethrift && nodetool drain && *nodetool disablegossip* && sudo service cassandra restart`, I see no application errors, no latency, and no nodes marked as Down/Normal on the restarted node. Note the only thing I changed is that I moved `nodetool disablegossip` to after `nodetool drain`. This is pretty anecdotal, but is there any explanation for why this might happen? I'll be monitoring my cluster closely to see if this change does indeed fix the problem. On Mon, Feb 12, 2018 at 9:33 AM, Mike Torra wrote: > Any other ideas? If I simply stop the node, there is no latency problem, > but once I start the node the problem appears. This happens consistently > for all nodes in the cluster > > On Wed, Feb 7, 2018 at 11:36 AM, Mike Torra wrote: > >> No, I am not >> >> On Wed, Feb 7, 2018 at 11:35 AM, Jeff Jirsa wrote: >> >>> Are you using internode ssl? >>> >>> >>> -- >>> Jeff Jirsa >>> >>> >>> On Feb 7, 2018, at 8:24 AM, Mike Torra wrote: >>> >>> Thanks for the feedback guys. That example data model was indeed >>> abbreviated - the real queries have the partition key in them. I am using >>> RF 3 on the keyspace, so I don't think a node being down would mean the key >>> I'm looking for would be unavailable. The load balancing policy of the >>> driver seems correct (https://docs.datastax.com/en/ >>> developer/nodejs-driver/3.4/features/tuning-policies/#load-b >>> alancing-policy, and I am using the default `TokenAware` policy with >>> `DCAwareRoundRobinPolicy` as a child), but I will look more closely at the >>> implementation. >>> >>> It was an oversight of mine to not include `nodetool disablebinary`, but >>> I still experience the same issue with that. >>> >>> One other thing I've noticed is that after restarting a node and seeing >>> application latency, I also see that the node I just restarted sees many >>> other nodes in the same DC as being down (ie status 'DN'). However, >>> checking `nodetool status` on those other nodes shows all nodes as >>> up/normal. To me this could kind of explain the problem - node comes back >>> online, thinks it is healthy but many others are not, so it gets traffic >>> from the client application. But then it gets requests for ranges that >>> belong to a node it thinks is down, so it responds with an error. The >>> latency issue seems to start roughly when the node goes down, but persists >>> long (ie 15-20 mins) after it is back online and accepting connections. It >>> seems to go away once the bounced node shows the other nodes in the same DC >>> as up again. >>> >>> As for speculative retry, my CF is using the default of '99th >>> percentile'. I could try something different there, but nodes being seen as >>> down seems like an issue. >>> >>> On Tue, Feb 6, 2018 at 6:26 PM, Jeff Jirsa wrote: >>> >>>> Unless you abbreviated, your data model is questionable (SELECT without >>>> any equality in the WHERE clause on the partition key will always cause a >>>> range scan, which is super inefficient). Since you're doing LOCAL_ONE and a >>>> range scan, timeouts sorta make sense - the owner of at least one range >>>> would be down for a bit. 
>>>> >>>> If you actually have a partition key in your where clause, then the >>>> next most likely guess is your clients aren't smart enough to route around >>>> the node as it restarts, or your key cache is getting cold during the >>>> bounce. Double check your driver's load balancing policy. >>>> >>>> It's also likely the case that speculative retry may help other nodes >>>> route around the bouncing instance better - if you're not using it, you >>>> probably should be (though with CL: LOCAL_ONE, it seems like it'd be less >>>> of an issue). >>>> >>>> We need to make bouncing nodes easier (or rather, we need to make drain >>>> do the right thing), but in this case, your data model looks like the >>>> biggest culprit (unless it's an incomplete recreation).
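For anyone hitting the same symptom, a minimal sketch of how you might check what the bounced node thinks of its peers and experiment with a more aggressive speculative retry. The keyspace and table names (mykeyspace.history) are placeholders, and '50ms' is only an example value, not a recommendation.

# Run on the restarted node: does it still see peers in the local DC as DN?
nodetool status
nodetool gossipinfo
# Optionally try a more aggressive speculative retry on the affected table
cqlsh -e "ALTER TABLE mykeyspace.history WITH speculative_retry = '50ms';"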
Re: node restart causes application latency
Then could it be that calling `nodetool drain` after calling `nodetool disablegossip` is what causes the problem? On Mon, Feb 12, 2018 at 6:12 PM, kurt greaves wrote: > > Actually, it's not really clear to me why disablebinary and thrift are > necessary prior to drain, because they happen in the same order during > drain anyway. It also really doesn't make sense that disabling gossip after > drain would make a difference here, because it should be already stopped. > This is all assuming drain isn't erroring out. >
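For reference, a minimal sketch of the reordered sequence being discussed; the service name and the pause length are assumptions about the environment, and per Kurt's note the disablegossip step may be redundant once drain has completed.

nodetool disablebinary
nodetool disablethrift
nodetool drain
sleep 10                 # let in-flight requests settle and peers notice
nodetool disablegossip   # possibly redundant after drain, kept for parity with the working sequence
sudo service cassandra restart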
Re: Slow bulk loading
It sounds as though you could be having trouble with garbage collection. Check your Cassandra system logs and search for "GC". If you see frequent garbage collections taking more than a second or two to complete, you're going to need to do some configuration tweaking. On 05/07/2015 04:44 AM, Pierre Devops wrote: Hi, I'm streaming a big sstable using the sstableloader bulk loader, but it's very slow (3 MB/sec): Summary statistics: Connections per host: 1 Total files transferred: 1 Total bytes transferred: 10357947484 Total duration (ms): 3280229 Average transfer rate (MB/s): 3 Peak transfer rate (MB/s): 3 I'm on a single-node configuration, with an empty keyspace and table, and good hardware (8x2.8GHz, 32GB RAM) dedicated to Cassandra, so there are plenty of resources for the process. I'm uploading from another server. The sstable is 9GB in size and has 4 partitions, but a lot of rows per partition (around 100 million); the clustering key is an INT and there are 4 other regular columns, so approximately 500 million cells per ColumnFamily. When I upload, I notice one core of the Cassandra node is at full CPU (all other cores are idling), so I assume I'm CPU bound on the node side. But why? What is the node doing? Why does it take so long? -- Mike Neir Liquid Web, Inc. Infrastructure Administrator
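A quick way to follow up on the GC suggestion above, plus a look at what the receiving node is busy with while the stream runs; the log path is an assumption based on the default package layout.

grep GCInspector /var/log/cassandra/system.log | tail -n 20   # long or frequent pauses?
nodetool netstats          # progress of the incoming stream
nodetool compactionstats   # is the node also busy compacting?
nodetool tpstats           # any saturated thread pools?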
Counters 2.1 Accuracy
Hi All, I'm fairly new to Cassandra and am planning on using it as a datastore for an Apache Spark cluster. The use case is fairly simple: read the raw data, perform aggregates, and push the rolled-up data back to Cassandra. The data models will use counters pretty heavily, so I'd like to understand what kind of accuracy I should expect from Cassandra 2.1 when incrementing counters. - http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-1-a-better-implementation-of-counters The blog post above states that the new counter implementation is "safer", although I'm not sure what that means in practice. Will the counters be 99.99% accurate? How often will they be over- or under-counted? Thanks, Mike.
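As a small illustration of the pattern in question (the rollups keyspace and page_hits table are made-up names): as I understand it, the 2.1 rewrite is mainly about removing internal sources of overcount in the old shard design, but increments are still not idempotent, so a client that blindly retries an increment after a timeout can still over- or under-count.

cqlsh -e "
CREATE KEYSPACE IF NOT EXISTS rollups WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE IF NOT EXISTS rollups.page_hits (page text PRIMARY KEY, hits counter);
UPDATE rollups.page_hits SET hits = hits + 1 WHERE page = 'home';
SELECT page, hits FROM rollups.page_hits WHERE page = 'home';"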
making sense of output from Eclipse Memory Analyzer tool taken from .hprof file
I am investigating Java out-of-memory heap errors. I created an .hprof file and loaded it into the Eclipse Memory Analyzer Tool, which gave some "Problem Suspects". The first one looks like: One instance of "org.apache.cassandra.db.ColumnFamilyStore" loaded by "sun.misc.Launcher$AppClassLoader @ 0x613e1bdc8" occupies 984,094,664 (11.64%) bytes. The memory is accumulated in one instance of "org.apache.cassandra.db.DataTracker$View" loaded by "sun.misc.Launcher$AppClassLoader @ 0x613e1bdc8". If I click around into the verbiage, I believe I can pick out the name of a column family, but that is about it. Can someone explain what the above means in more detail and whether it is indicative of a problem? The next one looks like: java.lang.Thread @ 0x73e1f74c8 CompactionExecutor:158 - 839,225,000 (9.92%) bytes; java.lang.Thread @ 0x717f08178 MutationStage:31 - 809,909,192 (9.58%) bytes; java.lang.Thread @ 0x717f082c8 MutationStage:5 - 649,667,472 (7.68%) bytes; java.lang.Thread @ 0x717f083a8 MutationStage:21 - 498,081,544 (5.89%) bytes; java.lang.Thread @ 0x71b357e70 MutationStage:11 - 444,931,288 (5.26%) bytes. If I click into the verbiage, the above CompactionExecutor and MutationStage threads all seem to be referencing the same column family. Are they related? Is there a way to tell more specifically what is being compacted and/or mutated, beyond which column family it belongs to?
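On the "what exactly is being compacted or mutated" question, the heap dump mostly tells you which ColumnFamilyStore is holding memory after the fact; nodetool can usually answer the runtime half directly. A small sketch, assuming the usual tools are on the path and the process can be found with pgrep:

nodetool compactionstats   # which keyspace/column family each running compaction belongs to, and progress
nodetool tpstats           # pending/active counts for MutationStage, CompactionExecutor, etc.
nodetool cfstats           # per-column-family memtable and sstable sizes, to correlate with the MAT suspects
jmap -dump:format=b,file=/tmp/cassandra-heap.hprof "$(pgrep -f CassandraDaemon)"   # fresh heap dump if needed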
Re: Is there any open source software for automatized deploy C* in PRD?
Hi Boole, Have you tried Chef? There is this cookbook for deploying Cassandra: http://community.opscode.com/cookbooks/cassandra MikeA On 21 November 2013 01:33, Boole.Z.Guo (mis.cnsh04.Newegg) 41442 <boole.z@newegg.com> wrote: > Hi all, > > Is there any open source software for automatized deploy C* in PRD? > > > > Best Regards, > > Boole Guo > > Software Engineer, NESC-SH.MIS > > +86-021-51530666*41442 > > Floor 19, KaiKai Plaza, 888, Wanhangdu Rd, Shanghai (200042) > > ONCE YOU KNOW, YOU NEWEGG. > > CONFIDENTIALITY NOTICE: This email and any files transmitted with it may > contain privileged or otherwise confidential information. It is intended > only for the person or persons to whom it is addressed. If you received > this message in error, you are not authorized to read, print, retain, copy, > disclose, disseminate, distribute, or use this message or any part thereof or > any information contained therein. Please notify the sender immediately and > delete all copies of this message. Thank you in advance for your > cooperation.
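A minimal sketch of pulling that cookbook onto a Chef 11-era workstation and assigning it to a node; the node name (cass-node-1) is a placeholder, and Berkshelf would be the other common way to manage the dependency.

knife cookbook site install cassandra          # fetch the community cookbook into your chef-repo
knife cookbook upload cassandra                # push it to the Chef server
knife node run_list add cass-node-1 'recipe[cassandra]'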
How to restart bootstrap after a failed streaming due to Broken Pipe (1.2.16)
Hi, During an attempt to bootstrap a new node into a 1.2.16 ring the new node saw one of the streaming nodes periodically disappear: INFO [GossipTasks:1] 2014-06-10 00:28:52,572 Gossiper.java (line 823) InetAddress /10.156.1.2 is now DOWN ERROR [GossipTasks:1] 2014-06-10 00:28:52,574 AbstractStreamSession.java (line 108) Stream failed because /10.156.1.2 died or was restarted/removed (streams may still be active in background, but further streams won't be started) WARN [GossipTasks:1] 2014-06-10 00:28:52,574 RangeStreamer.java (line 246) Streaming from /10.156.1.2 failed INFO [HANDSHAKE-/10.156.1.2] 2014-06-10 00:28:57,922 OutboundTcpConnection.java (line 418) Handshaking version with /10.156.1.2 INFO [GossipStage:1] 2014-06-10 00:28:57,943 Gossiper.java (line 809) InetAddress /10.156.1.2 is now UP This brief interruption was enough to kill the streaming from node 10.156.1.2. Node 10.156.1.2 saw a similar "broken pipe" exception from the bootstrapping node: ERROR [Streaming to /10.156.193.1.3] 2014-06-10 01:22:02,345 CassandraDaemon.java (line 191) Exception in thread Thread[Streaming to / 10.156.1.3:1,5,main] java.lang.RuntimeException: java.io.IOException: Broken pipe at com.google.common.base.Throwables.propagate(Throwables.java:160) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) Caused by: java.io.IOException: Broken pipe at sun.nio.ch.FileChannelImpl.transferTo0(Native Method) at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:420) at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:552) at org.apache.cassandra.streaming.compress.CompressedFileStreamTask.stream(CompressedFileStreamTask.java:93) at org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:91) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) During bootstrapping we notice a significant spike in CPU and latency across the board on the ring (CPU 50->85% and write latencies 60ms -> 250ms). It seems likely that this persistent high load led to the hiccup that caused the gossiper to see the streaming node as briefly down. What is the proper way to recover from this? The original estimate was almost 24 hours to stream all the data required to bootstrap this single node (streaming set to unlimited) and this occurred 6 hours into the bootstrap. With such high load from streaming it seems that simply restarting will inevitably hit this problem again. Cheers, Mike -- Mike Heffner Librato, Inc.
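A hedged sketch of the usual recovery path for a bootstrap that failed partway on 1.2.x: stop the joining node, clear its partial state, throttle streaming on the established nodes so they aren't pushed into long pauses, and consider raising phi_convict_threshold so a brief hiccup isn't treated as node death. The data directory paths assume the default package layout, and the throttle value is only an example.

# On the failed bootstrapping node: stop it and wipe the partial state before retrying
sudo service cassandra stop
sudo rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*
# On the established nodes: cap outbound streaming instead of leaving it unlimited (value in megabits/sec);
# if your nodetool lacks this command, set stream_throughput_outbound_megabits_per_sec in cassandra.yaml instead
nodetool setstreamthroughput 200
# Optionally raise phi_convict_threshold (cassandra.yaml, default 8) on all nodes, e.g. to 12,
# then start the new node again to re-bootstrap
sudo service cassandra start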