RE: high context switches

2014-11-24 Thread Jan Karlsson
We use CQL with 1 session per client and default connection settings.

I do not think that we are using too many client threads. Number of native 
transport threads is set to default (max 128).
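
For what it's worth, if the client is the DataStax Java driver, the number of
connections a single client opens per host (and therefore the number of
server-side connections it drives) can be pinned down explicitly. A minimal
sketch, assuming the Java driver 2.0 API; the contact point and pool sizes are
illustrative, not taken from the setup above:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.HostDistance;
    import com.datastax.driver.core.PoolingOptions;

    public class PoolingCheck {
        public static void main(String[] args) {
            // Cap the per-host connection pool so one client cannot fan out too widely.
            PoolingOptions pooling = new PoolingOptions()
                    .setCoreConnectionsPerHost(HostDistance.LOCAL, 2)
                    .setMaxConnectionsPerHost(HostDistance.LOCAL, 4);
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")
                    .withPoolingOptions(pooling)
                    .build();
            // Print the effective setting back out for verification.
            System.out.println(cluster.getConfiguration().getPoolingOptions()
                    .getCoreConnectionsPerHost(HostDistance.LOCAL));
            cluster.close();
        }
    }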


From: Robert Coli [mailto:rc...@eventbrite.com]
Sent: den 21 november 2014 19:30
To: user@cassandra.apache.org
Subject: Re: high context switches

On Fri, Nov 21, 2014 at 1:21 AM, Jan Karlsson <jan.karls...@ericsson.com> wrote:
Nothing really wrong with that; however, I would like to understand why these 
numbers are so high. Have others noticed this behavior? How much context 
switching is expected, and why? What are the variables that affect this?

I +1 Nikolai's conjecture that you are probably using a very high number of 
client threads.

However, as a general statement, Cassandra is highly multi-threaded. Threads are 
assigned within thread pools, and these thread pools can be thought of as a type 
of processing pipeline, such that the output of one is often the input to another. 
When pushing Cassandra near its maximum capacity, you will therefore spend a lot 
of time switching between threads.

=Rob
http://twitter.com/rcolidba


Re: [jira] Akhtar Hussain shared a search result with you

2014-11-24 Thread Akhtar Hussain
This error occurred when we took one node in the remote DC down. Our main
concern is the *org.apache.cassandra.thrift*.*TimedOutException* exception
in our application logs. Why did the read fail when we used LOCAL_QUORUM?
Failure of a node in the other DC should not impact our DC if we are using
LOCAL_QUORUM. The second question is: why is Rapid Read Protection not working
as expected?

Br/

On Fri, Nov 21, 2014 at 5:33 PM, Akhtar Hussain 
wrote:

> That's true. Will re-look into the server logs and get back.
>
> Br/Akhtar
>
> On Fri, Nov 21, 2014 at 5:09 PM, Mark Reddy 
> wrote:
>
>> I believe you were attempting to share:
>> https://issues.apache.org/jira/browse/CASSANDRA-8352
>>
>> Your Cassandra log outputs the following:
>>
>>> DEBUG [Thrift:4] 2014-11-20 15:36:50,653 CustomTThreadPoolServer.java
>>> (line 204) Thrift transport error occurred during processing of message.
>>> org.apache.thrift.transport.TTransportException: Cannot read. Remote
>>> side has closed. Tried to read 4 bytes, but only got 0 bytes. (This is
>>> often indicative of an internal error on the server side. Please check your
>>> server logs.)
>>
>>
>> This indicates that your server is under pressure at that moment and
>> points you to your server logs for further diagnosis.
>>
>>
>> Mark
>>
>> On 21 November 2014 11:15, Akhtar Hussain (JIRA)  wrote:
>>
>>> Akhtar Hussain shared a search result with you
>>> -
>>>
>>>
>>> https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=reporter+%3D+currentUser%28%29+ORDER+BY+createdDate+DESC
>>>
>>> We have a geo-redundant setup with 2 data centers having 3 nodes each.
>>> When we bring a single Cassandra node down in DC2 by kill -9
>>> , reads fail on DC1 with TimedOutException for a brief
>>> amount of time (~15-20 sec).
>>>
>>>
>>>
>>>
>>> --
>>> This message was sent by Atlassian JIRA
>>> (v6.3.4#6332)
>>>
>>
>>
>


Re: Compaction Strategy guidance

2014-11-24 Thread Nikolai Grigoriev
Jean-Armel,

I have only two large tables, the rest is super-small. In the test cluster
of 15 nodes the largest table has about 110M rows. Its total size is about
1,26Gb per node (total disk space used per node for that CF). It's got
about 5K sstables per node - the sstable size is 256Mb. cfstats on a
"healthy" node look like this:

Read Count: 8973748
Read Latency: 16.130059053251774 ms.
Write Count: 32099455
Write Latency: 1.6124713938912671 ms.
Pending Tasks: 0
Table: wm_contacts
SSTable count: 5195
SSTables in each level: [27/4, 11/10, 104/100, 1053/1000, 4000, 0,
0, 0, 0]
Space used (live), bytes: 1266060391852
Space used (total), bytes: 1266144170869
SSTable Compression Ratio: 0.32604853410787327
Number of keys (estimate): 25696000
Memtable cell count: 71402
Memtable data size, bytes: 26938402
Memtable switch count: 9489
Local read count: 8973748
Local read latency: 17.696 ms
Local write count: 32099471
Local write latency: 1.732 ms
Pending tasks: 0
Bloom filter false positives: 32248
Bloom filter false ratio: 0.50685
Bloom filter space used, bytes: 20744432
Compacted partition minimum bytes: 104
Compacted partition maximum bytes: 3379391
Compacted partition mean bytes: 172660
Average live cells per slice (last five minutes): 495.0
Average tombstones per slice (last five minutes): 0.0

Another table of similar structure (same number of rows) is about 4x
smaller. That table does not suffer from those issues - it compacts well
and efficiently.

On Mon, Nov 24, 2014 at 2:30 AM, Jean-Armel Luce  wrote:

> Hi Nikolai,
>
> Please could you clarify a little bit what you call "a large amount of
> data" ?
>
> How many tables ?
> How many rows in your largest table ?
> How many GB in your largest table ?
> How many GB per node ?
>
> Thanks.
>
>
>
> 2014-11-24 8:27 GMT+01:00 Jean-Armel Luce :
>
>> Hi Nikolai,
>>
>> Thanks for those informations.
>>
>> Please could you clarify a little bit what you call "
>>
>> 2014-11-24 4:37 GMT+01:00 Nikolai Grigoriev :
>>
>>> Just to clarify - when I was talking about the large amount of data I
>>> really meant a large amount of data per node in a single CF (table). LCS does
>>> not seem to like it when it gets thousands of sstables (makes 4-5 levels).
>>>
>>> When bootstrapping a new node you'd better enable the option from
>>> CASSANDRA-6621 (the one that disables STCS in L0). But it will still be a
>>> mess - I have a node that I bootstrapped ~2 weeks ago. Initially it
>>> had 7.5K pending compactions; now it has almost stabilized at 4.6K and does
>>> not go down. The number of sstables at L0 is over 11K and it is only slowly
>>> building the upper levels. The total number of sstables is 4x the normal amount.
>>> Now I am not entirely sure if this node will ever get back to normal life.
>>> And believe me - this is not because of I/O; I have SSDs everywhere and 16
>>> physical cores. This machine is barely using 1-3 cores most of the time.
>>> The problem is that allowing the STCS fallback is not a good option either - it
>>> will quickly result in a few 200Gb+ sstables in my configuration, and then
>>> these sstables will never be compacted. Plus, it will require close to 2x
>>> disk space on EVERY disk in my JBOD configuration...this will kill the node
>>> sooner or later. This is all because all sstables after bootstrap end up at L0
>>> and then the process slowly moves them to the other levels. If you have
>>> write traffic to that CF then the number of sstables at L0 will grow
>>> quickly - like it is happening in my case now.
>>>
>>> Once something like https://issues.apache.org/jira/browse/CASSANDRA-8301
>>> is implemented it may be better.
>>>
>>>
>>> On Sun, Nov 23, 2014 at 4:53 AM, Andrei Ivanov 
>>> wrote:
>>>
 Stephane,

 We are having a somewhat similar C* load profile. Hence some comments
 in addition to Nikolai's answer.
 1. Fallback to STCS - you can disable it actually
 2. Based on our experience, if you have a lot of data per node, LCS
 may work just fine. That is, till the moment you decide to join
 another node - chances are that the newly added node will not be able
 to compact what it gets from old nodes. In your case, if you switch
 strategy the same thing may happen. This is all due to limitations
 mentioned by Nikolai.

 Andrei,


 On Sun, Nov 23, 2014 at 8:51 AM, Servando Muñoz G. 
 wrote:
 > ABUSE
 >
 >
 >
 > I DON'T WANT ANY MORE MAILS, I AM FROM MEXICO
 >
 >
 >
 > From: Nikolai Grigoriev [mailto:ngrigor...@gmail.com]
 > Sent: Saturday, November 22, 2014, 07:13 PM
 > To: user@cassandra.apache.org
 > Subject: Re: Compaction Strategy guidance
 > Importance: High
 >
 >
 >
 > Stephane,
 >
 > As everything good, LCS 

Re: Compaction Strategy guidance

2014-11-24 Thread Andrei Ivanov
Nikolai,

Are you sure about 1.26Gb? Like it doesn't look right - 5195 tables
with 256Mb table size...

Andrei

On Mon, Nov 24, 2014 at 5:09 PM, Nikolai Grigoriev  wrote:
> Jean-Armel,
>
> I have only two large tables, the rest is super-small. In the test cluster
> of 15 nodes the largest table has about 110M rows. Its total size is about
> 1,26Gb per node (total disk space used per node for that CF). It's got about
> 5K sstables per node - the sstable size is 256Mb. cfstats on a "healthy"
> node look like this:

Re: Compaction Strategy guidance

2014-11-24 Thread Nikolai Grigoriev
Andrei,

Oh, Monday mornings...Tb :)

On Mon, Nov 24, 2014 at 9:12 AM, Andrei Ivanov  wrote:

> Nikolai,
>
> Are you sure about 1.26Gb? Like it doesn't look right - 5195 tables
> with 256Mb table size...
>
> Andrei

Re: Compaction Strategy guidance

2014-11-24 Thread Andrei Ivanov
Nikolai,

This is more or less what I'm seeing on my cluster then. Trying to
switch to bigger sstables right now (1Gb)
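
For reference, the sstable target size is a per-table compaction option, so the
switch can be made online. A hedged sketch through the Java driver (the keyspace
and table names are placeholders; only sstables written or compacted after the
change pick up the new size, so existing ones are rewritten gradually):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class BumpSstableSize {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();
            // Raise the LCS sstable target from 256 MB to 1 GB for one table.
            session.execute("ALTER TABLE my_keyspace.wm_contacts WITH compaction = "
                    + "{'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 1024}");
            cluster.close();
        }
    }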

On Mon, Nov 24, 2014 at 5:18 PM, Nikolai Grigoriev  wrote:
> Andrei,
>
> Oh, Monday mornings...Tb :)

Re: Compaction Strategy guidance

2014-11-24 Thread Nikolai Grigoriev
I was thinking about that option and I would be curious to find out how
this change helps you. I suspect that increasing the sstable size won't
help too much because the compaction throughput (per task/thread) is still
the same. So, it will simply take 4x longer to finish a compaction task. It
is possible that, because of that, the CPU will be under-used for even
longer.

My data model, unfortunately, requires this amount of data. And I suspect
that regardless of how it is organized I won't be able to optimize it - I
do need these rows to be in one row so I can read them quickly.

One of the obvious recommendations I have received was to run more than one
instance of C* per host. It makes sense - it would reduce the amount of data
per node and make better use of the resources. I would go for it
myself, but it may be a challenge for the people in operations. Without a
VM it would be trickier for them to operate such a setup, and I do not
want any VMs there.

Another option is probably to simply shard my data across several
identical tables in the same keyspace. I could also think about different
keyspaces, but I prefer not to spread the data for the same logical "tenant"
across multiple keyspaces. I would use my primary key's hash, do something
like mod 4, and append that to the table name :) This would effectively
reduce the number of sstables and the amount of data per table (CF). I kind
of like this idea more - yes, a bit more of a challenge at the coding level,
but obvious benefits without extra operational complexity.
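
A minimal sketch of that sharding idea (the shard count, table-name pattern, and
key type are assumptions; String.hashCode() is stable across JVMs, so the
key-to-table mapping stays deterministic):

    public final class TableSharder {
        private static final int SHARDS = 4;

        /** Pick one of N identical tables from a stable hash of the partition key. */
        public static String tableFor(String partitionKey) {
            int shard = (partitionKey.hashCode() & Integer.MAX_VALUE) % SHARDS;
            return "wm_contacts_" + shard;
        }

        public static void main(String[] args) {
            // The same key always lands in the same table, e.g. wm_contacts_0..wm_contacts_3.
            System.out.println(tableFor("tenant-42:doc-1"));
        }
    }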


On Mon, Nov 24, 2014 at 9:32 AM, Andrei Ivanov  wrote:

> Nikolai,
>
> This is more or less what I'm seeing on my cluster then. Trying to
> switch to bigger sstables right now (1Gb)

Re: Compaction Strategy guidance

2014-11-24 Thread Andrei Ivanov
OK, let's see - my cluster is recompacting now;-) I will let you know
if this helps

On Mon, Nov 24, 2014 at 5:48 PM, Nikolai Grigoriev  wrote:
> I was thinking about that option and I would be curious to find out how does
> this change help you. I suspected that increasing sstable size won't help
> too much because the compaction throughput (per task/thread) is still the
> same. So, it will simply take 4x longer to finish a compaction task. It is
> possible that because of that the CPU will be under-used for even longer.

Re: Getting the counters with the highest values

2014-11-24 Thread Eric Stevens
You're right that there's no way to use the counter data type to
materialize a view ordered by the counter.  Computing this post hoc is the
way to go if your needs allow for it (if not, something like Summingbird or
vanilla Storm may be necessary).

I might suggest that you make the primary key for your running-totals-by-day
table ((day), doc_id), because it will make it easy to compute the
materialized ordered view (SELECT doc_id, count FROM running_totals WHERE
day=?), unless you expect to have a really large number of documents getting
counts each day.

For your materialized ordering, I'd suggest a primary key of ((day), count)
as then for a given day you'll be able to select top by count (SELECT
count, doc_id FROM doc_counts WHERE day=? ORDER BY count DESC).

One more thing to consider, if your users are not all in a single timezone,
is having your time bucket be an hour instead of a day, so that you can compute
per-day totals bounded by local midnight (except for the handful of locations
that use half-hour timezone offsets) instead of a single global midnight. You
can then either include just each hour in each row (and aggregate at read
time), or make each row a rolling 24 hours (aggregating at write time),
depending on which use case fits your needs better.
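
A minimal CQL sketch of the two tables described above, issued through the Java
driver. The keyspace, table, and column names are illustrative (the keyspace is
assumed to already exist), and doc_id is added as a final clustering column in
the ordering table so rows stay unique per document:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class PopularDocsSchema {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("counters_ks");

            // Running totals: one counter per (day bucket, document).
            session.execute("CREATE TABLE IF NOT EXISTS running_totals ("
                    + " day text, doc_id bigint, views counter,"
                    + " PRIMARY KEY ((day), doc_id))");

            // Materialized ordering, rebuilt after midnight: clustered by count, highest first.
            session.execute("CREATE TABLE IF NOT EXISTS doc_counts ("
                    + " day text, views bigint, doc_id bigint,"
                    + " PRIMARY KEY ((day), views, doc_id))"
                    + " WITH CLUSTERING ORDER BY (views DESC, doc_id ASC)");

            // On a document view: bump the counter for today's bucket.
            session.execute("UPDATE running_totals SET views = views + 1"
                    + " WHERE day = '2014-11-24' AND doc_id = 42");

            // Top documents for a finished day.
            session.execute("SELECT doc_id, views FROM doc_counts"
                    + " WHERE day = '2014-11-23' LIMIT 10");

            cluster.close();
        }
    }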

On Sun Nov 23 2014 at 8:42:11 AM Robert Wille  wrote:

> I’m working on moving a bunch of counters out of our relational database
> to Cassandra. For the most part, Cassandra is a very nice fit, except for
> one feature on our website. We manage a time series of view counts for each
> document, and display a list of the most popular documents in the last
> seven days. This seems like a pretty strong anti-pattern for Cassandra, but
> also seems like something a lot of people would want to do. If you’re
> keeping counters, it's pretty likely that you'd want to know which ones have
> the highest counts.
>
> Here’s what I came up with to implement this feature. Create a counter
> table with primary key (doc_id, day) and a single counter. Whenever a
> document is viewed, increment the counter for the document for today and
> the previous six days. Sometime after midnight each day, compile the
> counters into a table with primary key (day, count, doc_id) and no
> additional columns. For each partition in the counter table, I would sum up
> the counters, delete any counters that are over a week old, and put the sum
> into the second table with day = today. When I query the table, I would ask
> for data where day = yesterday. During the compilation process, I would
> delete old partitions. In theory I’d only need two partitions. One that is
> being built, and one for querying.
>
> I’d be interested to hear critiques on this strategy, as well as hearing
> how other people have implemented a "most-popular" feature using Cassandra
> counters.
>
> Robert
>
>


Re: Getting the counters with the highest values

2014-11-24 Thread Robert Wille
We do get a large number of documents getting counts each day, which is why I'm 
thinking the running-totals table should be ((doc_id), day) rather than ((day), 
doc_id). We have too many documents per day to materialize in memory, so 
querying per day and aggregating the results isn't really possible.

I’m planning on bucketing the materialized ordering because we get enough 
unique document views per day that the rows will be quite large. Not so large 
as to be unmanageable, but pushing the limits. If we were so lucky as to get a 
significant increase in traffic, I might start having issues. I didn’t include 
bucketing in my post because I didn’t want to complicate my question. I hadn’t 
considered that I could bucket by hour and then use a local midnight instead of 
a global midnight. Interesting idea.

Thanks for your response.

Robert

On Nov 24, 2014, at 9:40 AM, Eric Stevens <migh...@gmail.com> wrote:

You're right that there's no way to use the counter data type to materialize a 
view ordered by the counter.  Computing this post hoc is the way to go if your 
needs allow for it (if not, something like Summingbird or vanilla Storm may be 
necessary).





Re: Repair completes successfully but data is still inconsistent

2014-11-24 Thread André Cruz
On 21 Nov 2014, at 19:01, Robert Coli  wrote:
> 
> 2- Why won’t repair propagate this column value to the other nodes? Repairs 
> have run every day and the value is still missing on the other nodes.
> 
> No idea. Are you sure it's not expired via TTL or masked in some other way? 
> When you ask that node for it at CL.ONE, do you get this value?

This data does not use TTLs. What other reason could there be for a mask? If I 
connect using cassandra-cli to that specific node, which becomes the 
coordinator, is it guaranteed to not ask another node when CL is ONE and it 
contains that row?

>  
> "Cassandra has never stored data consistently except by fortunate accident."

I wish I had read that a few years back. :)

Thank you,
André

Re: Repair completes successfully but data is still inconsistent

2014-11-24 Thread Robert Coli
On Mon, Nov 24, 2014 at 10:39 AM, André Cruz  wrote:

> This data does not use TTLs. What other reason could there be for a mask?
> If I connect using cassandra-cli to that specific node, which becomes the
> coordinator, is it guaranteed to not ask another node when CL is ONE and it
> contains that row?
>

Other than rare cases of writing "doomstones" (deletes with timestamps in
the future), I'm not sure in what case this might occur.

But for any given value on any given node, you can verify the value it has
in 100% of SSTables... that's what both the normal read path and repair
should do when reconciling row fragments into the materialized row. It's hard
to understand a case where repair fails, and I might provide that set of
SSTables attached to an Apache JIRA.

=Rob


Re: Compaction Strategy guidance

2014-11-24 Thread Robert Coli
On Mon, Nov 24, 2014 at 6:48 AM, Nikolai Grigoriev 
wrote:

> One of the obvious recommendations I have received was to run more than
> one instance of C* per host. Makes sense - it will reduce the amount of
> data per node and will make better use of the resources.
>

This is usually a Bad Idea to do in production.

=Rob


What causes NoHostAvailableException, WriteTimeoutException, and UnavailableException?

2014-11-24 Thread Kevin Burton
I’m trying to track down some exceptions in our production cluster.  I
bumped up our write load and now I’m getting a non-trivial number of these
exceptions.  Somewhere on the order of 100 per hour.

All machines have a somewhat high CPU load because they're doing other
tasks. I'm worried that perhaps my background tasks are just overloading
Cassandra, and one way to mitigate this is to nice them to the least favorable
priority (this is my first task).

But I can't seem to really track down any documentation on HOW to tune
Cassandra to prevent these. I mean, I get the core theory behind all of this;
I just need to track down the docs so I can actually RTFM :)



-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile




Re: What causes NoHostAvailableException, WriteTimeoutException, and UnavailableException?

2014-11-24 Thread Bulat Shakirzyanov
Check out Ruby Driver documentation on these topics:

Error Handling

Retry Policies


While the documentation is for the Ruby Driver, the concepts were borrowed
from and map directly to the Java Driver.
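
For illustration, a hedged Java-driver sketch that wires in a retry policy and
distinguishes the three exception types from the subject line (the contact
point, keyspace, and table are placeholders):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.exceptions.NoHostAvailableException;
    import com.datastax.driver.core.exceptions.UnavailableException;
    import com.datastax.driver.core.exceptions.WriteTimeoutException;
    import com.datastax.driver.core.policies.DefaultRetryPolicy;

    public class WriteWithRetry {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")
                    .withRetryPolicy(DefaultRetryPolicy.INSTANCE) // or another RetryPolicy
                    .build();
            Session session = cluster.connect("my_ks");
            try {
                session.execute("INSERT INTO events (id, payload) VALUES (uuid(), 'x')");
            } catch (WriteTimeoutException e) {
                // Replicas did not acknowledge within write_request_timeout_in_ms; the write may still apply.
            } catch (UnavailableException e) {
                // The coordinator saw too few live replicas for the requested consistency level.
            } catch (NoHostAvailableException e) {
                // The driver could not reach (or exhausted) every candidate coordinator.
            } finally {
                cluster.close();
            }
        }
    }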

Cheers,

On Mon, Nov 24, 2014 at 12:57 PM, Kevin Burton  wrote:

> I’m trying to track down some exceptions in our production cluster.  I
> bumped up our write load and now I’m getting a non-trivial number of these
> exceptions.  Somewhere on the order of 100 per hour.


-- 
*Bulat Shakirzyanov* | Software Alchemist

*a: *about.me/avalanche123
*e:* mallluh...@gmail.com


Re: What causes NoHostAvailableException, WriteTimeoutException, and UnavailableException?

2014-11-24 Thread Shane Hansen
Not sure if this is what you're looking for, but api docs can be useful (I
won't copy/paste the docs themselves)

http://www.datastax.com/drivers/java/2.0/com/datastax/driver/core/exceptions/NoHostAvailableException.html

http://www.datastax.com/drivers/java/2.0/com/datastax/driver/core/exceptions/WriteTimeoutException.html
(Not very helpful docs, but I would assume this shows up when the
write_request_timeout_in_ms parameter from cassandra.yaml is exceeded.)

http://www.datastax.com/drivers/java/2.0/com/datastax/driver/core/exceptions/UnavailableException.html

jvisualvm or jmc will have some good information on average write latencies
which might point you in the right direction.


On Mon, Nov 24, 2014 at 1:05 PM, Bulat Shakirzyanov 
wrote:

> Check out Ruby Driver documentation on these topics:
>
> Error Handling
> 
> Retry Policies
> 
>
> While the documentation is for the Ruby Driver, the concepts were borrowed
> from and map directly to the Java Driver


Re: What causes NoHostAvailableException, WriteTimeoutException, and UnavailableException?

2014-11-24 Thread Robert Coli
On Mon, Nov 24, 2014 at 12:57 PM, Kevin Burton  wrote:

> I’m trying to track down some exceptions in our production cluster.  I
> bumped up our write load and now I’m getting a non-trivial number of these
> exceptions.  Somewhere on the order of 100 per hour.
>
> All machines have a somewhat high CPU load because they’re doing other
> tasks.  I’m worried that perhaps my background tasks are just overloading
> cassandra and one way to mitigate this is to nice them to least favorable
> priority (this is my first tasks).
>

Two out of three of them are timeouts or lack of availability. Seeing this
across your cluster is usually associated with hitting a "pre-fail"
condition in terms of GC, where the amount of data stored per node makes
the steady state working set larger than available non-fragmented heap. If
you're graphing GC time, I would expect to see a concomitant spike there.

=Rob


Re: What causes NoHostAvailableException, WriteTimeoutException, and UnavailableException?

2014-11-24 Thread Parag Shah
In our case, the timeouts were happening because internode authentication was 
turned on and by default the user column family in the system_auth keyspace is 
replicated only on 1 node. We also had to tune the permissions_validity_in_ms 
from the default of 2000 ms to a larger value. The issue was that all 
authentication requests would go to one node, since it was replicated only on 1 
node. We set replication factor to n (# of nodes) on the system_auth keyspace.

Hope this helps.

Parag

From: Robert Coli <rc...@eventbrite.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Monday, November 24, 2014 at 2:52 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: What causes NoHostAvailableException, WriteTimeoutException, and UnavailableException?


Two out of three of them are timeouts or lack of availability. Seeing this 
across your cluster is usually associated with hitting a "pre-fail" condition 
in terms of GC, where the amount of data stored per node makes the steady state 
working set larger than available non-fragmented heap. If you're graphing GC 
time, I would expect to see a concomitant spike there.

=Rob



Cassandra version 1.0.10 Data Loss upon restart

2014-11-24 Thread Ankit Patel
We are experiencing data loss with Cassandra 1.0.10 when we restarted the node
without flushing. We see in the Cassandra logs that the commitlogs were
read back without any problems. Until the restart the data was correct.
However, after the node restarted we retrieved an older version of the data
(row caching is turned off). We are reading/writing to a single Cassandra
node that is replicated to a single-node setup at another data center. The
times are synchronized across our machines. Has anyone experienced this
type of behavior?


Thanks,

Ankit Patel


large range read in Cassandra

2014-11-24 Thread Dan Kinder
Hi,

We have a web crawler project currently based on Cassandra (
https://github.com/iParadigms/walker, written in Go and using the gocql
driver), with the following relevant usage pattern:

- Big range reads over a CF to grab potentially millions of rows and
dispatch new links to crawl
- Fast insert of new links (effectively using Cassandra to deduplicate)

We ultimately planned on doing the batch processing step (the dispatching)
in a system like Spark, but for the time being it is also in Go. We believe
this should work fine given that Cassandra now properly allows chunked
iteration of columns in a CF.
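
For reference, a minimal sketch of such a chunked scan using driver-side paging,
shown with the Java driver for brevity (the project itself uses gocql, where a
query page size plays the same role; the keyspace, table, and column names below
are assumptions):

    import com.datastax.driver.core.*;

    public class RangeScan {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("walker");
            Statement scan = new SimpleStatement("SELECT domain, path FROM links")
                    .setFetchSize(1000)                     // rows pulled per page
                    .setConsistencyLevel(ConsistencyLevel.ONE);
            for (Row row : session.execute(scan)) {         // pages are fetched lazily
                dispatch(row.getString("domain"), row.getString("path"));
            }
            cluster.close();
        }

        static void dispatch(String domain, String path) {
            // Hand the link off to the dispatcher.
        }
    }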

The issue is, periodically while doing a particularly large range read,
other operations time out because that node is "busy". In an experimental
cluster with only two nodes (and replication factor of 2), I'll get an
error like: "Operation timed out - received only 1 responses." Indicating
that the second node took too long to reply. At the moment I have the long
range reads set to consistency level ANY but the rest of the operations are
on QUORUM, so on this cluster they require responses from both nodes. The
relevant CF is also using LeveledCompactionStrategy. This happens in both
Cassandra 2 and 2.1.

Despite this error I don't see any significant I/O, memory consumption, or
CPU usage.

Here are some of the configuration values I've played with:

Increasing timeouts:
read_request_timeout_in_ms: 15000
range_request_timeout_in_ms: 3
write_request_timeout_in_ms: 1
request_timeout_in_ms: 1

Getting rid of caches we don't need:
key_cache_size_in_mb: 0
row_cache_size_in_mb: 0

Each of the 2 nodes has an HDD for the commit log and a single HDD I'm using for
data. Hence the following thread config (maybe since I/O is not an issue I
should increase these?):
concurrent_reads: 16
concurrent_writes: 32
concurrent_counter_writes: 32

Because I have a large number of columns and am not doing random I/O, I've
increased this:
column_index_size_in_kb: 2048

It's something of a mystery why this error comes up. Of course with a 3rd
node it will get masked if I am doing QUORUM operations, but it still seems
like it should not happen, and that there is some kind of head-of-line
blocking or other issue in Cassandra. I would like to increase the amount
of dispatching I'm doing, but because of this it bogs down if I do.

Any suggestions for other things we can try here would be appreciated.

-dan


Re: large range read in Cassandra

2014-11-24 Thread Robert Coli
On Mon, Nov 24, 2014 at 4:26 PM, Dan Kinder  wrote:

> We have a web crawler project currently based on Cassandra (
> https://github.com/iParadigms/walker, written in Go and using the gocql
> driver), with the following relevant usage pattern:
>
> - Big range reads over a CF to grab potentially millions of rows and
> dispatch new links to crawl
>

If you really mean millions of storage rows, this is just about the worst
case for Cassandra. The problem you're having is probably that you
shouldn't try to do this in Cassandra.

Your timeouts are either from the read actually taking longer than the
timeout or from the reads provoking heap pressure and resulting GC.

=Rob


Re: Cassandra version 1.0.10 Data Loss upon restart

2014-11-24 Thread Robert Coli
On Mon, Nov 24, 2014 at 5:51 PM, Robert Coli  wrote:

> What is your replication factor? What CL are you using to read?
>

Ah, I see from OP that RF is 1.

As a general statement, RF=1 is an edge case which very, very few people
have ever operated in production. It is relatively likely that there are
some undiscovered edge cases which relate to it.

That said, this would be a particularly glaring one, which I would expect
to be discovered in other contexts.

=Rob


Re: Cassandra version 1.0.10 Data Loss upon restart

2014-11-24 Thread Robert Coli
On Mon, Nov 24, 2014 at 3:19 PM, Ankit Patel  wrote:

> We are experiencing data loss with Cassandra 1.0.10 when we had restarted
> the without flushing. We see in the cassandra logs that the commitlogs were
> read back without any problems. Until the restart the data was correct.
> However, after the node restarted we retrieved older version of the data
> (row caching is turned off). We are reading/writing to a single cassandra
> node that is replicated to a single node setup at another data center. The
> times are synchronized across our machines. Has anyone experienced this
> type of behavior?
>

I'm not sure I've heard of a particular issue where data is not correctly
replayed after restart.

But 1.0.10 is the era of :

https://issues.apache.org/jira/browse/CASSANDRA-4446

"nodetool drain sometimes doesn't mark commitlog fully flushed"

... and there have been other bugs relating to commitlog replay since...

What is your replication factor? What CL are you using to read?

If you look in SStables on the affected node, which versions of the row
exist?

=Rob


Re: What causes NoHostAvailableException, WriteTimeoutException, and UnavailableException?

2014-11-24 Thread Robert Coli
On Mon, Nov 24, 2014 at 3:01 PM, Parag Shah  wrote:

>  In our case, the timeouts were happening because internode
> authentication was turned on and by default the user column family in the
> system_auth keyspace is replicated only on 1 node. We also had to tune the
> permissions_validity_in_ms from the default of 2000 ms to a larger value.
> The issue was that all authentication requests would go to one node, since
> it was replicated only on 1 node. We set replication factor to n (# of
> nodes) on the system_auth keyspace.
>

*Very* good note, in the future will be sure to ask if authentication is
enabled, as this has come up more and more recently. :D

=Rob


Re: Problem with performance, memory consumption, and RLIMIT_MEMLOCK

2014-11-24 Thread Dmitri Dmitrienko
Hi Jens,
I solved the problem by switching to PAGING mode. In this case it works
smoothly and does not require so many locks.
It was not clear in the beginning, and the only sample that demonstrates the
corresponding API (functions like cass_result_has_more_pages()) is
"paging". Hope this helps somebody.


On Sat, Nov 22, 2014 at 4:56 PM, Jens Rantil  wrote:

> Hi Dmitri,
>
> I have not used the CPP driver, but maybe you have forgotten to set the
> equivalent of the Java driver's fetch size to something sensible?
>
> Just an idea,
> Jens
>
> —
> Sent from Mailbox 
>
>
> On Sun, Nov 16, 2014 at 6:09 PM, Dmitri Dmitrienko 
> wrote:
>
>> Hi,
>> I have a very simple table in cassandra that contains only three columns:
>> id, time and blob with data. I added 1M rows of data and now the database
>> is about 12GB on disk.
>> 1M is only part of the data I want to store in the database; it's necessary
>> to synchronize this table with an external source. In order to do this, I have
>> to read id and time columns of all the rows and compare them with what I
>> see in the external source and insert/update/delete the rows where I see a
>> difference.
>> So, I'm trying to fetch the id and time columns from Cassandra. All of a sudden,
>> in 100% of my attempts, the server hangs for ~1 minute; while doing so it
>> loads >100% CPU, then abnormally terminates with an error saying I have to run
>> Cassandra as root or increase RLIMIT_MEMLOCK.
>> I increased RLIMIT_MEMLOCK to 1GB and it seems it still is not sufficient.
>> It seems Cassandra tries to read and lock the whole table in memory,
>> ignoring the fact that I need only two tiny columns (~12MB of data).
>>
>> This is how it works when I use the latest cpp-driver.
>> With cqlsh it works differently -- it shows the first page of data almost
>> immediately, without any noticeable delay.
>> Is there a way to have the cpp-driver work like cqlsh? I'd like to have
>> data sent to the client immediately upon availability without any attempts
>> to lock huge chunks of virtual memory.
>> My platform is 64-bit Linux (CentOS) with all necessary updates installed,
>> OpenJDK. I also tried Mac OS X with Oracle JDK. In that case I don't get the
>> RLIMIT_MEMLOCK error, but a regular out-of-memory error in system.log, although I
>> provided the server with a sufficiently large heap, as recommended (8GB).
>>
>>
>