Meaning of values in tpstats

2011-12-10 Thread Philippe
Hello,

Here's an example tpstats on one node in my cluster. I only issue
multigetslice reads to counter columns.
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                        27      2166     3565927301         0                 0
MutationStage                     1         1       55802973         0                 0

What exactly does ReadStage.Pending mean?

   1. the number of keys left to query (because I batch)?
   2. or the number of multigetslice requests issued to that node for
   execution?

Same question for MutationStage (mutating counter columns only)

Thanks


Re: memory leaks in 1.0.5

2011-12-10 Thread Radim Kolar



> and rows forever stuck in HintsColumnFamily
> You need to remove the hints data files to clear out the incomplete
> hints from < 1.0.3.

I did. The hints there are slowly increasing; I checked it today.


Re: Meaning of values in tpstats

2011-12-10 Thread Peter Schuller
> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
> ReadStage                        27      2166     3565927301         0                 0

In general, "active" refers to work that is being executed right now,
"pending" refers to work that is waiting to be executed (go into
"active"), and completed is the cumulative all-time (since node start)
count of the number of tasks executed.
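
If it helps, the same numbers nodetool reads can be polled over JMX by a
small monitoring client. The sketch below is untested and assumes the stage
pools are exposed under the org.apache.cassandra.request domain with
attributes named ActiveCount, PendingTasks and CompletedTasks; the object
and attribute names may differ between Cassandra versions, so treat them as
assumptions.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Untested sketch: poll the ReadStage pool over JMX. The object name and
    // attribute names are assumptions and may differ between versions.
    public class ReadStageWatcher {
        public static void main(String[] args) throws Exception {
            // 7199 is Cassandra's default JMX port; localhost is an assumption.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                ObjectName readStage =
                        new ObjectName("org.apache.cassandra.request:type=ReadStage");
                System.out.println("active="
                        + mbs.getAttribute(readStage, "ActiveCount")
                        + " pending=" + mbs.getAttribute(readStage, "PendingTasks")
                        + " completed=" + mbs.getAttribute(readStage, "CompletedTasks"));
            } finally {
                connector.close();
            }
        }
    }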

With the slicing, I'm not sure off the top of my head; I'm sure
someone else can chime in. For a multi-get, for example, the
individual gets end up as independent tasks.

Typically, having pending persistently above 0 for ReadStage or
MutationStage, especially if more than a handful, means that you are
having a performance issue - either a capacity problem or something
else - as incoming requests have to wait to be serviced. The most
common effect is that you are bottlenecking on I/O and ReadStage
pending shoots through the roof.

There are exceptions. If you e.g. submit a really large multi-get of
5000, that will naturally lead to a spike (and if all 5000 of them
need to go down to disk, the spike will survive for a bit). If you are
ONLY doing these queries, that's not a problem per se. But if you are
also expecting other requests to have low latency, then you want to
avoid it.

In general, batching is good - but don't overdo it, especially for
reads, and especially if you're going to disk for the workload.
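
As a rough illustration of keeping batches bounded, something like the
following could work (untested sketch; the CounterReader interface is just a
stand-in for whatever client call you actually use, not a real API):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch only: split one huge multi-get into smaller chunks so a single
    // request does not flood ReadStage pending on the replicas.
    public class ChunkedMultiGet {

        // Stand-in for your real client call (Hector/Thrift/etc.).
        interface CounterReader {
            Map<String, Long> multiGetSlice(List<String> keys);
        }

        static Map<String, Long> readInChunks(CounterReader reader,
                                              List<String> keys,
                                              int chunkSize) {
            Map<String, Long> result = new HashMap<String, Long>();
            for (int i = 0; i < keys.size(); i += chunkSize) {
                int end = Math.min(i + chunkSize, keys.size());
                // e.g. a few hundred keys per request instead of 5000 at once
                result.putAll(reader.multiGetSlice(
                        new ArrayList<String>(keys.subList(i, end))));
            }
            return result;
        }
    }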

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: ParNew and caching

2011-12-10 Thread Peter Schuller
> After re-reading my post, what I meant to say is that I switched from
> Serializing cache provider to ConcurrentLinkedHash cache provider and then
> saw better performance, but still far worse than no caching at all:
>
> - no caching at all : 25-30ms
> - with Serializing provider : 1300+ms
> - with Concurrent provider : 500ms
>
> 100% cache hit rate.  ParNew is the only stat that I see out of line, so
> seems like still a lot of copying

In general, if you want to get to the bottom of this stuff and you
think GC is involved, always run with -XX:+PrintGC -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps so that the GC activity
can be observed.

1300+ ms should not be due to GC unless you are falling back to full
GCs (which would be visible with GC logging), and it should definitely
be possible to keep full GCs from being extremely common (though
eliminating them entirely may not be possible).
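
If restarting with those flags is inconvenient, a coarser view of GC
activity is also available through the standard platform MXBeans. A minimal
local sketch (the 10-second polling interval is arbitrary):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import java.util.List;

    // Minimal sketch: print per-collector counts/times so ParNew vs. full GC
    // activity can be compared over time.
    public class GcWatcher {
        public static void main(String[] args) throws InterruptedException {
            List<GarbageCollectorMXBean> gcs =
                    ManagementFactory.getGarbageCollectorMXBeans();
            while (true) {
                for (GarbageCollectorMXBean gc : gcs) {
                    // With CMS + ParNew these are typically named
                    // "ParNew" and "ConcurrentMarkSweep".
                    System.out.println(gc.getName()
                            + " count=" + gc.getCollectionCount()
                            + " timeMs=" + gc.getCollectionTime());
                }
                Thread.sleep(10000);
            }
        }
    }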

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: CPU bound workload

2011-12-10 Thread Peter Schuller
> I've got a batch process running every so often that issues a bunch of
> counter increments. I have noticed that when this process runs without being
> throttled it will raise the CPU to 80-90% utilization on the nodes handling
> the requests. This in turn causes timeouts and general lag on queries running
> on the cluster.

This much is entirely expected. If you are not bottlenecking anywhere
else and are saturating the cluster, you will be bound by it, and it
will affect the latency of other traffic, no matter how fast or slow
Cassandra is.

You do say "nodes handling the requests". Two things to always keep in
mind are to (1) spread the requests evenly across all members of the
cluster, and (2) if you are doing a lot of work per row key, spread it
around and be concurrent so that you're not hitting a single row at a
time, which will be under the responsibility of a single set of RF
nodes (you want to put load on the entire cluster evenly if you want
to maximize throughput).
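
To make the "spread it around and be concurrent" point concrete, here is a
rough untested sketch; the CounterClient interface is only a placeholder for
whatever client library you use, and the thread-pool size is an arbitrary
starting point to tune:

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Sketch: submit increments for many different row keys concurrently so
    // the load lands on all RF sets in the cluster, not one row's replicas
    // at a time.
    public class SpreadIncrements {

        // Placeholder for your real client (Hector, Thrift, etc.).
        interface CounterClient {
            void increment(String rowKey, String column, long delta);
        }

        static void incrementAll(final CounterClient client,
                                 List<String> rowKeys,
                                 final String column) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(16); // tune to taste
            for (final String key : rowKeys) {
                pool.submit(new Runnable() {
                    public void run() {
                        client.increment(key, column, 1L);
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.MINUTES);
        }
    }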

> Is there anything that can be done to increase the throughput, I've been
> looking on the wiki and the mailing list and didn't find any optimization
> suggestions (apart from spreading the load on more nodes).
>
> Cluster is 5 node, BOP, RF=3, AMD opteron 4174 CPU (6 x 2.3 Ghz cores),
> Gigabit ethernet, RAID-0 SATA2 disks

For starters, what *is* the throughput? How many counter mutations are
you submitting per second?

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: Meaning of values in tpstats

2011-12-10 Thread Edward Capriolo
There was a recent patch that fixed an issue where counters were hitting
the same natural endpoint rather than being randomized across all of them.

On Saturday, December 10, 2011, Peter Schuller wrote:
>> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
>> ReadStage                        27      2166     3565927301         0                 0
>
> In general, "active" refers to work that is being executed right now,
> "pending" refers to work that is waiting to be executed (go into
> "active"), and completed is the cumulative all-time (since node start)
> count of the number of tasks executed.
>
> With the slicing, I'm not sure off the top of my head; I'm sure
> someone else can chime in. For a multi-get, for example, the
> individual gets end up as independent tasks.
>
> Typically, having pending persistently above 0 for ReadStage or
> MutationStage, especially if more than a handful, means that you are
> having a performance issue - either a capacity problem or something
> else - as incoming requests have to wait to be serviced. The most
> common effect is that you are bottlenecking on I/O and ReadStage
> pending shoots through the roof.
>
> There are exceptions. If you e.g. submit a really large multi-get of
> 5000, that will naturally lead to a spike (and if all 5000 of them
> need to go down to disk, the spike will survive for a bit). If you are
> ONLY doing these queries, that's not a problem per se. But if you are
> also expecting other requests to have low latency, then you want to
> avoid it.
>
> In general, batching is good - but don't overdo it, especially for
> reads, and especially if you're going to disk for the workload.
>
> --
> / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
>


Re: ParNew and caching

2011-12-10 Thread Edward Capriolo
I am not sure if there is a ticket on this, but I have always thought the
row cache should not bother caching an entry bigger than n columns.

Murmurs of a slice cache might help as well.

On Saturday, December 10, 2011, Peter Schuller wrote:
>> After re-reading my post, what I meant to say is that I switched from
>> Serializing cache provider to ConcurrentLinkedHash cache provider and then
>> saw better performance, but still far worse than no caching at all:
>>
>> - no caching at all : 25-30ms
>> - with Serializing provider : 1300+ms
>> - with Concurrent provider : 500ms
>>
>> 100% cache hit rate.  ParNew is the only stat that I see out of line, so
>> seems like still a lot of copying
>
> In general, if you want to get to the bottom of this stuff and you
> think GC is involved, always run with -XX:+PrintGC -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps so that the GC activity
> can be observed.
>
> 1300+ ms should not be due to GC unless you are falling back to full
> GCs (which would be visible with GC logging), and it should definitely
> be possible to keep full GCs from being extremely common (though
> eliminating them entirely may not be possible).
>
> --
> / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
>


Re: CPU bound workload

2011-12-10 Thread Edward Capriolo
Counter increment is a special case in Cassandra because counters incur a
local read before write. Normal column writes do not do this, so counter
writes are intensive. If possible, batch up the increments for fewer RPC
calls and fewer reads.
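
Something along these lines (untested sketch; the CounterBatch interface is
a made-up stand-in for your client's batch mutate call, not a real API):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch: coalesce many +1s for the same counter into one delta per key,
    // then send a single batched mutation instead of one RPC per increment.
    public class CoalescedCounters {

        // Made-up stand-in for a batch mutate call in your client library.
        interface CounterBatch {
            void incrementAll(Map<String, Long> deltasByRowKey);
        }

        static void flush(CounterBatch batch, List<String> rowKeysToBump) {
            Map<String, Long> deltas = new HashMap<String, Long>();
            for (String key : rowKeysToBump) {
                Long current = deltas.get(key);
                deltas.put(key, current == null ? 1L : current + 1L);
            }
            batch.incrementAll(deltas); // one RPC, and fewer reads on the replicas
        }
    }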

On Saturday, December 10, 2011, Peter Schuller wrote:
>> I've got a batch process running every so often that issues a bunch of
>> counter increments. I have noticed that when this process runs without being
>> throttled it will raise the CPU to 80-90% utilization on the nodes handling
>> the requests. This in turn causes timeouts and general lag on queries running
>> on the cluster.
>
> This much is entirely expected. If you are not bottlenecking anywhere
> else and are saturating the cluster, you will be bound by it, and it
> will affect the latency of other traffic, no matter how fast or slow
> Cassandra is.
>
> You do say "nodes handling the requests". Two things to always keep in
> mind are to (1) spread the requests evenly across all members of the
> cluster, and (2) if you are doing a lot of work per row key, spread it
> around and be concurrent so that you're not hitting a single row at a
> time, which will be under the responsibility of a single set of RF
> nodes (you want to put load on the entire cluster evenly if you want
> to maximize throughput).
>
>> Is there anything that can be done to increase the throughput, I've been
>> looking on the wiki and the mailing list and didn't find any optimization
>> suggestions (apart from spreading the load on more nodes).
>>
>> Cluster is 5 node, BOP, RF=3, AMD opteron 4174 CPU (6 x 2.3 Ghz cores),
>> Gigabit ethernet, RAID-0 SATA2 disks
>
> For starters, what *is* the throughput? How many counter mutations are
> you submitting per second?
>
> --
> / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
>


Re: CPU bound workload

2011-12-10 Thread Peter Schuller
> Counter increment is a special case in Cassandra because counters incur a
> local read before write. Normal column writes do not do this, so counter
> writes are intensive. If possible, batch up the increments for fewer RPC
> calls and fewer reads.

Note though that the CPU usage impact of this should be limited in
comparison to the impact when your reads end up going down to disk.
I.e., the most important performance characteristic to keep in mind is
that counter writes may need to read from disk, contrary to all other
writes in Cassandra, which only imply asynchronous sequential I/O.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: best practices for simulating transactions in Cassandra

2011-12-10 Thread Guy Incognito

you could try writing with the clock of the initial replay entry?

On 06/12/2011 20:26, John Laban wrote:
Ah, neat.  It is similar to what was proposed in (4) above with adding 
transactions to Cages, but instead of snapshotting the data to be 
rolled back (the "before" data), you snapshot the data to be replayed 
(the "after" data).  And then later, if you find that the transaction 
didn't complete, you just keep replaying the transaction until it takes.


The part I don't understand with this approach though:  how do you 
ensure that someone else didn't change the data between your initial 
failed transaction and the later replaying of the transaction?  You 
could get lost writes in that situation.


Dominic (in the Cages blog post) explained a workaround with that for 
his rollback proposal:  all subsequent readers or writers of that data 
would have to check for abandoned transactions and roll them back 
themselves before they could read the data.  I don't think this is 
possible with the XACT_LOG "replay" approach in these slides though, 
based on how the data is indexed (cassandra node token + timeUUID).



PS:  How are you liking Cages?




2011/12/6 Jérémy SEVELLEC


Hi John,

I had exactly the same thoughts.

I'm using ZooKeeper and Cages to lock and isolate.

But how do you roll back? It's impossible, so try replay instead!

The idea is explained in this presentation:
http://www.slideshare.net/mattdennis/cassandra-data-modeling (starting
from slide 24)

- insert your whole data into one column
- do the job
- remove (or expire) your column

If there is a problem while "doing the job", you keep the possibility
to replay it again and again (synchronously or in a batch).
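
A rough sketch of that pattern, purely illustrative (the Store interface
below is a placeholder, not a real API; how you serialize and apply the
mutations is up to your client, and the apply step must be idempotent):

    import java.util.List;

    // Sketch of the "replay entry" idea from the slides: persist the intended
    // work in one column first, do the work, then remove the column; anything
    // still present later can simply be replayed.
    public class ReplayLog {

        // Placeholder for whatever client/serialization you actually use.
        interface Store {
            void writeReplayEntry(String entryId, byte[] serializedMutations);
            List<String> pendingReplayEntries();
            byte[] readReplayEntry(String entryId);
            void applyMutations(byte[] serializedMutations); // must be idempotent
            void deleteReplayEntry(String entryId);
        }

        static void runWithReplay(Store store, String entryId, byte[] work) {
            store.writeReplayEntry(entryId, work); // 1. insert whole job into one column
            store.applyMutations(work);            // 2. do the job
            store.deleteReplayEntry(entryId);      // 3. remove (or expire) the column
        }

        // Run periodically (or on read): anything left over is replayed.
        static void replayPending(Store store) {
            for (String entryId : store.pendingReplayEntries()) {
                store.applyMutations(store.readReplayEntry(entryId));
                store.deleteReplayEntry(entryId);
            }
        }
    }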

Regards

Jérémy


2011/12/5 John Laban <j...@pagerduty.com>

Hello,

I'm building a system using Cassandra as a datastore, and I
have a few places where I am in need of transactions.

I'm using ZooKeeper to provide locking when I'm in need of
some concurrency control or isolation, so that solves that
half of the puzzle.

What I need now is to sometimes be able to get atomicity
across multiple writes by simulating the
"begin/rollback/commit" abilities of a relational DB.  In
other words, there are places where I need to perform multiple
updates/inserts, and if I fail partway through, I would
ideally be able to rollback the partially-applied updates.

Now, I *know* this isn't possible with Cassandra.  What I'm
looking for are all the best practices, or at least tips and
tricks, so that I can get around this limitation in Cassandra
and still maintain a consistent datastore.  (I am using quorum
reads/writes so that eventual consistency doesn't kick my ass
here as well.)

Below are some ideas I've been able to dig up.  Please let me
know if any of them don't make sense, or if there are better
approaches:


1) Updates to a row in a column family are atomic.  So try to
model your data so that you would only ever need to update a
single row in a single CF at once.  Essentially, you model
your data around transactions.  This is tricky but can
certainly be done in some situations.

2) If you are only dealing with multiple row *inserts* (and
not updates), have one of the rows act as a 'commit' by
essentially validating the presence of the other rows.  For
example, say you were performing an operation where you wanted
to create an Account row and 5 User rows all at once (this is
an unlikely example, but bear with me).  You could insert 5
rows into the Users CF, and then the 1 row into the Accounts
CF, which acts as the commit.  If something went wrong before
the Account could be created, any Users that had been created
so far would be orphaned and unusable, as your business logic
can ensure that they can't exist without an Account.  You
could also have an offline cleanup process that swept away
orphans.

3) Try to model your updates as idempotent column inserts
instead.  How do you model updates as inserts?  Instead of
munging the value directly, you could insert a column
containing the operation you want to perform (like "+5").  It
would work kind of like the Consistent Vote Counting
implementation: ( https://gist.github.com/41 ).  How do
you make the inserts idempotent?  Make sure the column names
correspond to a request ID or some other identifier that would
be identical across re-drives of a given (perhaps originally
failed) request.  This could leave your datastore in a
temporarily inconsistent state, but would eventually become
consistent af