[ https://issues.apache.org/jira/browse/CASSANDRA-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13828845#comment-13828845 ]
Benedict edited comment on CASSANDRA-1632 at 11/21/13 1:21 PM:
---------------------------------------------------------------

Thanks for the full write-up.

To expand a little on this: on the paths tested by stress, a request currently passes through between 3 and 5 threads, depending on the route taken. Requests that go to the "wrong" server (and are re-routed), which is the majority for stress as it stands, go Thrift/Netty->OTC; on the correct server they go ITC->RS/WS->OTC. There isn't a lot that can be done to reduce the hand-off here, although if one day we had an in-process cache we might be able to skip the RS for requests that can be handled from the cache. It's also possible we could do this for the WS once we have a non-blocking write path.

For requests going to the "correct" server we often have an unnecessary step of Thrift/Netty->RS/WS(->Thrift/Netty). I have a patch that skips the last two stages, by using a TPE that permits same-thread execution if it has idle threads and the calling thread is registered to support it (and forbids a thread in the pool from activating until the execution completes). This gives a 15% bump in single-node performance, but without smart routing it is rapidly lost amongst the cluster. However, since we have "smart" routing in the Java driver, this may be worth reconsidering - the only problem being that the optimisation only really works with blocking IO, and our native protocol currently only supports non-blocking IO.

was (Author: benedict):
Thanks for the full write up.

To expand a little on this, on the paths tested by stress there are currently between 3 and 5 threads a request goes through, depending on the route taken. Requests that go to the "wrong" server (and are re-routed), which is a majority for stress as it stands, go Thrift/Netty->OTC; and on the correct server go ITC->RS/WS->OTC. There isn't a lot that can be done to reduce the hand-off here, although if one day we had an in-process cache, we might be able to skip the RS for requests that can be handled from the cache. It's also possible we could do this for the WS once we have a non-blocking write path.

For requests going to the "correct" server we do often have an unnecessary step of Thrift/Netty->RS/WS->OTC. I have a patch that skips the middle stage, by using a TPE that permits same-thread execution if it has idle threads and the calling thread is registered to support it (and forbids a thread in the pool from activating until the execution completes). This gives a 15% bump in single node performance, but without smart routing this is rapidly lost amongst the cluster. However since we have "smart" routing in the Java driver, this may be worth reconsidering.
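For illustration, the sketch below shows one shape a "same-thread execution" TPE could take: a registered calling thread runs the task inline when the pool has a spare worker slot, and a semaphore keeps a pool thread from activating while the inline execution holds that slot. This is not the patch Benedict refers to; the class name, the registration mechanism and the semaphore-based reservation are all assumptions made for this sketch.

{code:java}
import java.util.Set;
import java.util.concurrent.*;

// Sketch only, not the CASSANDRA-1632 patch: an executor that runs a task on the
// calling thread when that thread has opted in and a worker slot is free,
// avoiding the hand-off from Thrift/Netty to a stage thread.
public class SameThreadBiasedExecutor extends ThreadPoolExecutor
{
    // threads that have registered themselves as safe for inline execution
    private final Set<Thread> inlineCapable = ConcurrentHashMap.newKeySet();
    // one permit per worker slot; an inline execution temporarily consumes one
    private final Semaphore workerPermits;

    public SameThreadBiasedExecutor(int threads)
    {
        super(threads, threads, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        workerPermits = new Semaphore(threads);
    }

    // opt the current thread in to same-thread execution
    public void registerCallingThread()
    {
        inlineCapable.add(Thread.currentThread());
    }

    @Override
    public void execute(Runnable task)
    {
        // if the caller is registered and a spare worker slot is available,
        // run inline instead of queueing
        if (inlineCapable.contains(Thread.currentThread()) && workerPermits.tryAcquire())
        {
            try
            {
                task.run();
            }
            finally
            {
                workerPermits.release();
            }
            return;
        }
        super.execute(task);
    }

    @Override
    protected void beforeExecute(Thread t, Runnable r)
    {
        // a pool worker must also hold a permit, so it cannot activate while an
        // inline execution is using the slot
        workerPermits.acquireUninterruptibly();
        super.beforeExecute(t, r);
    }

    @Override
    protected void afterExecute(Runnable r, Throwable t)
    {
        super.afterExecute(r, t);
        workerPermits.release();
    }
}
{code}

Under these assumptions, a Thrift/Netty thread would call registerCallingThread() once, after which execute() runs work inline whenever a slot is free, saving the cross-thread hand-off described in the comment.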
> Thread workflow and cpu affinity
> --------------------------------
>
>         Key: CASSANDRA-1632
>         URL: https://issues.apache.org/jira/browse/CASSANDRA-1632
>     Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>    Reporter: Chris Goffinet
>    Assignee: Jason Brown
>      Labels: performance
> Attachments: 1632_batchRead-v1.diff, threadAff_reads.txt, threadAff_writes.txt
>
>
> Here are some thoughts I wanted to write down; we need to run some serious benchmarks to see the benefits:
> 1) All thread pools for our stages use a shared queue per stage. For some stages we could move to a model where each thread has its own queue. This would reduce lock contention on the shared queue. This workload only suits the stages that have no variance, else you run into thread starvation. Some stages where this might work: ROW-MUTATION.
> 2) Set cpu affinity for each thread in each stage. If we can pin threads to specific cores, and control the workflow of a message from Thrift down to each stage, we should see improvements in reducing L1 cache misses. We would need to build a JNI extension (to set cpu affinity), as I could not find anywhere in the JDK where it is exposed.
> 3) Batching the delivery of requests across stage boundaries. Peter Schuller hasn't looked deeply enough into the JDK yet, but he thinks there may be significant improvements to be had there, especially in high-throughput situations, if on each consumption you were to consume everything in the queue rather than implying a synchronization point between each request.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
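As a rough illustration of point 1) in the quoted description, a stage could trade its single shared queue for one queue per worker by running one single-threaded executor per worker and spreading tasks across them. The names and the round-robin dispatch below are assumptions made for this sketch, not code from the attached patches.

{code:java}
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a stage with one private queue per worker thread instead of a
// single shared queue, trading lock contention for the starvation caveat
// mentioned in point 1).
public class PerThreadQueueStage
{
    private final ExecutorService[] workers;
    private final AtomicLong counter = new AtomicLong();

    public PerThreadQueueStage(int threads)
    {
        workers = new ExecutorService[threads];
        for (int i = 0; i < threads; i++)
            workers[i] = Executors.newSingleThreadExecutor(); // private queue per thread
    }

    // round-robin dispatch: only behaves well when task costs have little
    // variance, otherwise some queues back up while others sit idle
    public void submit(Runnable task)
    {
        int idx = (int) (counter.getAndIncrement() % workers.length);
        workers[idx].execute(task);
    }

    public void shutdown()
    {
        for (ExecutorService e : workers)
            e.shutdown();
    }
}
{code}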
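For point 2), the JDK exposes no thread-affinity API, so pinning a stage thread to a core would go through a small native shim around the platform call (e.g. sched_setaffinity(2) or pthread_setaffinity_np on Linux). The sketch below only shows the shape such a JNI binding might take; the library name, the native method and its C implementation are hypothetical and are not part of this ticket's attachments.

{code:java}
// Hypothetical JNI binding for thread affinity; the native library and its
// implementation are assumed, not provided here.
public final class ThreadAffinity
{
    static
    {
        System.loadLibrary("cassandraaffinity"); // hypothetical JNI library name
    }

    // pin the calling thread to the given CPU; implemented natively,
    // e.g. via sched_setaffinity(2)
    public static native void pinCurrentThreadToCpu(int cpu);

    // example: pin a stage worker to its own core as the thread starts
    public static Thread newPinnedWorker(int cpu, Runnable stageLoop)
    {
        return new Thread(() -> {
            pinCurrentThreadToCpu(cpu);
            stageLoop.run();
        }, "Stage-worker-cpu-" + cpu);
    }
}
{code}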
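For point 3), the batching idea can be expressed with BlockingQueue.drainTo: block for the first task, then drain whatever else is already queued, so the queue is synchronised on roughly once per batch rather than once per request. This is a simple sketch under those assumptions, not the approach in 1632_batchRead-v1.diff.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

// Sketch of a stage worker that consumes its queue in batches instead of one
// task at a time.
public class BatchDrainingWorker implements Runnable
{
    private final BlockingQueue<Runnable> queue;
    private volatile boolean shutdown;

    public BatchDrainingWorker(BlockingQueue<Runnable> queue)
    {
        this.queue = queue;
    }

    @Override
    public void run()
    {
        List<Runnable> batch = new ArrayList<>();
        while (!shutdown)
        {
            try
            {
                // block for the first task, then grab everything else already queued
                batch.add(queue.take());
                queue.drainTo(batch);
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt();
                return;
            }
            for (Runnable task : batch)
                task.run();
            batch.clear();
        }
    }

    public void shutdown()
    {
        shutdown = true;
    }
}
{code}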