Hi Ben,

Thanks for your reply. We tested the same workload on kernel
version 4.6.4-1.el7.elrepo.x86_64 and found that the issue is not present
there.
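
For anyone else on CentOS 6 who wants to check whether a given kernel build
predates the fix, here is a minimal shell sketch. The fixed build number
(kernel-2.6.32-504.16.2.el6) is my reading of the DataStax article and Red
Hat's 6.6.z errata, so treat it as an assumption and verify it against your
distribution's errata before relying on this:

```shell
#!/bin/sh
# Sketch: decide whether a CentOS 6 kernel release string predates the
# futex_wait fix. ASSUMPTION: kernel-2.6.32-504.16.2.el6 is the first
# fixed build; confirm against your distro's errata.
FIXED="504.16.2"

# Extract the dotted build number between "2.6.32-" and ".el6".
rel_of() { printf '%s\n' "$1" | sed -n 's/^2\.6\.32-\(.*\)\.el6.*/\1/p'; }

# True (exit 0) if dotted release $1 >= dotted release $2.
# Field-wise numeric sort avoids `sort -V` surprises with mixed suffixes.
release_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -t. -k1,1n -k2,2n -k3,3n | tail -n1)" = "$1" ]
}

check() {  # e.g. check "$(uname -r)"
  if release_ge "$(rel_of "$1")" "$FIXED"; then echo patched; else echo vulnerable; fi
}

check "2.6.32-504.el6.x86_64"       # the affected 6.6 GA kernel
check "2.6.32-504.16.2.el6.x86_64"  # the assumed first fixed build
```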

This was causing very high system CPU under write workloads, an area in
which Cassandra normally excels, degrading performance by at least 5x. I
suggest this be noted in the Cassandra community wiki, as it could impact a
large audience.
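
To make such a wiki entry actionable, the changelog check from the DataStax
article (quoted below in Abhishek's mail) could be wrapped in a small helper
that works on both live rpm output and captured text. The sample changelog
line here is illustrative wording, not an exact quote from any specific
kernel build:

```shell
#!/bin/sh
# Sketch: wrap the article's "grep futex | grep ref" changelog check so it
# can be fed either live `rpm -q --changelog` output or saved text.
has_futex_fix() {
  grep -i futex | grep -qi ref
}

# Live use on RHEL/CentOS (requires rpm):
#   sudo rpm -q --changelog "kernel-$(uname -r)" | has_futex_fix && echo patched

# Illustrative changelog line (wording is an assumption, not an exact quote):
sample='- [kernel] futex: ensure get_futex_key_refs() always implies a barrier'
if printf '%s\n' "$sample" | has_futex_fix; then
  echo "fix entry found"
else
  echo "fix entry NOT found"
fi
```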

Thanks & Regards,
Bhuvan

On Tue, Nov 15, 2016 at 12:33 PM, Ben Bromhead <b...@instaclustr.com> wrote:

> Hi Abhishek
>
> The article with the futex bug description lists the solution, which is to
> upgrade to a version of RHEL or CentOS that has the specified patch.
>
> What help do you specifically need? If you need help upgrading the OS I
> would look at the documentation for RHEL or CentOS.
>
> Ben
>
> On Mon, 14 Nov 2016 at 22:48 Abhishek Gupta <gupta.abhis...@snapdeal.com>
> wrote:
>
> Hi,
>
> We are seeing an issue where the system CPU is shooting up to over 90%
> when the cluster is subjected to a relatively high write workload, i.e.
> 4k wreq/sec.
>
> 2016-11-14T13:27:47.900+0530 Process summary
>   process cpu=695.61%
>   application cpu=676.11% (user=200.63% sys=475.49%)  <== very high system CPU
>   other: cpu=19.49%
>   heap allocation rate 403mb/s
> [000533] user= 1.43% sys= 6.91% alloc= 2216kb/s - SharedPool-Worker-129
> [000274] user= 0.38% sys= 7.78% alloc= 2415kb/s - SharedPool-Worker-34
> [000292] user= 1.24% sys= 6.77% alloc= 2196kb/s - SharedPool-Worker-56
> [000487] user= 1.24% sys= 6.69% alloc= 2260kb/s - SharedPool-Worker-79
> [000488] user= 1.24% sys= 6.56% alloc= 2064kb/s - SharedPool-Worker-78
> [000258] user= 1.05% sys= 6.66% alloc= 2250kb/s - SharedPool-Worker-41
>
> On running strace, we found that the following system call is consuming
> most of the system CPU:
>
>  timeout 10s strace -f -p 5954 -c -q
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  88.33 1712.798399       16674    102723     22191 futex
>   3.98   77.098730        4356     17700           read
>   3.27   63.474795      394253       161        29 restart_syscall
>   3.23   62.601530       29768      2103           epoll_wait
>
> On searching, we found the following bug in the RHEL 6.6 / CentOS 6.6
> kernel, which seems to be a probable cause of the issue:
>
> https://docs.datastax.com/en/landing_page/doc/landing_page/troubleshooting/cassandra/fetuxWaitBug.html
>
> The patch mentioned in the doc is also not present in our kernel:
>
> sudo rpm -q --changelog kernel-`uname -r` | grep futex | grep ref
> - [kernel] futex_lock_pi() key refcnt fix (Danny Feng) [566347] {CVE-2010-0623}
>
> Could someone who has faced and resolved this issue help us here?
>
> Thanks,
> Abhishek
>
>
> --
> Ben Bromhead
> CTO | Instaclustr <https://www.instaclustr.com/>
> +1 650 284 9692
> Managed Cassandra / Spark on AWS, Azure and Softlayer
>
