Hi Ben,

Thanks for your reply. We tested the same workload on kernel version 4.6.4-1.el7.elrepo.x86_64, and the issue does not occur there.
The bug had caused very high CPU usage under write workloads, the area in which Cassandra excels, degrading performance by at least 5x. I suggest mentioning this on the Cassandra community wiki, as it could affect a large audience.

Thanks & Regards,
Bhuvan

On Tue, Nov 15, 2016 at 12:33 PM, Ben Bromhead <b...@instaclustr.com> wrote:

> Hi Abhishek
>
> The article with the futex bug description lists the solution, which is to
> upgrade to a version of RHEL or CentOS that has the specified patch.
>
> What help do you specifically need? If you need help upgrading the OS I
> would look at the documentation for RHEL or CentOS.
>
> Ben
>
> On Mon, 14 Nov 2016 at 22:48 Abhishek Gupta <gupta.abhis...@snapdeal.com>
> wrote:
>
> > Hi,
> >
> > We are seeing an issue where the system CPU is shooting off to a figure of
> > 90% when the cluster is subjected to a relatively high write workload, i.e.
> > 4k wreq/secs.
> >
> > 2016-11-14T13:27:47.900+0530 Process summary
> >   process cpu=695.61%
> >   application cpu=676.11% (user=200.63% sys=475.49%)  <== Very High System CPU
> >   other: cpu=19.49%
> >   heap allocation rate 403mb/s
> > [000533] user= 1.43% sys= 6.91% alloc= 2216kb/s - SharedPool-Worker-129
> > [000274] user= 0.38% sys= 7.78% alloc= 2415kb/s - SharedPool-Worker-34
> > [000292] user= 1.24% sys= 6.77% alloc= 2196kb/s - SharedPool-Worker-56
> > [000487] user= 1.24% sys= 6.69% alloc= 2260kb/s - SharedPool-Worker-79
> > [000488] user= 1.24% sys= 6.56% alloc= 2064kb/s - SharedPool-Worker-78
> > [000258] user= 1.05% sys= 6.66% alloc= 2250kb/s - SharedPool-Worker-41
> >
> > On running strace, we found that the following system call is consuming
> > all the system CPU:
> >
> > timeout 10s strace -f -p 5954 -c -q
> > % time     seconds  usecs/call     calls    errors syscall
> > ------ ----------- ----------- --------- --------- ----------------
> >  88.33 1712.798399       16674    102723     22191 futex
> >   3.98   77.098730        4356     17700           read
> >   3.27   63.474795      394253       161        29 restart_syscall
> >   3.23   62.601530       29768      2103           epoll_wait
> >
> > On searching, we found that the following bug in the RHEL 6.6 / CentOS 6.6
> > kernel seems to be a probable cause of the issue:
> >
> > https://docs.datastax.com/en/landing_page/doc/landing_page/troubleshooting/cassandra/fetuxWaitBug.html
> >
> > The patch fix mentioned in the doc is also not present in our kernel.
> >
> > sudo rpm -q --changelog kernel-`uname -r` | grep futex | grep ref
> > - [kernel] futex_lock_pi() key refcnt fix (Danny Feng) [566347] {CVE-2010-0623}
> >
> > Can someone who has faced and resolved this issue help us here?
> >
> > Thanks,
> > Abhishek
>
> --
> Ben Bromhead
> CTO | Instaclustr <https://www.instaclustr.com/>
> +1 650 284 9692
> Managed Cassandra / Spark on AWS, Azure and Softlayer
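
For anyone who wants to repeat the before/after kernel comparison, here is a minimal sketch of how the `strace -c` summary quoted above could be parsed to rank syscalls by time spent. It is not something we ran as part of this investigation: the function name `parse_strace_summary`, the `ROW` pattern, and the `SAMPLE` text (copied from the figures in the quoted mail) are illustrative assumptions, and the pattern assumes the six-column summary layout shown in this thread.

```python
import re

# One data row of an `strace -c` summary, assuming the six-column layout
# quoted in this thread: % time, seconds, usecs/call, calls, errors, syscall.
# The errors column is blank for syscalls that never failed.
ROW = re.compile(
    r"^\s*(?P<pct>\d+\.\d+)\s+(?P<seconds>\d+\.\d+)\s+(?P<usecs>\d+)\s+"
    r"(?P<calls>\d+)\s+(?:(?P<errors>\d+)\s+)?(?P<syscall>[A-Za-z_]\w*)\s*$"
)


def parse_strace_summary(text):
    """Return one dict per syscall row, sorted by seconds, largest first."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header, separator, or unrelated output
        if match.group("syscall") == "total":
            continue  # skip the trailing totals row if it is present
        rows.append({
            "syscall": match.group("syscall"),
            "pct": float(match.group("pct")),
            "seconds": float(match.group("seconds")),
            "usecs_per_call": int(match.group("usecs")),
            "calls": int(match.group("calls")),
            "errors": int(match.group("errors") or 0),
        })
    return sorted(rows, key=lambda row: row["seconds"], reverse=True)


# Figures copied from the strace output quoted in this thread.
SAMPLE = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 88.33 1712.798399       16674    102723     22191 futex
  3.98   77.098730        4356     17700           read
  3.27   63.474795      394253       161        29 restart_syscall
  3.23   62.601530       29768      2103           epoll_wait
"""

if __name__ == "__main__":
    for row in parse_strace_summary(SAMPLE):
        print(f'{row["syscall"]:<16} {row["pct"]:6.2f}%  calls={row["calls"]}')
```

Running the same `strace` capture on both kernels and feeding each summary through a parser like this would give a like-for-like view of how much time `futex` takes on each.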