Last week I moved all nodes back to Cassandra 3.9, and everything has worked fine since then. Yesterday I tried to upgrade again, running a rolling restart after the upgrade. The nodes were just fine. Today one node started consuming 94.6% of its CPU, and compaction is running constantly on this node. I'm afraid the remaining nodes will see their CPU usage climb over the next couple of days, as happened last week.
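For anyone following along, here is a minimal sketch for tallying the `Maximum memory usage reached` messages by thread pool and reported limit, to see which pools hit the cap. It assumes log lines shaped like the ones quoted in this thread. The names `LINE`, `tally`, and `SAMPLE` are made up for illustration and are not part of Cassandra.

```python
import re
from collections import Counter

# Assumed line shape (taken from the log excerpts in this thread):
#   INFO [CompactionExecutor:470] 2017-09-01 10:16:42,026 NoSpamLogger.java:91 -
#   Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB
# The thread name ends in ":<n>" or "-<n>"; that suffix is dropped to get the pool name.
LINE = re.compile(
    r"\[(?P<pool>[^\]]+?)[-:]\d+\]"
    r".*?Maximum memory usage reached \((?P<limit>[^)]+)\)"
)


def tally(lines):
    """Count 'Maximum memory usage reached' messages per (pool, limit) pair."""
    counts = Counter()
    for line in lines:
        match = LINE.search(line)
        if match:
            counts[(match.group("pool"), match.group("limit"))] += 1
    return counts


# Sample lines copied from the thread; the DEBUG line is there to show it is ignored.
SAMPLE = [
    "INFO [Native-Transport-Requests-11] 2017-09-01 15:31:42,722 NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB",
    "INFO [CompactionExecutor:470] 2017-09-01 10:16:42,026 NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB",
    "INFO [CompactionExecutor:475] 2017-09-01 10:31:42,032 NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB",
    "INFO [ReadStage-2] 2017-09-01 16:18:38,657 NoSpamLogger.java:91 - Maximum memory usage reached (2.000GiB), cannot allocate chunk of 1.000MiB",
    "DEBUG [GossipStage:1] 2017-09-01 15:33:27,046 FailureDetector.java:457 - Ignoring interval time of 2108299431 for /172.16.1.112",
]

if __name__ == "__main__":
    for (pool, limit), n in sorted(tally(SAMPLE).items()):
        print(f"{pool:30s} {limit:12s} {n}")
```

To run it against a real log, pass an open file handle to `tally` in place of `SAMPLE`. Note that these messages come from `NoSpamLogger`, which rate-limits repeated messages, so the counts are probably a lower bound rather than an exact number of failed allocations.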

2017-09-03 22:28 GMT-03:00 kurt greaves <k...@instaclustr.com>:

> Can't say that message explains why the compaction would be stuck.
> Generally not a good sign and you might need to investigate more but
> hopefully shouldn't be related. Has that stuck compaction moved since last
> week?
>
> On 1 September 2017 at 22:54, Fay Hou [Storage Service] <fay...@coupang.com> wrote:
>
>> try to do a rolling restart for the cluster before doing a compaction
>>
>> On Fri, Sep 1, 2017 at 3:09 PM, Igor Leão <igor.l...@ubee.in> wrote:
>>
>>> Some generic errors:
>>>
>>> [aladdin@ip-172-16-1-10 cassandra]$ tail cassandra.log | grep -i error
>>> [aladdin@ip-172-16-1-10 cassandra]$ tail cassandra.log | grep -i excep
>>> [aladdin@ip-172-16-1-10 cassandra]$ tail cassandra.log | grep -i fail
>>> [aladdin@ip-172-16-1-10 cassandra]$ tail debug.log | grep -i error
>>> [aladdin@ip-172-16-1-10 cassandra]$ tail debug.log | grep -i exce
>>> [aladdin@ip-172-16-1-10 cassandra]$ tail debug.log | grep -i fail
>>> DEBUG [GossipStage:1] 2017-09-01 15:33:27,046 FailureDetector.java:457 - Ignoring interval time of 2108299431 for /172.16.1.112
>>> DEBUG [GossipStage:1] 2017-09-01 15:33:29,051 FailureDetector.java:457 - Ignoring interval time of 2005507384 for /172.16.1.74
>>> DEBUG [GossipStage:1] 2017-09-01 15:33:45,968 FailureDetector.java:457 - Ignoring interval time of 2003371497 for /172.16.1.74
>>> DEBUG [GossipStage:1] 2017-09-01 15:33:51,133 FailureDetector.java:457 - Ignoring interval time of 2013260173 for /172.16.1.74
>>> DEBUG [GossipStage:1] 2017-09-01 15:33:58,981 FailureDetector.java:457 - Ignoring interval time of 2009620081 for /172.16.1.112
>>> DEBUG [GossipStage:1] 2017-09-01 15:34:19,235 FailureDetector.java:457 - Ignoring interval time of 2010956256 for /172.16.1.74
>>> DEBUG [GossipStage:1] 2017-09-01 15:34:19,235 FailureDetector.java:457 - Ignoring interval time of 2011127930 for /10.0.1.122
>>> [aladdin@ip-172-16-1-10 cassandra]$ tail system.log | grep -i error
>>> io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: Connection reset by peer
>>> [aladdin@ip-172-16-1-10 cassandra]$ tail system.log | grep -i exce
>>> INFO [Native-Transport-Requests-5] 2017-09-01 15:22:58,806 Message.java:619 - Unexpected exception during request; channel = [id: 0xdd63db2f, L:/10.0.1.47:9042 ! R:/10.0.44.196:41422]
>>> io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: Connection reset by peer
>>> [aladdin@ip-172-16-1-10 cassandra]$ tail system.log | grep -i fail
>>> io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: Connection reset by peer
>>>
>>> Some interesting errors:
>>>
>>> 1.
>>> DEBUG [ReadRepairStage:1] 2017-09-01 15:34:58,485 ReadCallback.java:242 - Digest mismatch:
>>> org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(5988282114260523734, 32623331326162652d633533332d343237632d626334322d306466643762653836343830) (023d99bbcf2263f0fa450c2312fdce88 vs a60ba37a46e0a61227a8b560fa4e0dfb)
>>>     at org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92) ~[apache-cassandra-3.11.0.jar:3.11.0]
>>>     at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:233) ~[apache-cassandra-3.11.0.jar:3.11.0]
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_112]
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_112]
>>>     at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) [apache-cassandra-3.11.0.jar:3.11.0]
>>>     at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_112]
>>>
>>> 2.
>>> INFO [Native-Transport-Requests-5] 2017-09-01 15:22:58,806 Message.java:619 - Unexpected exception during request; channel = [id: 0xdd63db2f, L:/10.0.1.47:9042 ! R:/10.0.44.196:41422]
>>> io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: Connection reset by peer
>>>     at io.netty.channel.unix.FileDescriptor.readAddress(...)(Unknown Source) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>>> INFO [Native-Transport-Requests-11] 2017-09-01 15:31:42,722 NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB
>>>
>>> INFO [CompactionExecutor:470] 2017-09-01 10:16:42,026 NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB
>>> INFO [CompactionExecutor:475] 2017-09-01 10:31:42,032 NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB
>>> INFO [CompactionExecutor:478] 2017-09-01 10:46:42,108 NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB
>>> INFO [CompactionExecutor:482] 2017-09-01 11:01:42,131 NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB
>>>
>>> About this last error, I tried to increase `file_cache_size_in_mb` of
>>> this node to 2048, but the error only changed to
>>> INFO [ReadStage-2] 2017-09-01 16:18:38,657 NoSpamLogger.java:91 - Maximum memory usage reached (2.000GiB), cannot allocate chunk of 1.000MiB
>>>
>>> 2017-09-01 9:07 GMT-03:00 kurt greaves <k...@instaclustr.com>:
>>>
>>>> are you seeing any errors in the logs? Is that one compaction still
>>>> getting stuck?
--
Igor Leão
Site Reliability Engineer
Mobile: +55 81 99727-1083
Skype: igorvpcleao
Office: +55 81 4042-9757
Website: inlocomedia.com <http://www.inlocomedia.com/>