Thanks ASAD, I will look more into it.

Regards
Martin

On Thu, Aug 1, 2019 at 11:40 PM ZAIDI, ASAD A <az1...@att.com> wrote:
> I don’t think anyone can predict with certainty whether the instance will crash again, but there is a good chance it will unless you take remedial action.
>
> If you are not doing subrange repair, a lot of Merkle tree data can potentially be scanned/streamed, taking a toll on memory resources; that, taking account of all other running operations, can easily exhaust available memory.
>
> You can do a few things: as a short-term measure, increase the allotted heap size, and run subrange repair with a script <https://github.com/BrianGallew/cassandra_range_repair> or by using the Reaper tool.
>
> You may also want to check the partition sizes of your tables (nodetool tablestats) to see if they’re bloated. See if table scans are hitting lots of tombstones, which in turn also tax heap consumption. My $.002 cents for the moment.
>
> *From:* Martin Xue [mailto:martin...@gmail.com]
> *Sent:* Wednesday, July 31, 2019 5:05 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Repair failed and crash the node, how to bring it back?
>
> Hi Alex,
>
> Thanks for your reply. The disk space was around 80%. The crash happened during a full primary-range repair on a 1TB keyspace.
>
> Would it crash again?
>
> Thanks
> Regards
> Martin
>
> On Thu., 1 Aug. 2019, 12:04 am Alexander Dejanovski, <a...@thelastpickle.com> wrote:
>
> It looks like you have a corrupted hint file.
> Did the node run out of disk space while repair was running?
>
> You might want to move the hint files out of their current directory and try to restart the node again.
> Since you'll have lost mutations then, you'll need... to run repair ¯\_(ツ)_/¯
>
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On Wed, Jul 31, 2019 at 3:51 PM Martin Xue <martin...@gmail.com> wrote:
>
> Hi,
>
> I am running repair on production, starting with one of the 6 nodes in the cluster (3 nodes in each of two DCs). Cassandra version 3.0.14.
>
> Running: repair -pr --full keyspace on node 1 (1TB of data); it took two days and then crashed.
>
> The error shows:
>
> 3202]] finished (progress: 3%)
> Exception occurred during clean-up.
> java.lang.reflect.UndeclaredThrowableException
> Cassandra has shutdown.
> error: [2019-07-31 20:19:20,797] JMX connection closed. You should check server log for repair status of keyspace keyspace_masked (Subsequent keyspaces are not going to be repaired).
> -- StackTrace --
> java.io.IOException: [2019-07-31 20:19:20,797] JMX connection closed. You should check server log for repair status of keyspace keyspace_masked keyspaces are not going to be repaired).
>     at org.apache.cassandra.tools.RepairRunner.handleConnectionFailed(RepairRunner.java:97)
>     at org.apache.cassandra.tools.RepairRunner.handleConnectionClosed(RepairRunner.java:91)
>     at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:90)
>     at javax.management.NotificationBroadcasterSupport.handleNotification(NotificationBroadcasterSupport.java:275)
>     at javax.management.NotificationBroadcasterSupport$SendNotifJob.run(NotificationBroadcasterSupport.java:352)
>     at javax.management.NotificationBroadcasterSupport$1.execute(NotificationBroadcasterSupport.java:337)
>     at javax.management.NotificationBroadcasterSupport.sendNotification(NotificationBroadcasterSupport.java:248)
>     at javax.management.remote.rmi.RMIConnector.sendNotification(RMIConnector.java:441)
>     at javax.management.remote.rmi.RMIConnector.close(RMIConnector.java:533)
>     at javax.management.remote.rmi.RMIConnector.access$1300(RMIConnector.java:121)
>     at javax.management.remote.rmi.RMIConnector$RMIClientCommunicatorAdmin.gotIOException(RMIConnector.java:1534)
>     at javax.management.remote.rmi.RMIConnector$RMINotifClient.fetchNotifs(RMIConnector.java:1352)
>     at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchOneNotif(ClientNotifForwarder.java:655)
>     at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchNotifs(ClientNotifForwarder.java:607)
>     at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:471)
>     at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
>     at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)
>
> system.log shows:
>
> INFO [Service Thread] 2019-07-31 20:19:08,579 GCInspector.java:284 - G1 Young Generation GC in 2915ms. G1 Eden Space: 914358272 -> 0; G1 Old Gen: 19043999248 -> 20219035248;
> INFO [Service Thread] 2019-07-31 20:19:08,579 StatusLogger.java:52 - Pool Name                  Active   Pending     Completed   Blocked  All Time Blocked
> INFO [Service Thread] 2019-07-31 20:19:08,584 StatusLogger.java:56 - MutationStage                  19        15    9578177305         0                 0
> INFO [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - ViewMutationStage               0         0             0         0                 0
> INFO [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - ReadStage                      10         0     219357504         0                 0
> INFO [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - RequestResponseStage            1         0     625174550         0                 0
> INFO [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - ReadRepairStage                 0         0       2544772         0                 0
> INFO [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - CounterMutationStage            0         0             0         0                 0
> INFO [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - MiscStage                       0         0             0         0                 0
> INFO [Service Thread] 2019-07-31 20:19:08,586 StatusLogger.java:56 - CompactionExecutor              1         1       9515493         0                 0
>
> When I restart Cassandra, it still fails. Now the error in system.log shows:
>
> INFO [main] 2019-07-31 21:35:02,044 StorageService.java:575 - Cassandra version: 3.0.14
> INFO [main] 2019-07-31 21:35:02,044 StorageService.java:576 - Thrift API version: 20.1.0
> INFO [main] 2019-07-31 21:35:02,044 StorageService.java:577 - CQL supported versions: 3.4.0 (default: 3.4.0)
> ERROR [main] 2019-07-31 21:35:02,075 CassandraDaemon.java:710 - Exception encountered during startup
> org.apache.cassandra.io.FSReadError: java.io.EOFException
>     at org.apache.cassandra.hints.HintsDescriptor.readFromFile(HintsDescriptor.java:142) ~[apache-cassandra-3.0.14.jar:3.0.14]
>     at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[na:1.8.0_171]
>     at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[na:1.8.0_171]
>     at java.util.Iterator.forEachRemaining(Iterator.java:116) ~[na:1.8.0_171]
>     at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) ~[na:1.8.0_171]
>     at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[na:1.8.0_171]
>     at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[na:1.8.0_171]
>     at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) ~[na:1.8.0_171]
>     at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[na:1.8.0_171]
>     at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) ~[na:1.8.0_171]
>     at org.apache.cassandra.hints.HintsCatalog.load(HintsCatalog.java:65) ~[apache-cassandra-3.0.14.jar:3.0.14]
>     at org.apache.cassandra.hints.HintsService.<init>(HintsService.java:88) ~[apache-cassandra-3.0.14.jar:3.0.14]
>     at org.apache.cassandra.hints.HintsService.<clinit>(HintsService.java:63) ~[apache-cassandra-3.0.14.jar:3.0.14]
>     at org.apache.cassandra.service.StorageProxy.<clinit>(StorageProxy.java:121) ~[apache-cassandra-3.0.14.jar:3.0.14]
>     at java.lang.Class.forName0(Native Method) ~[na:1.8.0_171]
>     at java.lang.Class.forName(Class.java:264) ~[na:1.8.0_171]
>     at org.apache.cassandra.service.StorageService.initServer(StorageService.java:585) ~[apache-cassandra-3.0.14.jar:3.0.14]
>     at org.apache.cassandra.service.StorageService.initServer(StorageService.java:570) ~[apache-cassandra-3.0.14.jar:3.0.14]
>     at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:346) [apache-cassandra-3.0.14.jar:3.0.14]
>     at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:569) [apache-cassandra-3.0.14.jar:3.0.14]
>     at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:697) [apache-cassandra-3.0.14.jar:3.0.14]
> Caused by: java.io.EOFException: null
>     at java.io.RandomAccessFile.readInt(RandomAccessFile.java:803) ~[na:1.8.0_171]
>     at org.apache.cassandra.hints.HintsDescriptor.deserialize(HintsDescriptor.java:237) ~[apache-cassandra-3.0.14.jar:3.0.14]
>     at org.apache.cassandra.hints.HintsDescriptor.readFromFile(HintsDescriptor.java:138) ~[apache-cassandra-3.0.14.jar:3.0.14]
>     ... 20 common frames omitted
>
> Can anyone help me bring the node back up?
>
> Also, anti-compactions (after repair) are running on other nodes. Should I stop them as well, and if so, how (nodetool stop compaction?)?
>
> Any suggestions will be much appreciated.
>
> Thanks
> Regards
> Martin
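
For anyone who lands on this thread with the same startup error: the suggestion to move the hint files out of their directory before restarting could be scripted roughly as below. This is only a sketch under assumptions, not something posted in the thread. The function name quarantine_hints and both directory arguments are hypothetical; the real hints location is whatever hints_directory points to in cassandra.yaml. The sketch also assumes hint files end in .hints with a matching .crc32 checksum file, and that the node is stopped while the files are moved.

```python
import shutil
from pathlib import Path


def quarantine_hints(hints_dir, quarantine_dir):
    """Move hint files (and their checksum files) out of hints_dir.

    hints_dir and quarantine_dir are assumptions: pass the directory your
    cassandra.yaml names in hints_directory, and any directory on the same
    node that Cassandra does not read at startup.
    Returns the sorted names of the files that were moved.
    """
    src = Path(hints_dir)
    dst = Path(quarantine_dir)
    dst.mkdir(parents=True, exist_ok=True)

    moved = []
    for path in sorted(src.iterdir()):
        # Assumed naming: <host-id>-<timestamp>-<version>.hints plus .crc32
        if path.is_file() and path.suffix in {".hints", ".crc32"}:
            shutil.move(str(path), str(dst / path.name))
            moved.append(path.name)
    return moved
```

Files are moved rather than deleted so they can still be inspected afterwards. As noted earlier in the thread, the mutations in those hints are lost to the cluster, so a repair is still needed once the node is back up.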