Re: Repair failed and crash the node, how to bring it back?

Martin Xue Thu, 01 Aug 2019 23:47:05 -0700

Thanks ASAD, I will look more into it.
Regards
Martin

On Thu, Aug 1, 2019 at 11:40 PM ZAIDI, ASAD A <az1...@att.com> wrote:


> I don’t think anyone can predict with certainty that the instance won’t
> crash again, but there is a good chance it will unless you take remedial
> action.
>
> If you are not doing subrange repair, a lot of Merkle tree data can
> potentially be scanned and streamed, taking a toll on memory. Combined
> with all the other running operations, that can easily exhaust the
> available memory.
>
>
>
> As a short-term measure, you can increase the allotted heap size and run
> subrange repair, either with this script
> <https://github.com/BrianGallew/cassandra_range_repair> or with the
> Reaper tool.
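
To illustrate what subrange repair means, here is a minimal Python sketch. It is not how the linked script or Reaper work internally. It splits one token range into smaller steps and builds a `nodetool repair` command for each, using the `-st`/`-et` (start/end token) options. The token bounds in the usage comment are hypothetical; in practice each subrange has to fall inside a range the node actually replicates, which the tools above work out from the ring for you.

```python
def split_range(start, end, steps):
    """Split the token range (start, end] into `steps` contiguous subranges."""
    if steps < 1 or end <= start:
        raise ValueError("need steps >= 1 and end > start")
    width = end - start
    # Integer arithmetic keeps the bounds exact for 64-bit tokens.
    bounds = [start + (width * i) // steps for i in range(steps)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))


def repair_commands(keyspace, start, end, steps):
    """Build one `nodetool repair` argument list per subrange."""
    return [
        ["nodetool", "repair", "--full", "-st", str(st), "-et", str(et), keyspace]
        for st, et in split_range(start, end, steps)
    ]


# Usage (hypothetical token bounds, for illustration only):
#   for cmd in repair_commands("keyspace_masked", 0, 1000, 4):
#       print(" ".join(cmd))
```

Smaller steps mean each repair session builds and compares less Merkle tree data at once, at the cost of more sessions overall.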
>
> You may also want to check the partition sizes of your tables (nodetool
> tablestats) to see whether they are bloated, and whether table scans are
> hitting lots of tombstones, which also adds to heap consumption. My $0.02
> for the moment.
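
As a rough sketch of that check, the Python below scans text shaped like `nodetool tablestats` output and flags tables whose largest compacted partition or maximum tombstones per slice exceed a threshold. The sample text is made up for illustration, the field labels are the ones I would expect from a 3.0-era `tablestats` and may differ in your version, and the thresholds (100 MB, 1000 tombstones) are assumptions to tune, not official limits.

```python
# Illustrative sample only: shaped like `nodetool tablestats` output, values invented.
SAMPLE = """\
Keyspace: keyspace_masked
    Table: events
    Compacted partition maximum bytes: 1629722
    Maximum tombstones per slice (last five minutes): 12
    Table: sessions
    Compacted partition maximum bytes: 386857368
    Maximum tombstones per slice (last five minutes): 4768
"""

PARTITION_KEY = "Compacted partition maximum bytes"
TOMBSTONE_KEY = "Maximum tombstones per slice (last five minutes)"


def flag_tables(text, max_partition_bytes=100 * 1024 * 1024, max_tombstones=1000):
    """Return sorted names of tables that exceed either (assumed) threshold."""
    flagged = set()
    table = None
    for line in text.splitlines():
        key, _, value = line.strip().partition(":")
        value = value.strip()
        if key == "Table":
            table = value
        elif table and value.isdigit():
            if key == PARTITION_KEY and int(value) > max_partition_bytes:
                flagged.add(table)
            elif key == TOMBSTONE_KEY and int(value) > max_tombstones:
                flagged.add(table)
    return sorted(flagged)


# Usage: feed it the captured output of `nodetool tablestats <keyspace>`.
#   print(flag_tables(SAMPLE))
```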
>
>
>
>
>
>
>
> *From:* Martin Xue [mailto:martin...@gmail.com]
> *Sent:* Wednesday, July 31, 2019 5:05 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Repair failed and crash the node, how to bring it back?
>
>
>
> Hi Alex,
>
>
>
> Thanks for your reply. Disk usage was around 80%. The crash happened
> during a primary-range full repair on a 1TB keyspace.
>
>
>
> Would it crash again?
>
>
>
> Thanks
>
> Regards
>
> Martin
>
>
>
> On Thu., 1 Aug. 2019, 12:04 am Alexander Dejanovski, <
> a...@thelastpickle.com> wrote:
>
> It looks like you have a corrupted hint file.
>
> Did the node run out of disk space while repair was running?
>
>
>
> You might want to move the hint files off their current directory and try
> to restart the node again.
>
> Since you'll have lost mutations then, you'll need... to run repair
> ¯\_(ツ)_/¯
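
A minimal Python sketch of the "move the hint files aside" step, to be run only while Cassandra is stopped on that node. The directory paths in the usage comment are assumptions; the real location is whatever `hints_directory` points to in your cassandra.yaml. It moves the files rather than deleting them, so they can still be inspected later.

```python
import shutil
from pathlib import Path


def move_hints(hints_dir, backup_dir):
    """Move every regular file from hints_dir into backup_dir.

    Returns the sorted list of file names that were moved.
    """
    src, dst = Path(hints_dir), Path(backup_dir)
    dst.mkdir(parents=True, exist_ok=True)
    moved = []
    for f in sorted(src.iterdir()):
        if f.is_file():
            shutil.move(str(f), str(dst / f.name))
            moved.append(f.name)
    return moved


# Usage (paths are assumptions, check hints_directory in cassandra.yaml):
#   move_hints("/var/lib/cassandra/hints", "/var/lib/cassandra/hints.bak")
```

With the hints directory empty, the startup code that reads the hint descriptors has nothing to trip over. The mutations held in those hints are lost to the other replicas until a repair runs.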
>
>
>
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
>
> http://www.thelastpickle.com
>
>
>
>
>
> On Wed, Jul 31, 2019 at 3:51 PM Martin Xue <martin...@gmail.com> wrote:
>
> Hi,
>
>
>
> I am running repair on production, starting with one of the 6 nodes in
> the cluster (3 nodes in each of two DCs). Cassandra version is 3.0.14.
>
>
>
> Running: repair -pr --full keyspace on node 1 (1TB of data). It ran for
> two days and then crashed.
>
>
>
> error shows:
>
> 3202]] finished (progress: 3%)
> Exception occurred during clean-up.
> java.lang.reflect.UndeclaredThrowableException
> Cassandra has shutdown.
> error: [2019-07-31 20:19:20,797] JMX connection closed. You should check server log for repair status of keyspace keyspace_masked (Subsequent keyspaces are not going to be repaired).
> -- StackTrace --
> java.io.IOException: [2019-07-31 20:19:20,797] JMX connection closed. You should check server log for repair status of keyspace keyspace_masked (Subsequent keyspaces are not going to be repaired).
>         at org.apache.cassandra.tools.RepairRunner.handleConnectionFailed(RepairRunner.java:97)
>         at org.apache.cassandra.tools.RepairRunner.handleConnectionClosed(RepairRunner.java:91)
>         at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:90)
>         at javax.management.NotificationBroadcasterSupport.handleNotification(NotificationBroadcasterSupport.java:275)
>         at javax.management.NotificationBroadcasterSupport$SendNotifJob.run(NotificationBroadcasterSupport.java:352)
>         at javax.management.NotificationBroadcasterSupport$1.execute(NotificationBroadcasterSupport.java:337)
>         at javax.management.NotificationBroadcasterSupport.sendNotification(NotificationBroadcasterSupport.java:248)
>         at javax.management.remote.rmi.RMIConnector.sendNotification(RMIConnector.java:441)
>         at javax.management.remote.rmi.RMIConnector.close(RMIConnector.java:533)
>         at javax.management.remote.rmi.RMIConnector.access$1300(RMIConnector.java:121)
>         at javax.management.remote.rmi.RMIConnector$RMIClientCommunicatorAdmin.gotIOException(RMIConnector.java:1534)
>         at javax.management.remote.rmi.RMIConnector$RMINotifClient.fetchNotifs(RMIConnector.java:1352)
>         at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchOneNotif(ClientNotifForwarder.java:655)
>         at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchNotifs(ClientNotifForwarder.java:607)
>         at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:471)
>         at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
>         at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)
>
>
>
> system.log shows
>
> INFO  [Service Thread] 2019-07-31 20:19:08,579 GCInspector.java:284 - G1 Young Generation GC in 2915ms.  G1 Eden Space: 914358272 -> 0; G1 Old Gen: 19043999248 -> 20219035248;
> INFO  [Service Thread] 2019-07-31 20:19:08,579 StatusLogger.java:52 - Pool Name                    Active   Pending      Completed   Blocked  All Time Blocked
> INFO  [Service Thread] 2019-07-31 20:19:08,584 StatusLogger.java:56 - MutationStage                    19        15     9578177305         0                 0
> INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - ViewMutationStage                 0         0              0         0                 0
> INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - ReadStage                        10         0      219357504         0                 0
> INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - RequestResponseStage              1         0      625174550         0                 0
> INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - ReadRepairStage                   0         0        2544772         0                 0
> INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - CounterMutationStage              0         0              0         0                 0
> INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - MiscStage                         0         0              0         0                 0
> INFO  [Service Thread] 2019-07-31 20:19:08,586 StatusLogger.java:56 - CompactionExecutor                1         1        9515493         0                 0
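
For anyone reading along, the GCInspector line above is the interesting one: a young-generation pause of almost three seconds, with the old generation growing from about 17.7 GiB to about 18.8 GiB. A small Python sketch for pulling those numbers out of such lines is below. The regular expression is written against the single line quoted in this thread and is an assumption about the format, so it may need adjusting for other collectors or versions.

```python
import re

# The GCInspector line quoted above, rejoined onto one line.
GC_LINE = (
    "INFO  [Service Thread] 2019-07-31 20:19:08,579 GCInspector.java:284 - G1 "
    "Young Generation GC in 2915ms.  G1 Eden Space: 914358272 -> 0; G1 Old Gen: "
    "19043999248 -> 20219035248;"
)

# Assumed format: "... GC in <N>ms. ... G1 Old Gen: <before> -> <after>"
GC_PATTERN = re.compile(r"GC in (\d+)ms\..*?G1 Old Gen: (\d+) -> (\d+)")


def parse_gc(line):
    """Return pause time and old-gen sizes from a GCInspector line, or None."""
    match = GC_PATTERN.search(line)
    if match is None:
        return None
    pause_ms, before, after = (int(g) for g in match.groups())
    return {
        "pause_ms": pause_ms,
        "old_gen_before": before,
        "old_gen_after": after,
        "old_gen_growth": after - before,
    }


# Usage: run parse_gc() over every GCInspector line in system.log and watch
# whether old_gen_after keeps climbing toward the configured heap size.
```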
>
>
>
>
>
> When I restarted Cassandra, it still failed to start.
>
> The error in system.log now shows:
>
>
>
> INFO  [main] 2019-07-31 21:35:02,044 StorageService.java:575 - Cassandra version: 3.0.14
> INFO  [main] 2019-07-31 21:35:02,044 StorageService.java:576 - Thrift API version: 20.1.0
> INFO  [main] 2019-07-31 21:35:02,044 StorageService.java:577 - CQL supported versions: 3.4.0 (default: 3.4.0)
> ERROR [main] 2019-07-31 21:35:02,075 CassandraDaemon.java:710 - Exception encountered during startup
> org.apache.cassandra.io.FSReadError: java.io.EOFException
>         at org.apache.cassandra.hints.HintsDescriptor.readFromFile(HintsDescriptor.java:142) ~[apache-cassandra-3.0.14.jar:3.0.14]
>         at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[na:1.8.0_171]
>         at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[na:1.8.0_171]
>         at java.util.Iterator.forEachRemaining(Iterator.java:116) ~[na:1.8.0_171]
>         at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) ~[na:1.8.0_171]
>         at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[na:1.8.0_171]
>         at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[na:1.8.0_171]
>         at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) ~[na:1.8.0_171]
>         at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[na:1.8.0_171]
>         at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) ~[na:1.8.0_171]
>         at org.apache.cassandra.hints.HintsCatalog.load(HintsCatalog.java:65) ~[apache-cassandra-3.0.14.jar:3.0.14]
>         at org.apache.cassandra.hints.HintsService.<init>(HintsService.java:88) ~[apache-cassandra-3.0.14.jar:3.0.14]
>         at org.apache.cassandra.hints.HintsService.<clinit>(HintsService.java:63) ~[apache-cassandra-3.0.14.jar:3.0.14]
>         at org.apache.cassandra.service.StorageProxy.<clinit>(StorageProxy.java:121) ~[apache-cassandra-3.0.14.jar:3.0.14]
>         at java.lang.Class.forName0(Native Method) ~[na:1.8.0_171]
>         at java.lang.Class.forName(Class.java:264) ~[na:1.8.0_171]
>         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:585) ~[apache-cassandra-3.0.14.jar:3.0.14]
>         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:570) ~[apache-cassandra-3.0.14.jar:3.0.14]
>         at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:346) [apache-cassandra-3.0.14.jar:3.0.14]
>         at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:569) [apache-cassandra-3.0.14.jar:3.0.14]
>         at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:697) [apache-cassandra-3.0.14.jar:3.0.14]
> Caused by: java.io.EOFException: null
>         at java.io.RandomAccessFile.readInt(RandomAccessFile.java:803) ~[na:1.8.0_171]
>         at org.apache.cassandra.hints.HintsDescriptor.deserialize(HintsDescriptor.java:237) ~[apache-cassandra-3.0.14.jar:3.0.14]
>         at org.apache.cassandra.hints.HintsDescriptor.readFromFile(HintsDescriptor.java:138) ~[apache-cassandra-3.0.14.jar:3.0.14]
>         ... 20 common frames omitted
>
>
>
>
>
> Can anyone help me bring the node back up?
>
>
>
> Also, anti-compactions (after repair) are running on the other nodes.
> Should I stop them as well? If so, how (nodetool stop compaction?)?
>
>
>
> Any suggestions will be much appreciated.
>
>
>
> Thanks
>
> Regards
>
> Martin
>
>
>
>
>
>
