Well, I tried fate fail, but it fails with the message
ERROR: master lock is held, not running
Could not fail transaction:
For fate delete I also get a similar message. So how can I just remove this transaction?

-S

-----Original Message-----
From: Ligade, Shailesh (ITADD) (CON)
Sent: Tuesday, October 11, 2022 11:44 AM
To: user@accumulo.apache.org
Subject: RE: [EXTERNAL EMAIL] - RE: accumu 1.10.0 master log connection refised error

Thanks Ed,

This really helps. I didn't see any other exception in the tserver log. Will check other things too.

Appreciated

-S

-----Original Message-----
From: Ed Coleman <edcole...@apache.org>
Sent: Tuesday, October 11, 2022 11:35 AM
To: user@accumulo.apache.org
Subject: [EXTERNAL EMAIL] - RE: accumu 1.10.0 master log connection refised error

Is the table being compacted? And then, can compactions be completed on that table? You can scan hdfs under the table id, and look for xx.rf_tmp files - those are the temporary files generated by the compaction and would be swapped to xx.rf files when the compaction completes.

Things to look for:

- Are there other exceptions in the tserver log?
- With a compaction running, is there one (or a few) hdfs directories that have rf_tmp files much longer than the others - use that info to track the rows in the metadata table and see what sticks out.
- Are there xx.rf_tmp files that are huge?
- Is the compaction making progress? (the rf_tmp file size might change over time) - but if the compaction is just processing deletes, maybe not.
- Are there hdfs directories that have a lot of files and the timestamp of the last compacted files is really old?
- In the accumulo.metadata table, there should be a compaction count (once a tablet is compacted) - "srv:compact" - you may be able to scan the metadata for your table, find a compact id that is lagging the others and then use the row info for that id to isolate the tablet server that is hosting the data and then look at logs there.
(This is assuming that the compaction is not completing)

You likely can delete the FATE, but if there is an underlying issue, then at the next compaction it seems like it would just reoccur. You would be better off finding the issue and fixing it before deleting the FATE.

Ed Coleman

On 2022/10/11 12:44:01 Shailesh Ligade via user wrote:
> Looking at the fate print/dump
>
> I do see repo: {
> "org.apache.accumulo.master.tableOps.CompactRange" {
> tableId: xx
> namespace: default
> }
> }
>
> Does that mean it is stuck on the table compact operation but can't finish it for
> whatever reason, and hence it drops the tserver connection?
> Is it safe to fail/delete this fate? What are the alternatives, if any?
>
> Appreciate your help
>
> -S
>
> From: Shailesh Ligade via user <user@accumulo.apache.org>
> Sent: Tuesday, October 11, 2022 8:09 AM
> To: user@accumulo.apache.org
> Subject: [EXTERNAL EMAIL] - accumu 1.10.0 master log connection refised error
>
> Hello,
>
> I have a 25 node cluster with two masters. From time to time (every 4/5
> hours) I get this on a different tserver:
>
> org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
> Error closing output stream
> java.io.IOException: The stream is closed
> SocketOutputStream.write(SocketOutputStream.java:118)
> ...
> master.LiveTServerSet$TServerConnection.compact(LiveTServerSet.java:214)
> master.tableOps.CompactionDriver.isReady(CompactionDriver:168)
> master.tableOps.CompactionDriver.isReady(CompactionDriver:54)
> master.tableOps.TraceRepo.isReady(TraceRepo.java:47)
> fate.Fate$TransactionRunner.run(Fate.java:72)
>
> Every time it's the same exception. What may be the issue? Is it stuck in some fate
> operation?
> After this the tserver restarts (I have it system, with auto restart flag)
>
> How to debug this further?
> Appreciate any response.
>
> -S
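
For anyone following Ed's suggestion to look for xx.rf_tmp files under the table id, here is a rough sketch of how the listing could be filtered. It is only an illustration and not part of Accumulo: it assumes the text comes from `hdfs dfs -ls -R` on the table directory, in the usual 8-column format (permissions, replication, owner, group, size, date, time, path). The names `TmpFile` and `find_rf_tmp` are made up for this example.

```python
from datetime import datetime
from typing import List, NamedTuple


class TmpFile(NamedTuple):
    """One compaction temp file from the listing (hypothetical helper type)."""
    path: str
    size: int
    modified: datetime


def find_rf_tmp(listing: str) -> List[TmpFile]:
    """Pick the *.rf_tmp entries out of `hdfs dfs -ls -R` output.

    Assumes 8 whitespace-separated columns per entry:
    perms, replication, owner, group, size, date, time, path.
    """
    found = []
    for line in listing.splitlines():
        parts = line.split(None, 7)
        # Skip anything that is not a full entry or not a compaction temp file.
        if len(parts) != 8 or not parts[7].endswith(".rf_tmp"):
            continue
        modified = datetime.strptime(parts[5] + " " + parts[6], "%Y-%m-%d %H:%M")
        found.append(TmpFile(parts[7], int(parts[4]), modified))
    # Oldest first: a temp file that has been around much longer than the
    # others is the one worth tracking back to a tablet in the metadata table.
    return sorted(found, key=lambda f: f.modified)
```

The listing could be saved with something like `hdfs dfs -ls -R /accumulo/tables/xx > listing.txt` (the `/accumulo/tables` prefix is an assumption, it depends on how the instance volumes are configured) and then passed in as `find_rf_tmp(open("listing.txt").read())`. Running it twice some minutes apart and comparing the sizes gives a crude answer to "is the compaction making progress?".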