You need to kill any running masters so that the FATE command can get the 
master lock.  Once the transactions are failed or deleted, restart the masters.
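A rough sequence for the above, as a sketch only - the process pattern, shell user, and transaction id are placeholders you will need to adjust for your install:

```shell
# Sketch only; adjust the pkill pattern / service name for your deployment.
# 1. Stop both masters so the fate command can acquire the master lock:
pkill -f 'accumulo.start.*master'      # or stop your accumulo-master service

# 2. Fail, then delete, the stuck transaction (id comes from 'fate print'):
accumulo shell -u root -e 'fate fail <txid>'
accumulo shell -u root -e 'fate delete <txid>'

# 3. Restart the masters:
accumulo master &                      # or start your accumulo-master service
```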

Ed Coleman

-----Original Message-----
From: Shailesh Ligade via user <user@accumulo.apache.org> 
Sent: Monday, October 17, 2022 9:22 AM
To: user@accumulo.apache.org
Subject: RE: [EXTERNAL EMAIL] - RE: accumulo 1.10.0 master log connection 
refused error

Well,

I tried `fate fail`, but it fails with the message:
ERROR: master lock is held, not running
Could not fail transaction:

For `fate delete` I also get a similar message.

So how can I just remove this transaction?

-S
-----Original Message-----
From: Ligade, Shailesh (ITADD) (CON)
Sent: Tuesday, October 11, 2022 11:44 AM
To: user@accumulo.apache.org
Subject: RE: [EXTERNAL EMAIL] - RE: accumulo 1.10.0 master log connection 
refused error

Thanks Ed,

This really helps.

I didn't see any other exception in the tserver log. Will check other things 
too.

Appreciated

-S

-----Original Message-----
From: Ed Coleman <edcole...@apache.org>
Sent: Tuesday, October 11, 2022 11:35 AM
To: user@accumulo.apache.org
Subject: [EXTERNAL EMAIL] - RE: accumulo 1.10.0 master log connection refused 
error

Is the table being compacted?  And if so, can compactions complete on that 
table?

You can scan hdfs under the table id and look for xx.rf_tmp files - those are 
the temporary files generated by a compaction and are swapped to xx.rf files 
when the compaction completes. Things to look for:

  - Are there other exceptions in the tserver log?
  - With a compaction running, is there one hdfs directory (or a few) whose 
.rf_tmp files persist much longer than the others? Use that info to track the 
rows in the metadata table and see what sticks out.
  - Are there xx.rf_tmp files that are huge?  
  - Is the compaction making progress?  (the .rf_tmp file size might change over 
time) - but if the compaction is just processing deletes, maybe not.
  - Are there hdfs directories that have a lot of files and the timestamp of 
the last compacted files is really old?
  - In the accumulo.metadata table, there should be a compaction count (once a 
tablet is compacted) - "srv:compact" - you may be able to scan the metadata for 
your table, find a compact id that is lagging the others and then use the row 
info for that id to isolate the tablet server that is hosting the data and then 
look at logs there. (This is assuming that the compaction is not completing)
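For the .rf_tmp checks above, something along these lines may help - a sketch only, where `/accumulo` is the default instance.volumes root and "xx" is a placeholder table id:

```shell
# Filter a file listing down to compaction temp files (.rf_tmp), which are
# renamed to .rf only when the compaction finishes successfully.
list_rf_tmp() {
  grep '\.rf_tmp$'
}

# Placeholder path: adjust /accumulo and the table id for your cluster.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -ls -R /accumulo/tables/xx | awk '{print $NF}' | list_rf_tmp
fi
```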

You likely can delete the FATE, but if there is an underlying issue, then at 
the next compaction it seems like it would just reoccur. You would be better 
off finding the issue and fixing it before deleting the FATE.
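To look at the per-tablet compaction ids mentioned above, a shell scan along these lines may work (hypothetical credentials; "xx" is the table id, whose metadata rows span `xx;` through `xx<`):

```shell
# Sketch: show the srv:compact column for every tablet of table id "xx".
# A tablet whose compact id lags the others points at the server to inspect.
accumulo shell -u root -e 'scan -t accumulo.metadata -b "xx;" -e "xx<" -c srv:compact'
```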

Ed Coleman

On 2022/10/11 12:44:01 Shailesh Ligade via user wrote:
> Looking at the fate print/dump
> 
> I do see repo: {
>    "org.apache.accumulo.master.tableOps.CompactRange" {
>        tableId: xx
>        namespace: default
>       }
> }
> 
> Does that mean it is stuck on a table compact operation but can't finish it 
> for whatever reason, and hence it drops the tserver connection?
> Is it safe to fail/delete this fate? What are the alternatives, if any?
> 
> Appreciate your help
> 
> -S
> 
> From: Shailesh Ligade via user <user@accumulo.apache.org>
> Sent: Tuesday, October 11, 2022 8:09 AM
> To: user@accumulo.apache.org
> Subject: [EXTERNAL EMAIL] - accumulo 1.10.0 master log connection 
> refused error
> 
> Hello,
> 
> I have a 25 node cluster with two masters. From time to time (every 4-5
> hours) I get this on a different tserver:
> org.apache.thrift.transport.TTransportException: 
> java.net.ConnectException: Connection refused
> Error closing output stream
> java.io.IOException: The stream is closed
>                 SocketOutputStream.write(SocketOutputStream.java:118)
> ...
>                 master.LiveTServerSet$TServerConnection.compact(LiveTServerSet.java:214)
>                 master.tableOps.CompactionDriver.isReady(CompactionDriver.java:168)
>                 master.tableOps.CompactionDriver.isReady(CompactionDriver.java:54)
>                 master.tableOps.TraceRepo.isReady(TraceRepo.java:47)
>                 fate.Fate$TransactionRunner.run(Fate.java:72)
> 
> Every time it's the same exception. What may be the issue? Is it stuck in some 
> fate operation?
> After this the tserver restarts (I have auto-restart configured).
> 
> How can I debug this further?
> Appreciate any response.
> 
> -S
> 
