[
https://issues.apache.org/jira/browse/HBASE-15219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136658#comment-15136658
]
Vishal Khandelwal commented on HBASE-15219:
-------------------------------------------
Hi [~tedyu] - I have applied your patch on the 0.98.17 branch on our 20-node
cluster. It does return a non-zero exit code, but the tool interrupts
execution on the first failure. So after the first failure the canary stops
and does not move forward. Ideally we should sniff all the regions, the logs
should show all failures at once, and the exit code should be non-zero.
{code}
2016-02-08 07:12:26,236 ERROR [Thread-6] tool.Canary - Run regionMonitor failed
java.lang.InterruptedException
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
at java.util.concurrent.AbstractExecutorService.invokeAll(AbstractExecutorService.java:244)
at org.apache.hadoop.hbase.tool.Canary.sniff(Canary.java:949)
at org.apache.hadoop.hbase.tool.Canary.access$200(Canary.java:91)
at org.apache.hadoop.hbase.tool.Canary$RegionMonitor.sniff(Canary.java:839)
at org.apache.hadoop.hbase.tool.Canary$RegionMonitor.run(Canary.java:762)
at java.lang.Thread.run(Thread.java:745)
2016-02-08 07:12:26,236 ERROR [pool-2-thread-9] tool.Canary - read from region
CAN_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1454667477865.00e77d07b8defe10704417fb99aa0418.
column family 0 failed
java.io.InterruptedIOException: Interrupted after 0 tries on 2
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:151)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:833)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:810)
at org.apache.hadoop.hbase.tool.Canary$RegionTask.read(Canary.java:255)
at org.apache.hadoop.hbase.tool.Canary$RegionTask.call(Canary.java:202)
at org.apache.hadoop.hbase.tool.Canary$RegionTask.call(Canary.java:182)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
EXIT_CODE =2
{code}
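To make the behaviour I am asking for concrete, here is a minimal,
self-contained sketch. It is not the Canary code and not the patch: the class
name SniffAllSketch, the methods sniffAll and exitCodeFor, and the isHealthy
predicate are all hypothetical stand-ins. Only ERROR_EXIT_CODE = 4 is taken
from the issue description. The idea is to let every region probe run to
completion, collect the failures, and only then decide the exit code.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

// Hypothetical illustration only; none of these names exist in HBase.
public class SniffAllSketch {
    // Same value as ERROR_EXIT_CODE quoted in the issue description.
    static final int ERROR_EXIT_CODE = 4;

    // Probes every region and returns the ones whose probe failed.
    // isHealthy is an assumed stand-in for the real read against a region.
    static List<String> sniffAll(List<String> regions, Predicate<String> isHealthy) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Callable<Void>> tasks = new ArrayList<>();
            for (String region : regions) {
                tasks.add(() -> {
                    if (!isHealthy.test(region)) {
                        throw new IOException("read from region " + region + " failed");
                    }
                    return null;
                });
            }
            // invokeAll blocks until every task has finished, and returns the
            // futures in the same order as the task list.
            List<Future<Void>> futures = pool.invokeAll(tasks);
            List<String> failed = new ArrayList<>();
            for (int i = 0; i < futures.size(); i++) {
                try {
                    futures.get(i).get();
                } catch (ExecutionException e) {
                    // Record the failure and keep going instead of aborting.
                    failed.add(regions.get(i));
                }
            }
            return failed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while sniffing", e);
        } finally {
            pool.shutdown();
        }
    }

    // Zero only when every region was readable.
    static int exitCodeFor(List<String> failed) {
        return failed.isEmpty() ? 0 : ERROR_EXIT_CODE;
    }

    public static void main(String[] args) {
        List<String> regions = Arrays.asList("r1", "r2", "r3");
        List<String> failed = sniffAll(regions, r -> !r.equals("r2"));
        for (String region : failed) {
            System.err.println("read from region " + region + " failed");
        }
        System.out.println("EXIT_CODE=" + exitCodeFor(failed));
    }
}
```

A real tool would pass the result of exitCodeFor to System.exit; the sketch
only prints it so that it can be run without terminating the JVM abnormally.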
> Canary tool does not return non-zero exit when one of region stuck state
> -------------------------------------------------------------------------
>
> Key: HBASE-15219
> URL: https://issues.apache.org/jira/browse/HBASE-15219
> Project: HBase
> Issue Type: Bug
> Components: canary
> Affects Versions: 0.98.16
> Reporter: Vishal Khandelwal
> Assignee: Ted Yu
> Priority: Critical
> Fix For: 2.0.0, 1.3.0, 1.2.1, 1.1.4, 1.0.4, 0.98.18
>
> Attachments: HBASE-15219.v1.patch, HBASE-15219.v3.patch,
> HBASE-15219.v4.patch
>
>
> {code}
> 2016-02-05 12:24:18,571 ERROR [pool-2-thread-7] tool.Canary - read from
> region
> CAN_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1454667477865.00e77d07b8defe10704417fb99aa0418.
> column family 0 failed
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> attempts=2, exceptions:
> Fri Feb 05 12:24:15 GMT 2016,
> org.apache.hadoop.hbase.client.RpcRetryingCaller@54c9fea0,
> org.apache.hadoop.hbase.NotServingRegionException:
> org.apache.hadoop.hbase.NotServingRegionException: Region
> CAN_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1454667477865.00e77d07b8defe10704417fb99aa0418.
> is not online on isthbase02-dnds1-3-crd.eng.sfdc.net,60020,1454669984738
> at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2852)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4468)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2984)
> at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:31186)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2149)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:104)
> at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
> at java.lang.Thread.run(Thread.java:745)
> --------
> -bash-4.1$ echo $?
> 0
> {code}
> The code below prints the error but does not set or return the exit code.
> Because of this, the tool can't be integrated with Nagios or other alerting.
> Ideally it should return an error for failures, as per the documentation:
> <snip>
> This tool will return non zero error codes to user for collaborating with
> other monitoring tools, such as Nagios. The error code definitions are:
> private static final int USAGE_EXIT_CODE = 1;
> private static final int INIT_ERROR_EXIT_CODE = 2;
> private static final int TIMEOUT_ERROR_EXIT_CODE = 3;
> private static final int ERROR_EXIT_CODE = 4;
> </snip>
> {code}
> org.apache.hadoop.hbase.tool.Canary.RegionTask
> public Void read() {
> ....
> try {
> table = connection.getTable(region.getTable());
> tableDesc = table.getTableDescriptor();
> } catch (IOException e) {
> LOG.debug("sniffRegion failed", e);
> sink.publishReadFailure(region, e);
> ...
> return null;
> }
> {code}