[jira] [Created] (KUDU-3505) kudu ksck fails if healthy master in healthy cluster is started after the command

Jira Thu, 24 Aug 2023 02:53:09 -0700

Bakai Ádám created KUDU-3505:
--------------------------------

             Summary: kudu ksck fails if healthy master in healthy cluster is started after the command
                 Key: KUDU-3505
                 URL: https://issues.apache.org/jira/browse/KUDU-3505
             Project: Kudu
          Issue Type: Bug
          Components: master
            Reporter: Bakai Ádám



Environment: single master configuration
If the master is not running when the user starts a kudu cluster ksck command, 
the command tries to connect to the master over and over again. Once the master 
is started, ksck runs and reports a number of errors:
{code:java}
adambakai@abakai-MBP16 d % kudu cluster ksck localhost:8764 -ksck_format plain_full
Master Summary
               UUID               |    Address     | Status
----------------------------------+----------------+---------
 f41052a1ba8242d49ee5e16c0d60558a | localhost:8764 | HEALTHY
All reported replicas are:
  A = f41052a1ba8242d49ee5e16c0d60558a
The consensus matrix is:
 Config source | Replicas | Current term | Config index | Committed?
---------------+----------+--------------+--------------+------------
 A             | A*       | 16           | -1           | Yes

Flags of checked categories for Master:
        Flag         |                            Value                            |         Master
---------------------+-------------------------------------------------------------+-------------------------
 builtin_ntp_servers | 0.pool.ntp.org,1.pool.ntp.org,2.pool.ntp.org,3.pool.ntp.org | all 1 server(s) checked
 time_source         | system_unsync                                               | all 1 server(s) checked

Tablet Server Summary
               UUID               |    Address     | Status  | Location | Tablet Leaders | Active Scanners
----------------------------------+----------------+---------+----------+----------------+-----------------
 1938796538bf483f9bcd133e29aa645b | 127.0.0.1:9878 | HEALTHY | <none>   |       0        |       0
 8080a72aeb714c5087b8c515f21b1735 | 127.0.0.1:9870 | HEALTHY | <none>   |       1        |       0
 9f86252d00814cb3ae0ef6858ee31a02 | 127.0.0.1:9874 | HEALTHY | <none>   |       0        |       0
 c23de9c2b3e1448fa8dde2bb1a292388 | 127.0.0.1:9872 | HEALTHY | <none>   |       0        |       0
 fb700997c9274a9d8287eb3c765606d2 | 127.0.0.1:9876 | HEALTHY | <none>   |       0        |       0

Tablet Server Location Summary
 Location |  Count
----------+---------
 <none>   |       5

Flags of checked categories for Tablet Server:
        Flag         |                            Value                            |      Tablet Server
---------------------+-------------------------------------------------------------+-------------------------
 builtin_ntp_servers | 0.pool.ntp.org,1.pool.ntp.org,2.pool.ntp.org,3.pool.ntp.org | all 5 server(s) checked
 time_source         | system_unsync                                               | all 5 server(s) checked

Version Summary
     Version     |         Servers
-----------------+-------------------------
 1.18.0-SNAPSHOT | all 6 server(s) checked

Tablet Summary
Tablet 5d87f015c3a2438c8cec6e84796f9ecb of table 'db.test_table' is healthy.
  8080a72aeb714c5087b8c515f21b1735 (127.0.0.1:9870): RUNNING [LEADER]
  c23de9c2b3e1448fa8dde2bb1a292388 (127.0.0.1:9872): RUNNING
  9f86252d00814cb3ae0ef6858ee31a02 (127.0.0.1:9874): RUNNING
All reported replicas are:
  A = 8080a72aeb714c5087b8c515f21b1735
  B = c23de9c2b3e1448fa8dde2bb1a292388
  C = 9f86252d00814cb3ae0ef6858ee31a02
The consensus matrix is:
 Config source |   Replicas   | Current term | Config index | Committed?
---------------+--------------+--------------+--------------+------------
 master        | A*  B   C    |              |              | Yes
 A             | A*  B   C    | 3            | -1           | Yes
 B             | A*  B   C    | 3            | -1           | Yes
 C             | A*  B   C    | 3            | -1           | Yes

The cluster doesn't have any matching system tables
Summary by table
     Name      | RF | Status  | Total Tablets | Healthy | Recovering | Under-replicated | Unavailable
---------------+----+---------+---------------+---------+------------+------------------+-------------
 db.test_table | 3  | HEALTHY | 1             | 1       | 0          | 0                | 0

Tablet Replica Count Summary
   Statistic    | Replica Count
----------------+---------------
 Minimum        | 0
 First Quartile | 0
 Median         | 1
 Third Quartile | 1
 Maximum        | 1

Tablet Replica Count by Tablet Server
               UUID               |      Host      | Replica Count
----------------------------------+----------------+---------------
 1938796538bf483f9bcd133e29aa645b | 127.0.0.1:9878 | 0
 8080a72aeb714c5087b8c515f21b1735 | 127.0.0.1:9870 | 1
 9f86252d00814cb3ae0ef6858ee31a02 | 127.0.0.1:9874 | 1
 c23de9c2b3e1448fa8dde2bb1a292388 | 127.0.0.1:9872 | 1
 fb700997c9274a9d8287eb3c765606d2 | 127.0.0.1:9876 | 0

Total Count Summary
                | Total Count
----------------+-------------
 Masters        | 1
 Tablet Servers | 5
 Tables         | 1
 Tablets        | 1
 Replicas       | 3

OK
adambakai@abakai-MBP16 d % kudu cluster ksck localhost:8764 -ksck_format plain_full
Master Summary
            UUID            |    Address     |   Status
----------------------------+----------------+-------------
 <unknown> (localhost:8764) | localhost:8764 | UNAVAILABLE
Error from localhost:8764: Network error: Client connection negotiation failed: 
client connection to 127.0.0.1:8764: connect: Connection refused (error 61) 
(UNAVAILABLE)
All reported replicas are:
The consensus matrix is:
 Config source | Replicas | Current term | Config index | Committed?
---------------+----------+--------------+--------------+------------

Tablet Server Summary
Version Summary
 Version | Servers
---------+---------

Tablet Summary
Tablet 5d87f015c3a2438c8cec6e84796f9ecb of table 'db.test_table' is unavailable: 3 replica(s) not RUNNING
  8080a72aeb714c5087b8c515f21b1735: TS unavailable [LEADER]
  c23de9c2b3e1448fa8dde2bb1a292388: TS unavailable
  9f86252d00814cb3ae0ef6858ee31a02: TS unavailable
All reported replicas are:
  A = 8080a72aeb714c5087b8c515f21b1735
  B = c23de9c2b3e1448fa8dde2bb1a292388
  C = 9f86252d00814cb3ae0ef6858ee31a02
The consensus matrix is:
 Config source |        Replicas        | Current term | Config index | Committed?
---------------+------------------------+--------------+--------------+------------
 master        | A*  B   C              |              |              | Yes
 A             | [config not available] |              |              |
 B             | [config not available] |              |              |
 C             | [config not available] |              |              |

The cluster doesn't have any matching system tables
Summary by table
     Name      | RF |   Status    | Total Tablets | Healthy | Recovering | Under-replicated | Unavailable
---------------+----+-------------+---------------+---------+------------+------------------+-------------
 db.test_table | 3  | UNAVAILABLE | 1             | 0       | 0          | 0                | 1

Tablet Replica Count Summary
   Statistic    | Replica Count
----------------+---------------
 Minimum        | 1
 First Quartile | 1
 Median         | 1
 Third Quartile | 1
 Maximum        | 1

Tablet Replica Count by Tablet Server
               UUID               |    Host     | Replica Count
----------------------------------+-------------+---------------
 8080a72aeb714c5087b8c515f21b1735 | unavailable | 1
 9f86252d00814cb3ae0ef6858ee31a02 | unavailable | 1
 c23de9c2b3e1448fa8dde2bb1a292388 | unavailable | 1

Total Count Summary
                | Total Count
----------------+-------------
 Masters        | 1
 Tablet Servers | 0
 Tables         | 1
 Tablets        | 1
 Replicas       | 3

==================
Warnings:
==================
master unusual flags check error: 1 of 1 masters were not available to retrieve unusual flags
master diverged flags check error: 1 of 1 masters were not available to retrieve time_source category flags

==================
Errors:
==================
Network error: error fetching info from masters: failed to gather info from all masters: 1 of 1 had errors
Not found: master consensus error: no master consensus state available
Corruption: table consistency check error: 1 out of 1 table(s) are not healthy

FAILED
Runtime error: ksck discovered errors
{code}
These are not genuine errors: if the user reruns the command, it reports that 
the cluster is in OK state.
My suspicion is that the kudu ksck RPCs are executed before the master has 
finished booting, which is why these errors are shown. A possible solution would 
be for the master to accept requests from clients and the kudu CLI only once it 
has fully booted. 
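
The suggested fix could be sketched roughly as follows. This is a minimal Python illustration and not Kudu code: GatedServer, finish_startup, ServiceUnavailable and call_with_retry are hypothetical names, and the sketch assumes the server can tell when its own startup is complete. The server refuses requests until then, and the client treats that refusal as retryable instead of reporting it as a cluster error.

```python
import threading


class ServiceUnavailable(Exception):
    """Raised while the (hypothetical) server is still starting up."""


class GatedServer:
    """Toy server that refuses requests until startup has finished."""

    def __init__(self):
        self._ready = threading.Event()

    def finish_startup(self):
        # Called once all startup work is done. What "done" means for a real
        # master is an assumption left open here.
        self._ready.set()

    def handle(self, request):
        # Reject the request instead of answering from partially
        # initialized state.
        if not self._ready.is_set():
            raise ServiceUnavailable("server is still starting up")
        return f"handled {request}"


def call_with_retry(server, request, attempts=3, on_retry=lambda: None):
    """Retry a request while the server reports that it is not ready."""
    for _ in range(attempts):
        try:
            return server.handle(request)
        except ServiceUnavailable:
            on_retry()  # e.g. sleep before the next attempt
    raise ServiceUnavailable(f"not ready after {attempts} attempts")
```

With a gate like this, a tool in the position of ksck could keep waiting, or report a single "not ready" condition, rather than printing several failures that disappear on the next run.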



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
