[ 
https://issues.apache.org/jira/browse/KUDU-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351135#comment-17351135
 ] 

Abhishek commented on KUDU-1959:
--------------------------------

As a first step towards this issue, we could add a tablet server startup page
that shows the progress of the startup. We could break the tablet server
startup down into a few phases (something like initializing, reading the
metadata directory, reading the data directories, bootstrapping, connecting to
the masters). The phases that usually take the most time are reading the log
block containers (reading the data directories) and bootstrapping the tablets.
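
A rough sketch of how such a phase breakdown might be tracked, written in
Python only for brevity. The StartupPhase names and the StartupProgress class
are hypothetical, chosen to mirror the phases listed above; they are not
existing Kudu identifiers.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class StartupPhase(Enum):
    # Hypothetical phase names mirroring the list in the comment above.
    INITIALIZING = "initializing"
    READING_METADATA_DIR = "reading the metadata directory"
    READING_DATA_DIRS = "reading the data directories"
    BOOTSTRAPPING = "bootstrapping tablets"
    CONNECTING_TO_MASTERS = "connecting to the masters"


@dataclass
class PhaseProgress:
    done: int = 0
    total: Optional[int] = None  # None when the phase has nothing to count


@dataclass
class StartupProgress:
    phases: Dict[StartupPhase, PhaseProgress] = field(
        default_factory=lambda: {p: PhaseProgress() for p in StartupPhase})
    current: StartupPhase = StartupPhase.INITIALIZING

    def enter(self, phase: StartupPhase, total: Optional[int] = None) -> None:
        """Move to a new phase, optionally recording how many items it has."""
        self.current = phase
        self.phases[phase].total = total

    def advance(self, n: int = 1) -> None:
        """Record that n more items of the current phase were processed."""
        self.phases[self.current].done += n

    def summary(self) -> str:
        """One-line status that a startup page could display."""
        p = self.phases[self.current]
        if p.total is None:
            return self.current.value
        return f"{self.current.value}: {p.done}/{p.total}"
```

Only the two long phases would need a total; the others could simply be shown
as the current phase name.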

For these two phases we can show the total number of log block manager (LBM)
containers/tablets present and the number processed so far, to keep track of
the startup progress.
Now for the question of how to get the total number of LBM containers: since
we do not have a metric for that yet (and even if we had one, it would be reset
when the server restarts), we could just count the data files in the
configured data directories.
The total number of tablets is available after scanning the metadata directory.
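
To illustrate the file-counting idea, here is a minimal sketch in Python. It
assumes that each container has exactly one data file and that those files can
be recognized by a ".data" suffix; both the suffix and the function name
count_container_data_files are assumptions made for this example, not
something confirmed in this thread.

```python
import os
from typing import Iterable


def count_container_data_files(data_dirs: Iterable[str],
                               suffix: str = ".data") -> int:
    """Count files ending in `suffix` under each of the given directories.

    The ".data" default is an assumption about how container data files
    are named; adjust it to match the actual on-disk layout.
    """
    total = 0
    for data_dir in data_dirs:
        # Walk recursively in case the files sit in a subdirectory.
        for _root, _dirs, files in os.walk(data_dir):
            total += sum(1 for name in files if name.endswith(suffix))
    return total
```

The result would serve as the denominator for the "reading the data
directories" phase, with the numerator incremented as each container is
opened.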

Currently we start the tablet server web UI while the tablets are in the
bootstrapping phase. We could start the web UI before this phase, serving only
the tablet server startup progress page at first, and load the other pages once
we get to the bootstrapping phase.
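
The two-stage page registration could look roughly like the following Python
sketch. The WebUi class, the "/startup" and "/tablets" paths, and the helper
names are all hypothetical stand-ins used for illustration; the real web
server code is not shown in this thread.

```python
from typing import Callable, Dict, Tuple

PageHandler = Callable[[], str]


class WebUi:
    """Hypothetical stand-in for the web UI's path-to-handler registry."""

    def __init__(self) -> None:
        self._handlers: Dict[str, PageHandler] = {}

    def register(self, path: str, handler: PageHandler) -> None:
        self._handlers[path] = handler

    def handle(self, path: str) -> Tuple[int, str]:
        """Return (status code, body) for a request to `path`."""
        handler = self._handlers.get(path)
        if handler is None:
            return 404, "page not available yet"
        return 200, handler()


def start_early(ui: WebUi, progress_page: PageHandler) -> None:
    # Stage 1: before reading the data directories, expose only the
    # startup progress page (the path name is an assumption).
    ui.register("/startup", progress_page)


def finish_startup(ui: WebUi, pages: Dict[str, PageHandler]) -> None:
    # Stage 2: once bootstrapping begins, add the remaining pages.
    for path, handler in pages.items():
        ui.register(path, handler)
```

With this split, a request for any other page during the early phases gets a
clear "not available yet" answer instead of a refused connection.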

> Hard to tell when a cluster is done starting up
> -----------------------------------------------
>
>                 Key: KUDU-1959
>                 URL: https://issues.apache.org/jira/browse/KUDU-1959
>             Project: Kudu
>          Issue Type: Improvement
>          Components: ops-tooling
>            Reporter: Jean-Daniel Cryans
>            Assignee: Abhishek
>            Priority: Major
>              Labels: roadmap-candidate, usability
>
> Restarting a cluster that has a good amount of data, it's hard to tell when 
> it's "done". Right now the things I do:
>  - Run ksck, wait until most tablets are not in "unavailable" or 
> "bootstrapping" state.
>  - Watch the metrics and see when the data under management is close to where 
> it was before restarting (it grows as tablets are getting bootstrapped).
>  - Look at the tablet server web UIs for tablets, compare how many are done 
> bootstrapping VS in the process of VS not started.
> Ideas on how to improve this:
>  - In the master's web UI for tablet servers, show how many tablets are 
> running VS not running (I wouldn't add anything about tombstoned tablets)
>  - Add metrics for tablets in different states.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
