[ 
https://issues.apache.org/jira/browse/KUDU-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418154#comment-16418154
 ] 

Todd Lipcon commented on KUDU-2359:
-----------------------------------

I think the point is that servers are often configured to reboot automatically 
after a crash, and on reboot the Kudu daemon restarts automatically too, so no 
operator is involved in restarting the crashed service. Alternatively, an 
operator who is not a Kudu expert knows enough to see that a tserver has 
crashed and to restart the service, but isn't familiar enough with Kudu to 
start modifying flags. Additionally, maintaining a separate set of flags on 
different daemons in a cluster gets complex.

bq. It also begs the question, would operators even care about those failed 
tablets? If our re-replication story is robust enough to handle everything on 
its own, it could be seen as a pointless configuration. I suppose exposing it 
as a flag initially would give us that sort of info.

Right, I think in the common case you want the server to come back, and then 
it will notice the failed 25% of tablets and re-replicate them elsewhere. 
As things stand, the server is likely to be down for a day or two before the 
operator figures out the right way to run the 'update-dirs' tool, etc., and by 
the time they get the server back up, everything has already been 
re-replicated elsewhere.
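
To make the proposal concrete, here is a rough sketch of the kind of startup 
check being discussed, written in Python for brevity rather than as actual Kudu 
code. The function name check_data_dirs and the max_missing_fraction threshold 
are hypothetical and do not correspond to existing Kudu functions or flags; the 
25% default is only an illustrative value.

```python
import os
from typing import Iterable, List, Tuple


def check_data_dirs(
    data_dirs: Iterable[str],
    max_missing_fraction: float = 0.25,  # hypothetical threshold, not a real Kudu flag
) -> Tuple[List[str], List[str]]:
    """Split the configured dirs into (usable, missing).

    Raises RuntimeError only when more than max_missing_fraction of the
    configured directories are absent, instead of failing on any missing dir.
    """
    dirs = list(data_dirs)
    if not dirs:
        raise ValueError("no data directories configured")

    usable = [d for d in dirs if os.path.isdir(d)]
    missing = [d for d in dirs if not os.path.isdir(d)]

    # Tolerate a small number of missing dirs; refuse to start past the limit.
    if len(missing) > max_missing_fraction * len(dirs):
        raise RuntimeError(
            "too many data directories missing: %d of %d" % (len(missing), len(dirs))
        )
    return usable, missing
```

With a check along these lines, the daemon could start on the usable 
directories, mark the tablets that lived on the missing ones as failed, and 
leave the configured flags unchanged across the cluster.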

> tserver should allow starting with a small number of missing data dirs
> ----------------------------------------------------------------------
>
>                 Key: KUDU-2359
>                 URL: https://issues.apache.org/jira/browse/KUDU-2359
>             Project: Kudu
>          Issue Type: Improvement
>          Components: fs, tserver
>            Reporter: Todd Lipcon
>            Priority: Major
>
> Often when a disk fails, its mount point will not come back up when the 
> server is restarted. Currently, Kudu will respond to this by failing to 
> restart with an error like:
> F0314 18:23:39.353916 112051 tablet_server_main.cc:80] Check failed: _s.ok() 
> Bad status: Already present: FS layout already exists; not overwriting 
> existing layout. See 
> https://kudu.apache.org/releases/1.8.0-SNAPSHOT/docs/troubleshooting.html: 
> unable to create file system roots: FSManager roots already exist: 
> /data/1/kudu,/data/2/kudu,/data/3/kudu,/data/5/kudu,/data/6/kudu,/data/7/kudu,/data/8/kudu,/data/1/kudu-wal
> However, this defeats some of the advantages of the "allow single disk 
> failure" work. One could use the update_data_dirs tool to remove the missing 
> disk, but you'd also need to persistently change the configuration of the 
> daemon, which is hard to do with consistent configuration management.


