[ https://issues.apache.org/jira/browse/KUDU-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexey Serbin updated KUDU-3458:
--------------------------------
    Labels: scalability supportability troubleshooting  (was: )

> Continue loading other tablets even if metadata for some tablets failed to load
> -------------------------------------------------------------------------------
>
>                 Key: KUDU-3458
>                 URL: https://issues.apache.org/jira/browse/KUDU-3458
>             Project: Kudu
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Alexey Serbin
>            Priority: Major
>              Labels: scalability, supportability, troubleshooting
>
> kudu-tserver stops tablet bootstrapping if a single tablet's metadata fails
> to load (the kudu-tserver process exits on such an event, with the caveat
> of KUDU-3419).
>
> The current behavior requires manual intervention. In most cases, the
> reason behind the failure to load tablet metadata is a corrupted metadata
> file. The usual suspect behind such corruption is a power failure, kernel
> panic, or similar event where an open file isn't synced.
>
> In a cluster with many tablet servers, where RF=3, if a majority of tablet
> replicas is present, a corrupted metadata file could be handled
> automatically if the tablet server continued bootstrapping the other tablet
> replicas and eventually registered with the Kudu masters. The system catalog
> would detect that the tablet is under-replicated because one replica isn't
> running, and would re-replicate it elsewhere, sending DELETE_TABLET for the
> tablet replica that has the corrupted metadata file. That would be similar
> to what happens if the consensus metadata for a tablet replica were
> corrupted.
>
> It's necessary to update the code in {{TSTabletManager}} and allow
> {{TSTabletManager::Init()}} to complete successfully in such a case, marking
> the corresponding tablet replicas as failed to load (similar to what's done
> in the case of a replica's consensus metadata).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
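
The behavior proposed in the description above can be illustrated with a minimal C++ sketch. It is not Kudu code: ReplicaState, LoadResult, MetadataLoader and InitAllTablets are hypothetical names invented for this example, and the real change would live in TSTabletManager::Init(). The only point it shows is that a metadata load failure is recorded per tablet instead of aborting the whole initialization loop.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT Kudu's real types.
enum class ReplicaState { kRunning, kFailedToLoad };

struct LoadResult {
  bool ok;
  std::string error;  // empty when ok is true
};

// Hypothetical loader: reports the outcome of reading one tablet's metadata.
using MetadataLoader = std::function<LoadResult(const std::string& tablet_id)>;

// Visits every tablet ID. A failed metadata load marks that replica as
// failed-to-load and the loop moves on to the next tablet, so initialization
// as a whole still completes.
std::map<std::string, ReplicaState> InitAllTablets(
    const std::vector<std::string>& tablet_ids,
    const MetadataLoader& load_metadata) {
  std::map<std::string, ReplicaState> states;
  for (const auto& id : tablet_ids) {
    const LoadResult result = load_metadata(id);
    states[id] = result.ok ? ReplicaState::kRunning
                           : ReplicaState::kFailedToLoad;
  }
  return states;
}
```

In this sketch the caller gets a state for every tablet, so a server with one corrupted metadata file could still report the remaining replicas as running and leave the failed one for the masters to re-replicate and delete.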