[ https://issues.apache.org/jira/browse/KUDU-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexey Serbin updated KUDU-3124: -------------------------------- Description: As of now, catalog manager (a part of kudu-master) sends {{CreateTabletRequest}} and {{DeleteTabletRequest}} RPCs as soon as they are realized by {{CatalogManager::ProcessPendingAssignments()}} when processing the list of deferred DDL operations, and at this level there isn't any restrictions on how many of those might be in flight or sent to a particular tablet server (NOTE: there is {{\-\-max_create_tablets_per_ts}} flag, but it works on a higher level and only during initial creation of a table). The {{CreateTablet}} requests are sent asynchronously, and if the tablet isn't created within {{\-\-tablet_creation_timeout_ms}} milliseconds, catalog manager replaces all the tablet replicas, generating a new tablet UUID and sending corresponding {{CreateTabletRequest}} RPCs to a potentially different set of tablet servers. Corresponding {{DeleteTabletRequest}} RPCs (to remove the replicas of the stalled-during-creation tablet) are sent separately in an asynchronous way as well. There are at least two issues with this approach: # The {{\-\-max_create_tablets_per_ts}} threshold limits the number of concurrent requests hitting one tablet server only during the initial creation of a table. However, nothing limits how many requests to create a table replica might hit a tablet server when adding partitions to an existing table as a result of ALTER TABLE request. # {{DeleteTabletRequest}} RPCs sometimes might not get into the RPC queues of corresponding tablet servers, and catalog manager stops retrying sending those after {{\-\-unresponsive_ts_rpc_timeout_ms}} interval. This might spiral into a situation when requests to create replacement tablet replicas are passing through and executed by tablet servers, but corresponding requests to delete tablet replica cannot get through because of queue overflows, with catalog manager eventually giving up retrying the latter ones. Eventually, tablet servers end up with huge number of tablet replicas created, and they crash running out of memory. The crashed tablet servers cannot start after that because they eventually run out of memory trying to bootstrap the huge number of tablet replicas (running out of memory again). See https://gerrit.cloudera.org/#/c/15912/ for the reproduction scenario and [KUDU-2453|https://issues.apache.org/jira/browse/KUDU-2453] for corresponding issue reported some time ago. Ideally, the catalog manager should put all the generated {{\{Create,Delete\}TabletRequests}} into per-tserver queues and make sure at most {{N}} requests are being processed by a tablet server at a time (the {{N}} parameter is a replacement for {{\-\-max_create_tablets_per_ts}} in this safer approach). was: As of now, catalog manager (a part of kudu-master) sends {{CreateTabletRequest}} RPC as soon as they are realized by {{CatalogManager::ProcessPendingAssignments()}} when processing the list of deferred DDL operations, and at this level there isn't any restrictions on how many of those might be in flight or sent to a particular tablet server (NOTE: there is {{\-\-max_create_tablets_per_ts}} flag, but it works on a higher level and only during initial creation of a table). The {{CreateTablet}} requests are sent asynchronously, and if the tablet isn't created within {{\-\-tablet_creation_timeout_ms}} milliseconds, catalog manager replaces all the tablet replicas, generating a new tablet UUID and sending corresponding {{CreateTabletRequest}} RPCs to a potentially different set of tablet servers. Corresponding {{DeleteTabletRequest}} RPCs (to remove the replicas of the stalled-during-creation tablet) are sent separately in an asynchronous way as well. There are at least two issues with this approach: # The {{\-\-max_create_tablets_per_ts}} threshold limits the number of concurrent requests hitting one tablet server only during the initial creation of a table. However, nothing limits how many requests to create a table replica might hit a tablet server when adding partitions to an existing table as a result of ALTER TABLE request. # {{DeleteTabletRequest}} RPCs sometimes might not get into the RPC queues of corresponding tablet servers, and catalog manager stops retrying sending those after {{\-\-unresponsive_ts_rpc_timeout_ms}} interval. This might spiral into a situation when requests to create replacement tablet replicas are passing through and executed by tablet servers, but corresponding requests to delete tablet replica cannot get through because of queue overflows, with catalog manager eventually giving up retrying the latter ones. Eventually, tablet servers end up with huge number of tablet replicas created, and they crash running out of memory. The crashed tablet servers cannot start after that because they eventually run out of memory trying to bootstrap the huge number of tablet replicas (running out of memory again). See https://gerrit.cloudera.org/#/c/15912/ for the reproduction scenario and [KUDU-2453|https://issues.apache.org/jira/browse/KUDU-2453] for corresponding issue reported some time ago. Ideally, the catalog manager should put all the generated {{\{Create,Delete\}TabletRequests}} into per-tserver queues and make sure at most {{N}} requests are being processed by a tablet server at a time (the {{N}} parameter is a replacement for {{\-\-max_create_tablets_per_ts}} in this safer approach). > A safer way to handle CreateTablet requests > ------------------------------------------- > > Key: KUDU-3124 > URL: https://issues.apache.org/jira/browse/KUDU-3124 > Project: Kudu > Issue Type: Improvement > Components: master, tserver > Affects Versions: 1.2.0, 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, > 1.7.1, 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.12.0, 1.11.1, 1.13.0 > Reporter: Alexey Serbin > Priority: Major > > As of now, catalog manager (a part of kudu-master) sends > {{CreateTabletRequest}} and {{DeleteTabletRequest}} RPCs > as soon as they are realized by > {{CatalogManager::ProcessPendingAssignments()}} > when processing the list of deferred DDL operations, and at this level there > isn't any restrictions on how many of those might be in flight or sent to > a particular tablet server (NOTE: there is {{\-\-max_create_tablets_per_ts}} > flag, > but it works on a higher level and only during initial creation of a table). > The {{CreateTablet}} requests are sent asynchronously, and if the tablet isn't > created within {{\-\-tablet_creation_timeout_ms}} milliseconds, catalog > manager > replaces all the tablet replicas, generating a new tablet UUID and sending > corresponding {{CreateTabletRequest}} RPCs to a potentially different set of > tablet > servers. Corresponding {{DeleteTabletRequest}} RPCs (to remove the replicas > of the > stalled-during-creation tablet) are sent separately in an asynchronous way > as well. > There are at least two issues with this approach: > # The {{\-\-max_create_tablets_per_ts}} threshold limits the number of > concurrent requests hitting one tablet server only during the initial > creation of a table. However, nothing limits how many requests to create a > table replica might hit a tablet server when adding partitions to an existing > table as a result of ALTER TABLE request. > # {{DeleteTabletRequest}} RPCs sometimes might not get into the RPC queues of > corresponding tablet servers, and catalog manager stops retrying sending those > after {{\-\-unresponsive_ts_rpc_timeout_ms}} interval. This might spiral > into a situation when requests to create replacement tablet replicas are > passing through and executed by tablet servers, but corresponding requests to > delete tablet replica cannot get through because of queue overflows, with > catalog manager eventually giving up retrying the latter ones. Eventually, > tablet servers end up with huge number of tablet replicas created, and they > crash running out of memory. The crashed tablet servers cannot start after > that because they eventually run out of memory trying to bootstrap the huge > number of tablet replicas (running out of memory again). See > https://gerrit.cloudera.org/#/c/15912/ for the reproduction scenario and > [KUDU-2453|https://issues.apache.org/jira/browse/KUDU-2453] for corresponding > issue reported some time ago. > Ideally, the catalog manager should put all the generated > {{\{Create,Delete\}TabletRequests}} into per-tserver queues and make sure at > most {{N}} requests are being processed by a tablet server at a time (the > {{N}} parameter is a replacement for {{\-\-max_create_tablets_per_ts}} in > this safer approach). -- This message was sent by Atlassian Jira (v8.20.1#820001)