Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/23823 )
Change subject: IMPALA-14618: Ensure processor clear waits for active event processing ...................................................................... Patch Set 2: (3 comments) http://gerrit.cloudera.org:8080/#/c/23823/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/23823/1//COMMIT_MSG@12 PS1, Line 12: by those processors. > DbProcessors in a DbEventExecutor share single thread. So only one of the D Ack http://gerrit.cloudera.org:8080/#/c/23823/1//COMMIT_MSG@14 PS1, Line 14: it could update : the cache with stale information after the metadata rebuild > Below is the sequence of operations from logs: What seems a bug is how CREATE_DATABASE event gets a null db here (so proceeds with addDb()): https://github.com/apache/impala/blob/45051a27672c3a8a5be304ab2c71fc5c02946788/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java#L996 If catalog_.getDb() is invoked after reset(), it should see the db. If catalog_.getDb() is invoked during reset(), it should wait until the db is loaded in reset(). But we just add the waiting for the initial reset which seems a bug. https://github.com/apache/impala/blob/45051a27672c3a8a5be304ab2c71fc5c02946788/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java#L538 https://github.com/apache/impala/blob/45051a27672c3a8a5be304ab2c71fc5c02946788/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java#L500 This allows the CREATE_DATABASE event to saw the db doesn't exist and then decide to add it. Note that acquiring the catalog version write lock happens after the existence check. So reset() adds the db with the correct table list. But then CREATE_DATABASE event replace it with a db that has empty table list. It's OK if the following CREATE_TABLE event can add back the table. But the event is skipped since event processor is paused and then restarted at a newer event id. I think the issue could also happen when Hierarchical Event Processing is disabled. At the begining of reset(), we set the event processor status to PAUSED and then move forward. But the event processor thread could still be in processing a CREATE_DATABASE event and then causes the same issue. Maybe fixing the issue in getDb() is better. http://gerrit.cloudera.org:8080/#/c/23823/1//COMMIT_MSG@18 PS1, Line 18: This change ensures that processor clear waits for any ongoing event : processing to complete before clearing internal state. > IMO, we need to wait for event processing to completely pause before procee A single ALTER_TABLE event could take minutes or longer to process if it reloads a large table. That already impacts the reset performance which then impacts the catalogd HA warm failover. If we fix the other bug mentioned above, we don't need this waiting. Otherwise, it's OK to let the thread modify the stale catalog objects concurrently, if the catalog object can't be referenced from CatalogServiceCatalog anymore. -- To view, visit http://gerrit.cloudera.org:8080/23823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib04ee2fa425cb4be397232a0065b6fdbfc1e244c Gerrit-Change-Number: 23823 Gerrit-PatchSet: 2 Gerrit-Owner: Anonymous Coward <[email protected]> Gerrit-Reviewer: Anonymous Coward <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Sai Hemanth Gantasala <[email protected]> Gerrit-Comment-Date: Mon, 12 Jan 2026 13:06:28 +0000 Gerrit-HasComments: Yes
