Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23823 )

Change subject: IMPALA-14618: Ensure processor clear waits for active event 
processing
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/23823/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/23823/1//COMMIT_MSG@12
PS1, Line 12: by those processors.
> DbProcessors in a DbEventExecutor share single thread. So only one of the D
Ack


http://gerrit.cloudera.org:8080/#/c/23823/1//COMMIT_MSG@14
PS1, Line 14: it could update
            : the cache with stale information after the metadata rebuild
> Below is the sequence of operations from logs:
What seems a bug is how CREATE_DATABASE event gets a null db here (so proceeds 
with addDb()):
https://github.com/apache/impala/blob/45051a27672c3a8a5be304ab2c71fc5c02946788/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java#L996

If catalog_.getDb() is invoked after reset(), it should see the db.
If catalog_.getDb() is invoked during reset(), it should wait until the db is 
loaded in reset(). But we just add the waiting for the initial reset which 
seems a bug.
https://github.com/apache/impala/blob/45051a27672c3a8a5be304ab2c71fc5c02946788/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java#L538
https://github.com/apache/impala/blob/45051a27672c3a8a5be304ab2c71fc5c02946788/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java#L500

This allows the CREATE_DATABASE event to saw the db doesn't exist and then 
decide to add it. Note that acquiring the catalog version write lock happens 
after the existence check. So reset() adds the db with the correct table list. 
But then CREATE_DATABASE event replace it with a db that has empty table list. 
It's OK if the following CREATE_TABLE event can add back the table. But the 
event is skipped since event processor is paused and then restarted at a newer 
event id.

I think the issue could also happen when Hierarchical Event Processing is 
disabled. At the begining of reset(), we set the event processor status to 
PAUSED and then move forward. But the event processor thread could still be in 
processing a CREATE_DATABASE event and then causes the same issue.

Maybe fixing the issue in getDb() is better.


http://gerrit.cloudera.org:8080/#/c/23823/1//COMMIT_MSG@18
PS1, Line 18: This change ensures that processor clear waits for any ongoing 
event
            : processing to complete before clearing internal state.
> IMO, we need to wait for event processing to completely pause before procee
A single ALTER_TABLE event could take minutes or longer to process if it 
reloads a large table. That already impacts the reset performance which then 
impacts the catalogd HA warm failover. If we fix the other bug mentioned above, 
we don't need this waiting. Otherwise, it's OK to let the thread modify the 
stale catalog objects concurrently, if the catalog object can't be referenced 
from CatalogServiceCatalog anymore.



--
To view, visit http://gerrit.cloudera.org:8080/23823
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib04ee2fa425cb4be397232a0065b6fdbfc1e244c
Gerrit-Change-Number: 23823
Gerrit-PatchSet: 2
Gerrit-Owner: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Sai Hemanth Gantasala <[email protected]>
Gerrit-Comment-Date: Mon, 12 Jan 2026 13:06:28 +0000
Gerrit-HasComments: Yes

Reply via email to