Riza Suminto has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/23382 )
Change subject: IMPALA-14400: Fix deadlock in CatalogServiceCatalog.getDbProperty() ...................................................................... IMPALA-14400: Fix deadlock in CatalogServiceCatalog.getDbProperty() IMPALA-13850 (part 4) modify CatalogServiceCatalog.getDb() to delay looking up catalog cache until initial reset() is complete. EventProcessor can start processing event before reset() happen and obtain versionLock_.readLock() when calling CatalogServiceCatalog.getDbProperty(). Later on, it will hit deadlock when attempting to obtain versionLock_.writeLock() through getDb() / waitInitialResetCompletion(). This lock upgrade from read to write is unsafe. This patch mitigate the issue by changing waitInitialResetCompletion() to not acquire write lock. After this patch, it will sleep for 100ms before loop and checking again if initial reset has complete. Modified CatalogResetManager.fetchingDbs_ to ConcurrentLinkedQueue so that isActive() can be called without holding write lock. Add helper class ReadLockAndLookupDb and WriteLockAndLookupDb. Both will call waitInitialResetCompletion() before obtaining the appropriate lock. In case of WriteLockAndLookupDb, it additionally will call resetManager_.waitOngoingMetadataFetch() to ensure dbCache_ lookup is safe for write purpose. Skip calling catalog_.startEventsProcessor() in JniCatalog constructor. Instead, let CatalogServiceCatalog.reset() start it at the end of cache population. Added @Nullable annotations on CatalogServiceCatalog methods that can return null. Fixed some null check warnings that shows up afterwards. Remove dead code CatalogServiceCatalog.addUserIfNotExists() and CatalogOpExecutor.getCurrentEventId(). Testing: Increase TRIGGER_RESET_METADATA_DELAY from 1s to 3s in test_metadata_after_failover_with_delayed_reset. It was easy to hit the deadlock with 3s delay before the patch. No more deadlock happen after the patch. Run and pass test_catalogd_ha.py and test_restart_services.py exhaustively. Change-Id: I3162472ea9531add77886bf1d0d73460ff34d07a Reviewed-on: http://gerrit.cloudera.org:8080/23382 Reviewed-by: Impala Public Jenkins <[email protected]> Tested-by: Riza Suminto <[email protected]> --- M fe/src/main/java/org/apache/impala/catalog/CatalogResetManager.java M fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java M fe/src/main/java/org/apache/impala/service/JniCatalog.java M tests/custom_cluster/test_catalogd_ha.py 5 files changed, 179 insertions(+), 140 deletions(-) Approvals: Impala Public Jenkins: Looks good to me, approved Riza Suminto: Verified -- To view, visit http://gerrit.cloudera.org:8080/23382 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I3162472ea9531add77886bf1d0d73460ff34d07a Gerrit-Change-Number: 23382 Gerrit-PatchSet: 13 Gerrit-Owner: Riza Suminto <[email protected]> Gerrit-Reviewer: Anonymous Coward <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Jason Fehr <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Riza Suminto <[email protected]>
