[ 
https://issues.apache.org/jira/browse/IMPALA-14400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18019216#comment-18019216
 ] 

ASF subversion and git services commented on IMPALA-14400:
----------------------------------------------------------

Commit dbac6ab13ad5cbd40ac31e9921265396de6c9433 in impala's branch 
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=dbac6ab13 ]

IMPALA-14400: Fix deadlock in CatalogServiceCatalog.getDbProperty()

IMPALA-13850 (part 4) modified CatalogServiceCatalog.getDb() to delay
looking up the catalog cache until the initial reset() is complete.
EventProcessor can start processing events before reset() happens and
obtain versionLock_.readLock() when calling
CatalogServiceCatalog.getDbProperty(). Later on, it hits a deadlock
when attempting to obtain versionLock_.writeLock() through getDb() /
waitInitialResetCompletion(). This lock upgrade from read to write is
unsafe.
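
For illustration only, the unsafe upgrade can be reproduced in
isolation. The sketch below assumes versionLock_ behaves like a fair
java.util.concurrent.locks.ReentrantReadWriteLock; the class and method
names are hypothetical and are not taken from the Impala code base. It
uses a timed tryLock() so that the failed upgrade returns instead of
hanging.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class, not part of Impala.
final class LockUpgradeDemo {
  // Tries to take the write lock while the calling thread still holds the
  // read lock. Returns true only if the write lock was acquired.
  static boolean tryUpgrade(ReentrantReadWriteLock lock, long timeoutMs) {
    lock.readLock().lock();
    try {
      // A plain writeLock().lock() here would block forever, because the
      // read hold of this same thread never goes away.
      return timedWriteLock(lock, timeoutMs);
    } finally {
      lock.readLock().unlock();
    }
  }

  // Releases the read lock first, then asks for the write lock.
  static boolean writeAfterReleasingRead(ReentrantReadWriteLock lock, long timeoutMs) {
    lock.readLock().lock();
    lock.readLock().unlock();
    return timedWriteLock(lock, timeoutMs);
  }

  // Acquires and immediately releases the write lock, with a timeout.
  private static boolean timedWriteLock(ReentrantReadWriteLock lock, long timeoutMs) {
    try {
      boolean acquired = lock.writeLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS);
      if (acquired) {
        lock.writeLock().unlock();
      }
      return acquired;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    // Assumption: a fair lock, standing in for versionLock_.
    ReentrantReadWriteLock versionLock = new ReentrantReadWriteLock(true);
    System.out.println("upgrade while holding read lock: "
        + tryUpgrade(versionLock, 200));
    System.out.println("write after releasing read lock: "
        + writeAfterReleasingRead(versionLock, 200));
  }
}
```

With an untimed lock() call in place of tryLock(), the first case is the
hang described above: the thread waits for a read hold that only it can
release.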

This patch mitigates the issue by changing waitInitialResetCompletion()
to not acquire the write lock. After this patch, it sleeps for 100ms
between checks, looping until the initial reset has completed. Changed
CatalogResetManager.fetchingDbs_ to a ConcurrentLinkedQueue so that
isActive() can be called without holding the write lock.
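
A minimal sketch of this lock-free polling approach follows. It is not
the actual patch: the class name, the volatile completion flag, and the
assumption that isActive() simply reports a non-empty fetch queue are
all hypothetical stand-ins chosen for illustration.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-in for the reset bookkeeping, not Impala code.
final class ResetWaitSketch {
  // Thread-safe queue, so readers need no external lock to inspect it.
  private final Queue<String> fetchingDbs = new ConcurrentLinkedQueue<>();
  // Assumption: completion of the initial reset is published via a flag.
  private volatile boolean initialResetDone = false;

  // Safe to call without holding any lock.
  boolean isActive() {
    return !fetchingDbs.isEmpty();
  }

  void beginFetch(String dbName) {
    fetchingDbs.add(dbName);
  }

  void finishFetch(String dbName) {
    fetchingDbs.remove(dbName);
  }

  void markInitialResetDone() {
    initialResetDone = true;
  }

  // Polls until the initial reset is done, sleeping between checks.
  // Takes no lock, so a caller that already holds a read lock cannot
  // deadlock here. Returns the number of sleeps performed.
  int waitInitialResetCompletion(long pollIntervalMs) {
    int sleeps = 0;
    while (!initialResetDone) {
      try {
        Thread.sleep(pollIntervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
      }
      sleeps++;
    }
    return sleeps;
  }

  public static void main(String[] args) {
    ResetWaitSketch catalog = new ResetWaitSketch();
    // Simulated reset thread that finishes after roughly 250ms.
    Thread resetThread = new Thread(() -> {
      catalog.beginFetch("functional");
      try {
        Thread.sleep(250);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      catalog.finishFetch("functional");
      catalog.markInitialResetDone();
    });
    resetThread.start();
    int sleeps = catalog.waitInitialResetCompletion(100);
    System.out.println("initial reset observed after " + sleeps + " sleep(s)");
  }
}
```

Polling trades a small wake-up latency (up to one interval) for the
guarantee that waiting never participates in a lock cycle.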

Added helper classes ReadLockAndLookupDb and WriteLockAndLookupDb. Both
call waitInitialResetCompletion() before obtaining the appropriate lock.
WriteLockAndLookupDb additionally calls
resetManager_.waitOngoingMetadataFetch() to ensure that the dbCache_
lookup is safe for writes.
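
The wait-then-lock-then-lookup pattern can be sketched roughly as below.
This is a hypothetical illustration, not the helper classes from the
patch: the AutoCloseable shape, the String-valued cache, the no-op
waitOngoingMetadataFetch() placeholder, and its position after the
write lock is taken are all assumptions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch, not Impala code.
final class LockAndLookupSketch {
  final ReentrantReadWriteLock versionLock = new ReentrantReadWriteLock(true);
  // Stand-in for dbCache_: database name -> some property string.
  final Map<String, String> dbCache = new ConcurrentHashMap<>();
  volatile boolean initialResetDone = true;

  // Polls without taking versionLock.
  void waitInitialResetCompletion() {
    while (!initialResetDone) {
      try {
        Thread.sleep(100);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  // Placeholder: the real method presumably blocks until in-flight
  // metadata fetches finish. Here it does nothing.
  void waitOngoingMetadataFetch() {}

  // Holds a lock for the lifetime of the lookup result; close() unlocks.
  final class LockedDb implements AutoCloseable {
    private final Lock lock;
    final String db; // null if the database is not cached

    private LockedDb(Lock lock, String dbName, boolean forWrite) {
      this.lock = lock;
      lock.lock();
      if (forWrite) {
        // Assumed ordering relative to lock acquisition.
        waitOngoingMetadataFetch();
      }
      this.db = dbCache.get(dbName);
    }

    @Override
    public void close() {
      lock.unlock();
    }
  }

  LockedDb readLockAndLookupDb(String dbName) {
    waitInitialResetCompletion(); // wait first, while holding no lock
    return new LockedDb(versionLock.readLock(), dbName, false);
  }

  LockedDb writeLockAndLookupDb(String dbName) {
    waitInitialResetCompletion(); // wait first, while holding no lock
    return new LockedDb(versionLock.writeLock(), dbName, true);
  }

  public static void main(String[] args) {
    LockAndLookupSketch catalog = new LockAndLookupSketch();
    catalog.dbCache.put("functional", "some-property");
    try (LockedDb handle = catalog.readLockAndLookupDb("functional")) {
      System.out.println("found under read lock: " + handle.db);
    }
    try (LockedDb handle = catalog.writeLockAndLookupDb("missing_db")) {
      System.out.println("found under write lock: " + handle.db);
    }
  }
}
```

Bundling the wait and the lock acquisition in one place means callers
cannot accidentally wait for the reset while already holding
versionLock_.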

Skip calling catalog_.startEventsProcessor() in the JniCatalog
constructor. Instead, let CatalogServiceCatalog.reset() start it at the
end of cache population.

Added @Nullable annotations on CatalogServiceCatalog methods that can
return null. Fixed some null check warnings that showed up afterwards.

Remove dead code CatalogServiceCatalog.addUserIfNotExists() and
CatalogOpExecutor.getCurrentEventId().

Testing:
Increased TRIGGER_RESET_METADATA_DELAY from 1s to 3s in
test_metadata_after_failover_with_delayed_reset. Before the patch, it
was easy to hit the deadlock with a 3s delay. The deadlock no longer
occurs after the patch.
Ran test_catalogd_ha.py and test_restart_services.py exhaustively; both
pass.

Change-Id: I3162472ea9531add77886bf1d0d73460ff34d07a
Reviewed-on: http://gerrit.cloudera.org:8080/23382
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Riza Suminto <[email protected]>


> Deadlock in CatalogServiceCatalog due to read lock upgrade to a write lock
> --------------------------------------------------------------------------
>
>                 Key: IMPALA-14400
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14400
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>    Affects Versions: Impala 5.0.0
>            Reporter: Riza Suminto
>            Assignee: Riza Suminto
>            Priority: Major
>         Attachments: jstack_dump.txt
>
>
> test_metadata_after_failover_with_delayed_reset from a precommit Jenkins job 
> caught a case where CatalogD hangs with a deadlock during startup.
> This is easily hit when TRIGGER_RESET_METADATA_DELAY is increased to 3s. The 
> reason is that EventProcessor attempts to obtain versionLock_.writeLock() in 
> CatalogServiceCatalog.getDb() after obtaining versionLock_.readLock() in 
> CatalogServiceCatalog.getDbProperty() (an unsafe lock upgrade from read to 
> write).
> [^jstack_dump.txt] shows the jstack output when the deadlock occurs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

