Divij Vaidya created KAFKA-15481:
------------------------------------
Summary: Concurrency bug in RemoteIndexCache leads to IOException
Key: KAFKA-15481
URL: https://issues.apache.org/jira/browse/KAFKA-15481
Project: Kafka
Issue Type: Bug
Affects Versions: 3.6.0
Reporter: Divij Vaidya
Fix For: 3.7.0
RemoteIndexCache has a concurrency bug which leads to an IOException while
fetching data from the remote tier.
The events below are listed in chronological order:
Thread 1 (cache thread): invalidates the entry. The removalListener is invoked
asynchronously, so the files have not yet been renamed with the "deleted" suffix.
Thread 2 (fetch thread): tries to find the entry in the cache, doesn't find it
because it has been removed by Thread 1, fetches the entry from S3, and writes
it to the existing file (using replace existing).
Thread 1: the async removalListener is invoked, acquires a lock on the old entry
(which has been removed from the cache), renames the file with the "deleted"
suffix, and starts deleting it.
Thread 2: tries to create the in-memory/mmapped index, but doesn't find the file
and hence creates a new file of size 2GB in the AbstractIndex constructor. The
JVM returns an error as it won't allow creation of a 2GB random access file.
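
The interleaving above can be replayed deterministically on a single thread. The sketch below is only an illustration under assumptions: it uses a plain ConcurrentHashMap and temporary files instead of the real RemoteIndexCache and Caffeine, and the class name, method name, key and file names are all hypothetical. It shows that the file written by the fetch path is gone once the delayed removal work has run.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the cache; none of these names come from Kafka.
class RemovalRaceSketch {

    /** Replays the four steps in order; returns true if the index file survives. */
    static boolean replayInterleaving() {
        try {
            Path dir = Files.createTempDirectory("remote-index-cache");
            Path indexFile = dir.resolve("segment.index");
            Path deletedFile = dir.resolve("segment.index.deleted");
            Files.write(indexFile, new byte[] {1});

            Map<String, Path> cache = new ConcurrentHashMap<>();
            cache.put("segment", indexFile);

            // Step 1 (cache thread): the entry is removed, but the listener
            // that renames the file has not run yet.
            Path staleEntry = cache.remove("segment");

            // Step 2 (fetch thread): cache miss, so the index is downloaded
            // again and written over the existing file.
            if (!cache.containsKey("segment")) {
                Path downloaded = Files.createTempFile(dir, "download", ".tmp");
                Files.write(downloaded, new byte[] {2});
                Files.move(downloaded, indexFile, StandardCopyOption.REPLACE_EXISTING);
            }

            // Step 3 (cache thread): the delayed listener renames the file of
            // the stale entry and deletes it. It is the freshly written file.
            Files.move(staleEntry, deletedFile, StandardCopyOption.REPLACE_EXISTING);
            Files.delete(deletedFile);

            // Step 4 (fetch thread): the file it just wrote no longer exists.
            boolean present = Files.exists(indexFile);
            Files.delete(dir);
            return present;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("index file present after replay: " + replayInterleaving());
    }
}
```

In the real cache the steps run on different threads, so the failure is timing dependent; the replay only fixes the ordering to make the outcome visible.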
*Potential Fix*
Use EvictionListener instead of RemovalListener in Caffeine cache as per the
documentation:
{quote} When the operation must be performed synchronously with eviction, use
{{Caffeine.evictionListener(RemovalListener)}} instead. This listener will only
be notified when {{RemovalCause.wasEvicted()}} is true. For an explicit
removal, {{Cache.asMap()}} offers compute methods that are performed
atomically.{quote}
This will ensure that removal from the cache and marking the file with the
delete suffix happen synchronously, so the above race condition will not occur.
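
For the explicit-removal case mentioned in the quote, a minimal sketch of the idea follows. It is not the actual fix: a plain ConcurrentHashMap stands in for Caffeine's {{Cache.asMap()}} view (both expose atomic compute methods), and the class, method and file names are hypothetical. The rename happens inside compute, so a concurrent fetch of the same key cannot observe the mapping as removed while the file still carries its original name.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch; ConcurrentHashMap stands in for Caffeine's Cache.asMap().
class AtomicRemovalSketch {
    private final ConcurrentMap<String, Path> entries = new ConcurrentHashMap<>();

    /** Fetch path: the file is created inside the atomic computeIfAbsent call. */
    Path getOrFetch(String key, Path file) {
        return entries.computeIfAbsent(key, k -> {
            try {
                Files.write(file, new byte[] {2}); // stands in for the remote download
                return file;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    /** Removes the mapping and renames its file in one atomic step. */
    Path invalidate(String key) {
        Path[] renamed = new Path[1];
        entries.compute(key, (k, file) -> {
            if (file == null) {
                return null; // nothing cached for this key
            }
            try {
                Path target = file.resolveSibling(file.getFileName() + ".deleted");
                renamed[0] = Files.move(file, target, StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return null; // returning null removes the mapping
        });
        return renamed[0];
    }

    /** Invalidate, re-fetch, then run the deferred delete; returns true if the new file survives. */
    static boolean demo() {
        try {
            Path dir = Files.createTempDirectory("remote-index-cache");
            Path indexFile = dir.resolve("segment.index");
            AtomicRemovalSketch cache = new AtomicRemovalSketch();
            cache.getOrFetch("segment", indexFile);
            Path renamed = cache.invalidate("segment"); // file now has the .deleted suffix
            cache.getOrFetch("segment", indexFile);     // re-fetch writes a fresh file
            Files.delete(renamed);                      // deferred cleanup only touches the renamed file
            boolean present = Files.exists(indexFile);
            Files.delete(indexFile);
            Files.delete(dir);
            return present;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("index file present after cleanup: " + demo());
    }
}
```

Doing file I/O inside a compute function holds the map's internal lock for that key while the rename runs, which is acceptable for a short rename but would need care for anything slower.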