Divij Vaidya created KAFKA-15481:
------------------------------------

             Summary: Concurrency bug in RemoteIndexCache leads to IOException
                 Key: KAFKA-15481
                 URL: https://issues.apache.org/jira/browse/KAFKA-15481
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 3.6.0
            Reporter: Divij Vaidya
             Fix For: 3.7.0

RemoteIndexCache has a concurrency bug that leads to an IOException while fetching data from the remote tier.

The events below occur in the following order:

Thread 1 (cache thread): invalidates the entry. The removalListener is invoked asynchronously, so the files have not yet been renamed with the "deleted" suffix.

Thread 2 (fetch thread): tries to find the entry in the cache, does not find it because Thread 1 has removed it, fetches the entry from S3, and writes it to the existing file (using replace existing).

Thread 1: the async removalListener is invoked. It acquires a lock on the old entry (which has already been removed from the cache), renames the file with the "deleted" suffix, and starts deleting it.

Thread 2: tries to create the in-memory/mmapped index but does not find the file, and hence creates a new file of size 2GB in the AbstractIndex constructor. The JVM returns an error because it won't allow creation of a 2GB random access file.

*Potential Fix*

Use EvictionListener instead of RemovalListener in the Caffeine cache, as per the documentation:
{quote}
When the operation must be performed synchronously with eviction, use {{Caffeine.evictionListener(RemovalListener)}} instead. This listener will only be notified when {{RemovalCause.wasEvicted()}} is true. For an explicit removal, {{Cache.asMap()}} offers compute methods that are performed atomically.
{quote}
This ensures that removing the entry from the cache and marking its file with the "deleted" suffix happen synchronously, so the race condition described above cannot occur.
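
The idea behind the fix can be sketched with the JDK alone. The example below is not Kafka's RemoteIndexCache and does not use Caffeine: {{IndexFileCache}}, {{getOrFetch}} and {{invalidate}} are hypothetical names, and a plain {{ConcurrentHashMap}} stands in for {{Cache.asMap()}}. It only illustrates the principle that removing the entry and renaming its file should happen in one atomic per-key step, so that a concurrent fetch of the same key cannot run between the two.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical stand-in for an index file cache; an illustration only, not Kafka code. */
class IndexFileCache {
    private final ConcurrentHashMap<String, Path> entries = new ConcurrentHashMap<>();
    private final Path dir;

    IndexFileCache(Path dir) {
        this.dir = dir;
    }

    /** Returns the cached index file for the key, fetching and writing it if absent. */
    Path getOrFetch(String key, Supplier<byte[]> remoteFetch) {
        // computeIfAbsent is atomic per key, so it cannot interleave with
        // a concurrent invalidate(key) that is renaming the same file.
        return entries.computeIfAbsent(key, k -> {
            Path file = dir.resolve(k + ".index");
            try {
                // Default options create the file or truncate an existing one.
                Files.write(file, remoteFetch.get());
                return file;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    /**
     * Removes the entry and renames its file with a ".deleted" suffix in one
     * atomic step. Returns the renamed path, or null if the key was not cached.
     */
    Path invalidate(String key) {
        Path[] renamed = new Path[1];
        entries.computeIfPresent(key, (k, file) -> {
            Path target = file.resolveSibling(file.getFileName() + ".deleted");
            try {
                Files.move(file, target, StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            renamed[0] = target;
            return null; // returning null removes the mapping
        });
        return renamed[0];
    }
}
```

Because the rename completes before the mapping disappears, a later fetch of the same key writes a fresh file under the original name, and the ".deleted" file can be cleaned up asynchronously without touching it. The trade-off is that file I/O now runs inside the map's compute function, which blocks other operations on that key for the duration of the rename.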