Divij Vaidya created KAFKA-15481:
------------------------------------

             Summary: Concurrency bug in RemoteIndexCache leads to IOException
                 Key: KAFKA-15481
                 URL: https://issues.apache.org/jira/browse/KAFKA-15481
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 3.6.0
            Reporter: Divij Vaidya
             Fix For: 3.7.0


RemoteIndexCache has a concurrency bug that leads to an IOException while 
fetching data from the remote tier.

The following events, in timeline order, lead to the failure:

Thread 1 (cache thread): invalidates the entry. The removalListener is invoked 
asynchronously, so the files have not yet been renamed with the "deleted" suffix.

Thread 2 (fetch thread): looks up the entry in the cache and does not find it, 
because Thread 1 has removed it. It fetches the entry from S3 and writes it to 
the existing file (using replace-existing).

Thread 1: the asynchronous removalListener runs and acquires a lock on the old 
entry (which has already been removed from the cache). It renames the file, 
which Thread 2 has just overwritten, with the "deleted" suffix and starts 
deleting it.

Thread 2: tries to create the in-memory (mmapped) index but does not find the 
file, so the AbstractIndex constructor creates a new file of size 2GB. The JVM 
returns an error because it does not allow creating a 2GB random access file.
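A minimal, single-threaded replay of the timeline above may help make the interleaving concrete. It is a sketch, not Kafka code: the class name, the index file name and the ".deleted" suffix spelling are assumptions, and each step simply runs in the order in which the two threads interleave, so the outcome is deterministic.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical replay of the race: steps execute sequentially in the interleaved order.
public class IndexRaceReplay {

    /** Returns true if the fetch thread still finds the index file it just wrote. */
    public static boolean replay() throws IOException {
        Path dir = Files.createTempDirectory("remote-index-cache");
        Path indexFile = dir.resolve("00000000000000000000.index"); // hypothetical name
        Files.write(indexFile, new byte[] {1, 2, 3}); // the old entry's file on disk

        // Step 1 (cache thread): the entry is invalidated, but the async listener has
        // not run yet, so the old file still sits at the live path.

        // Step 2 (fetch thread): cache miss; fetch from remote storage and overwrite
        // the same path using REPLACE_EXISTING.
        byte[] fetched = {4, 5, 6};
        Files.copy(new ByteArrayInputStream(fetched), indexFile,
                StandardCopyOption.REPLACE_EXISTING);

        // Step 3 (cache thread): the delayed listener renames "its" file, which is now
        // the file the fetch thread just wrote, and then deletes it.
        Path deleted = dir.resolve(indexFile.getFileName() + ".deleted");
        Files.move(indexFile, deleted);
        Files.delete(deleted);

        // Step 4 (fetch thread): tries to open the index file it just wrote.
        boolean found = Files.exists(indexFile);
        Files.delete(dir); // clean up the now-empty temp directory
        return found;
    }

    public static void main(String[] args) throws IOException {
        // prints "fetch thread finds its index file: false"
        System.out.println("fetch thread finds its index file: " + replay());
    }
}
```

In the real cache, step 4 is where AbstractIndex falls back to creating a fresh file, which then fails as described above.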

*Potential Fix*
Register the listener with {{Caffeine.evictionListener}} instead of 
{{Caffeine.removalListener}}, as the Caffeine documentation recommends:
{quote} When the operation must be performed synchronously with eviction, use 
{{Caffeine.evictionListener(RemovalListener)}} instead. This listener will only 
be notified when {{RemovalCause.wasEvicted()}} is true. For an explicit 
removal, {{Cache.asMap()}} offers compute methods that are performed 
atomically.{quote}
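This ensures that removing the entry from the cache and renaming its files with 
the "deleted" suffix happen synchronously, so the race condition above will not 
occur. Since explicit invalidations are not reported to the eviction listener, 
they should mark the files inside one of the atomic {{Cache.asMap()}} compute 
methods instead.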
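The principle behind the fix can be sketched without Caffeine. The example below is a hypothetical stand-in that uses {{java.util.concurrent.ConcurrentHashMap}}, whose compute methods, like those of Caffeine's {{Cache.asMap()}}, run the remapping function atomically per key. Because the files are marked inside the remapping function, the mapping disappears only after marking has finished, so a fetch can miss and reload that key only afterwards. The class, Entry, markForCleanup() and the event log are illustrative names, not RemoteIndexCache's actual API.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-in for the Caffeine-based fix: cleanup runs inside an atomic
// per-key remapping, mirroring what Caffeine's asMap() compute methods provide.
public class AtomicRemovalSketch {

    // Hypothetical cache entry; markForCleanup() stands in for renaming to ".deleted".
    public static final class Entry {
        public final String file;
        public volatile boolean markedForCleanup;
        Entry(String file) { this.file = file; }
        void markForCleanup() { markedForCleanup = true; }
    }

    private final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();
    public final List<String> events = new CopyOnWriteArrayList<>();

    /** Explicit invalidation: the entry is marked before its mapping is removed. */
    public void invalidate(String key) {
        cache.computeIfPresent(key, (k, old) -> {
            old.markForCleanup();
            events.add("marked " + old.file);
            return null; // returning null removes the mapping as part of the same atomic step
        });
    }

    /** Fetch path: a miss (possible only once the old entry is marked) loads a new entry. */
    public Entry fetch(String key) {
        return cache.computeIfAbsent(key, k -> {
            events.add("loaded " + k);
            return new Entry(k + ".index");
        });
    }

    public static void main(String[] args) {
        AtomicRemovalSketch s = new AtomicRemovalSketch();
        Entry first = s.fetch("segment-0");
        s.invalidate("segment-0");
        Entry second = s.fetch("segment-0");
        // prints "[loaded segment-0, marked segment-0.index, loaded segment-0]"
        System.out.println(s.events);
        // prints "true false": only the invalidated entry was marked
        System.out.println(first.markedForCleanup + " " + second.markedForCleanup);
    }
}
```

This single-threaded run only shows the ordering that the atomic remapping guarantees; it does not exercise real contention between cache and fetch threads.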



