Ahmar Suhail created HADOOP-18291:
-------------------------------------

             Summary: SingleFilePerBlockCache does not have a limit
                 Key: HADOOP-18291
                 URL: https://issues.apache.org/jira/browse/HADOOP-18291
             Project: Hadoop Common
          Issue Type: Sub-task
            Reporter: Ahmar Suhail


Currently there is no limit on the size of the disk cache. This means we could end up 
with a large number of files on disk, especially for access patterns that are very 
random and do not always read the block fully. 

 

eg:

in.seek(5);
in.read();
in.seek(blockSize + 10); // block 0 gets saved to disk as it's not fully read
in.read();
in.seek(2 * blockSize + 10); // block 1 gets saved to disk
// ... and so on

 

The in-memory cache is bounded, and by default has a limit of 72MB (9 blocks). 
When a block has been fully read and a seek is issued, it is released 
[here|https://github.com/apache/hadoop/blob/feature-HADOOP-18028-s3a-prefetch/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/read/S3CachingInputStream.java#L109].
 We can also delete the on-disk file for the block at that point, if one exists. 
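
Something along these lines could work in the release path (a rough sketch only; 
releaseBlock, blockPool, blockCache and deleteBlockFile are illustrative names, not 
the actual prefetching API):

private void releaseBlock(int blockNumber) throws IOException {
  // existing behaviour: return the fully read block's buffer to the in-memory pool
  blockPool.release(blockNumber);

  // proposed addition: also remove the cached file on disk, if one was written
  if (blockCache.containsBlock(blockNumber)) {
    blockCache.deleteBlockFile(blockNumber);
  }
}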

 

We could also add an upper limit on disk space and, when that limit is reached, delete 
the file that stores the data of the block furthest from the current block (similar to 
the eviction done by the in-memory cache). 


