Csaba Ringhofer created IMPALA-13878:
----------------------------------------

             Summary: Prioritize recently read files in scan range order
                 Key: IMPALA-13878
                 URL: https://issues.apache.org/jira/browse/IMPALA-13878
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend
            Reporter: Csaba Ringhofer


Currently, for each query, every host computes a fixed scan range order, and
fragment instances/scanner threads pop scan ranges from this queue
(IMPALA-11539). The order only considers scan range size (preferring larger
ranges) and HDFS caching.

Giving preference to recently read files could improve the efficiency of the
caches involved in IO (the OS's IO cache, the remote data cache, the file
handle cache). E.g. assuming LRU eviction, if a query reads files f1 ... f100
(in this order) and only 10 files fit in the cache, then when repeating the
query it is more efficient to read in the opposite order, so that the last 10
files, which are still in the cache, are read first.
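
The f1 ... f100 example can be checked with a small simulation. This is only an
illustrative sketch, not Impala code: LruCache and hits_on_repeat are
hypothetical names, and the cache is assumed to be a plain LRU that holds whole
files and has room for exactly 10 of them.

```python
from collections import OrderedDict


class LruCache:
    """Hypothetical minimal LRU cache keyed by file name."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def read(self, name):
        """Return True on a cache hit, False on a miss (the file is then cached)."""
        if name in self.entries:
            self.entries.move_to_end(name)  # mark as most recently used
            return True
        self.entries[name] = None
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used file
        return False


def hits_on_repeat(files, capacity, reverse_second_pass):
    """Read all files once, then again, and count cache hits in the second pass."""
    cache = LruCache(capacity)
    for f in files:  # first run of the query warms the cache
        cache.read(f)
    second = list(reversed(files)) if reverse_second_pass else files
    return sum(cache.read(f) for f in second)


files = [f"f{i}" for i in range(1, 101)]
same_order_hits = hits_on_repeat(files, 10, reverse_second_pass=False)
reversed_hits = hits_on_repeat(files, 10, reverse_second_pass=True)
print(same_order_hits, reversed_hits)  # prints "0 10"
```

Under these assumptions, repeating the query in the same order gets no hits,
because each miss evicts a file that would have been needed later, while the
reversed order hits on all 10 files that were still cached.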

Ideally the order would be dynamic, as parallel queries may read a file,
making it "hot" even if it was cold during planning.
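
One possible shape for such a dynamic order is sketched below. It is a
hypothetical illustration, not a proposal for the actual implementation:
RecencyTracker and pop_next are made-up names, the tracker is assumed to be
shared by all queries on a host, scan ranges are reduced to (path, size)
pairs, and HDFS caching is ignored. The next range is chosen at pop time
rather than planning time, preferring the most recently read file and falling
back to the existing "larger first" rule.

```python
import itertools


class RecencyTracker:
    """Hypothetical host-wide record of when each file was last read."""

    def __init__(self):
        self._clock = itertools.count(1)  # logical clock, larger = more recent
        self._last_read = {}

    def mark_read(self, path):
        self._last_read[path] = next(self._clock)

    def last_read(self, path):
        return self._last_read.get(path, 0)  # 0 means never read ("cold")


def pop_next(pending, tracker):
    """Remove and return the pending (path, size) range to scan next.

    Prefers the most recently read file; ties (e.g. all cold) go to the
    larger range.
    """
    best = max(pending, key=lambda r: (tracker.last_read(r[0]), r[1]))
    pending.remove(best)
    return best


tracker = RecencyTracker()
pending = [("a", 100), ("b", 50), ("c", 10)]

first = pop_next(pending, tracker)   # all files cold: the largest range wins
tracker.mark_read(first[0])
tracker.mark_read("c")               # a parallel query reads file "c"
second = pop_next(pending, tracker)  # "c" is now hot and jumps ahead of "b"
third = pop_next(pending, tracker)
print(first, second, third)
```

The linear scan in pop_next keeps the sketch short; a real queue would need a
cheaper way to re-prioritize ranges as files become hot.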


