Csaba Ringhofer created IMPALA-13878:
----------------------------------------
Summary: Prioritize recently read files in scan range order
Key: IMPALA-13878
URL: https://issues.apache.org/jira/browse/IMPALA-13878
Project: IMPALA
Issue Type: Improvement
Components: Backend
Reporter: Csaba Ringhofer
Currently, in a query, each host computes a fixed scan range order, and fragment
instances/scanner threads pop from this queue (IMPALA-11539).
The order only considers scan range size (preferring larger) and HDFS caching.
Giving preference to recently read files could improve the efficiency of the caches
involved in IO (the OS's IO cache, the remote data cache, the file handle cache). E.g.,
assuming LRU eviction, if a query reads files f1 ... f100 (in this order) and only 10
files fit in the cache, then when repeating the query it is more efficient to read
in the opposite order, so that the last 10 files, which are still in the cache, are read first.
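The f1 ... f100 example can be checked with a small simulation. This is only a toy model of a single LRU cache that holds whole files; the function names and the cache model are assumptions made for illustration and do not correspond to Impala code.

```python
from collections import OrderedDict


def count_hits(cache, capacity, order):
    """Read files in the given order through an LRU cache; return the hit count."""
    hits = 0
    for name in order:
        if name in cache:
            hits += 1
            cache.move_to_end(name)        # mark as most recently used
        else:
            cache[name] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used file
    return hits


def simulate(reverse_second_run, num_files=100, capacity=10):
    """Run the 'query' twice and return the cache hits of the second run."""
    files = [f"f{i}" for i in range(1, num_files + 1)]
    cache = OrderedDict()
    count_hits(cache, capacity, files)     # first run warms the cache
    second = list(reversed(files)) if reverse_second_run else files
    return count_hits(cache, capacity, second)


same_order_hits = simulate(reverse_second_run=False)
reversed_hits = simulate(reverse_second_run=True)
print(same_order_hits, reversed_hits)
```

In this model, repeating the query in the same order evicts each cached file before it is reached, while the reversed order hits the 10 files left over from the first run.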
Ideally the order would be dynamic, as parallel queries may read a file, making
it "hot" even if it was cold during planning.
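One possible shape for such an ordering is sketched below. Everything here is hypothetical: the `ScanRange` class, the `last_read` map of per-file read timestamps, and the relative weight of the criteria (HDFS-cached first, then most recently read, then larger) are assumptions for illustration, not the existing or proposed Impala implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanRange:
    # Hypothetical stand-in for a scan range; not Impala's actual class.
    path: str
    length: int
    hdfs_cached: bool = False


def order_scan_ranges(ranges, last_read):
    """Sort ranges: HDFS-cached first, then most recently read, then larger.

    last_read maps a file path to the time it was last read (assumed to be
    tracked per host); files that were never read count as time 0.0.
    """
    def key(r):
        return (not r.hdfs_cached, -last_read.get(r.path, 0.0), -r.length)
    return sorted(ranges, key=key)


ranges = [
    ScanRange("a", 100),
    ScanRange("b", 50),
    ScanRange("c", 200),
    ScanRange("d", 10, hdfs_cached=True),
]
last_read = {"b": 5.0}  # only file "b" was read recently
ordered = [r.path for r in order_scan_ranges(ranges, last_read)]
print(ordered)
```

To make the order dynamic, the key could be re-evaluated when a scanner thread pops the next range instead of once up front, so that reads by parallel queries are taken into account.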