[jira] [Updated] (IMPALA-13878) Prioritize recently read files in scan range order

Csaba Ringhofer (Jira) Sun, 30 Mar 2025 06:10:25 -0700


     [ 
https://issues.apache.org/jira/browse/IMPALA-13878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Csaba Ringhofer updated IMPALA-13878:
-------------------------------------
    Description: 
Currently, for each query, every host computes a fixed scan range order, and 
fragment instances/scanner threads pop from this queue (IMPALA-11539).
The order only considers scan range size (preferring larger) and HDFS caching.

Giving preference to recently read files could improve the efficiency of the 
caches involved in IO (the OS's IO cache, the remote data cache, the file 
handle cache). E.g. assuming LRU eviction, if a query reads files f1 ... f100 
(in this order) and only 10 files fit in the cache, then when repeating the 
query it is more efficient to read in the opposite order, so that the last 10 
files, which are still in the cache, are read first.
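
The f1 ... f100 example can be checked with a small simulation. This is only 
a sketch under stated assumptions: a strict LRU cache that holds whole files, 
a capacity of 10 files, and a hypothetical LRUCache class that is not part of 
Impala. It compares repeating the query in the same order against repeating it 
in the opposite order.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical whole-file LRU cache, used only for this illustration."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # least recently used entry comes first

    def read(self, name):
        """Return True on a cache hit, False on a miss; update recency."""
        if name in self.entries:
            self.entries.move_to_end(name)
            return True
        self.entries[name] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return False


def count_hits(cache, order):
    """Read the files in the given order and count cache hits."""
    return sum(cache.read(name) for name in order)


files = [f"f{i}" for i in range(1, 101)]


def hits_on_repeat(second_pass_order):
    """Run the query once in f1..f100 order, then repeat in the given order."""
    cache = LRUCache(capacity=10)
    count_hits(cache, files)  # first run: cold cache, every read misses
    return count_hits(cache, second_pass_order)


same_order_hits = hits_on_repeat(files)
reverse_order_hits = hits_on_repeat(list(reversed(files)))

print("same order hits:   ", same_order_hits)     # 0
print("reverse order hits:", reverse_order_hits)  # 10
```

In this model, repeating in the same order evicts f91 ... f100 before they are 
reached, so nothing hits, while the opposite order reads those 10 files while 
they are still cached.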

Ideally the order would be dynamic, as parallel queries may read a file, 
making it "hot" even if it was "cold" during planning. 
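
One way to picture a dynamic order is to choose the next scan range at pop 
time rather than from a precomputed queue. The sketch below is an assumption 
for illustration, not the existing Impala scheduler: ScanRange, pop_next and 
the last_read map (path to time of most recent read, imagined as shared 
between queries on a host) are hypothetical names, and HDFS caching is left 
out for brevity. Recency is compared first and range size breaks ties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanRange:
    """Hypothetical stand-in for a scan range: file path and length in bytes."""
    path: str
    length: int


def pop_next(pending, last_read):
    """Remove and return the range to read next.

    Prefers the most recently read file; among files with equal recency
    (including never-read files), prefers the larger range.
    """
    best = max(
        pending,
        key=lambda r: (last_read.get(r.path, float("-inf")), r.length),
    )
    pending.remove(best)
    return best


pending = [ScanRange("a", 100), ScanRange("b", 300), ScanRange("c", 200)]
last_read = {}  # no file has been read yet

first = pop_next(pending, last_read)   # nothing is hot: largest range wins
last_read["a"] = 1.0                   # a parallel query reads "a" meanwhile
second = pop_next(pending, last_read)  # "a" is now hot, so it goes first
third = pop_next(pending, last_read)

print([r.path for r in (first, second, third)])  # ['b', 'a', 'c']
```

A linear scan per pop is used here only to keep the sketch short; a real 
implementation would need a cheaper structure and a way to share recency 
information across queries.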

  was:
Currently in a query each host computes a fixed scan range order and fragment 
instances/scanner threads pop from this queue (IMPALA-11539).
The order only considers scan range size (preferring larger) and HDFS caching.

Giving preference to recently read files could improve the efficiency of caches 
involved in IO (OS's IO cache, remote data cache, file handle cache). E.g. 
assuming LRU if a query reads file f1 ... f100 (in this order), and only 10 
files with to cache, then when repeating the query it is more efficient to read 
in the opposite order to read the last 10 files first that are still in cache.

Ideally the order would be dynamic as parallel queries may read a file making 
it "hot" even if it was "cold" during planning. 


> Prioritize recently read files in scan range order
> --------------------------------------------------
>
>                 Key: IMPALA-13878
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13878
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Csaba Ringhofer
>            Priority: Major
>
> Currently, for each query, every host computes a fixed scan range order, and 
> fragment instances/scanner threads pop from this queue (IMPALA-11539).
> The order only considers scan range size (preferring larger) and HDFS caching.
> Giving preference to recently read files could improve the efficiency of the 
> caches involved in IO (the OS's IO cache, the remote data cache, the file 
> handle cache). E.g. assuming LRU eviction, if a query reads files f1 ... f100 
> (in this order) and only 10 files fit in the cache, then when repeating the 
> query it is more efficient to read in the opposite order, so that the last 10 
> files, which are still in the cache, are read first.
> Ideally the order would be dynamic, as parallel queries may read a file, 
> making it "hot" even if it was "cold" during planning. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
