Re: [I] [Proposal] Scan I/O acceleration: node-local fragment cache, asynchronous prefetch, and cache-affinity scheduling [datafusion-comet]

via GitHub Sat, 04 Jul 2026 11:22:20 -0700


schenksj commented on issue #4695:
URL: 
https://github.com/apache/datafusion-comet/issues/4695#issuecomment-4883378862


   ### Update: initial implementation + benchmarks
   
   We've built an initial implementation of two of the three pieces in this 
proposal:
   
   - **Node-local fragment cache** — block-aligned, sitting behind the 
`object_store` API, with an
     in-memory tier (SIEVE eviction, single-flight miss dedup, coalesced 
fetches) and an on-disk SSD
     tier (region files, crc32c, wholesale region eviction, an admission gate 
that filters single-pass
     scan traffic). It relies on data-file immutability (ETag/version 
validation, no invalidation
     protocol). The cache core is storage-API-neutral (no 
`object_store`/`opendal`/engine deps) with a
     thin `object_store` adapter, so the same core can back the Delta/Iceberg 
paths later.
   - **Cache-affinity scheduling** — driver-side sticky file→host assignment 
exposed through
     `RDD.getPreferredLocations`, so repeat reads of a file route back to the 
executor that already
     cached it (turning per-host private caches into a coherent cluster cache).
   
   Both are opt-in / default-off, and correctness never depends on the cache or 
on locality, exactly
   as framed above. **Asynchronous prefetch (the third piece) is not 
implemented yet.**
   
   Below are cold-vs-warm numbers across scan / aggregation / join shapes, to 
gauge the effect.
   
   **Important setup caveat:** these were run on a small **MacBook Air** 
(`local[4]`) reading from
   **S3 over the public internet on commercial broadband** — so cold times 
include WAN round-trips and
   are *slower* than a co-located, in-region executor would see. The absolute 
numbers and speedups are
   therefore **illustrative, not datacenter-representative**; a real cluster 
with S3 in-region would
   show smaller (but still meaningful) ratios. Each scenario forces a genuine 
cold read (the cache is
   cleared), then times warm runs; `ColdFetch`/`WarmFetch` are bytes read from 
S3 per run (warm ≈ 0 =
   served from cache).
   
   **Incompressible data (random doubles, ~0.5 GB on disk):**
   
   ```
   Category Scenario                  Cold(ms)  Warm(ms)  Speedup  ColdFetch  
WarmFetch
   
------------------------------------------------------------------------------------
   scan     full scan (16 cols)          30519       503    60.7x    514.8MB    
  0.0MB
   scan     projection (1 col)            3594       147    24.4x     66.8MB    
  0.0MB
   scan     projection (4 cols)           9647       115    83.9x    146.8MB    
  0.0MB
   scan     filter (v2 < 0.25)            5410       119    45.5x     82.8MB    
  0.0MB
   agg      group-by aggregation          5472       158    34.6x     98.8MB    
  0.0MB
   agg      count distinct key            2885        77    37.5x     50.8MB    
  0.0MB
   join     join fact-dim                 5150       149    34.6x     82.8MB    
  0.0MB
   join     join + group-by               6319       176    35.9x     82.8MB    
  0.0MB
   
------------------------------------------------------------------------------------
   mean warm speedup: 44.6x across 8 scenarios
   ```
   
   **Compressible data (id-derived; Parquet dictionary/RLE crushes it to ~8 
MB):**
   
   ```
   Category Scenario                  Cold(ms)  Warm(ms)  Speedup  ColdFetch  
WarmFetch
   
------------------------------------------------------------------------------------
   scan     full scan (16 cols)           1991        81    24.6x      7.8MB    
  0.0MB
   scan     projection (1 col)             677        38    17.8x      7.8MB    
  0.0MB
   scan     projection (4 cols)           1264        48    26.3x      7.8MB    
  0.0MB
   scan     filter (v2 < 0.25)             730        38    19.2x      7.8MB    
  0.0MB
   agg      group-by aggregation           852       124     6.9x      7.8MB    
  0.0MB
   agg      count distinct key             599        68     8.8x      7.8MB    
  0.0MB
   join     join fact-dim                  736        66    11.2x      7.8MB    
  0.0MB
   join     join + group-by                771       101     7.6x      7.8MB    
  0.0MB
   
------------------------------------------------------------------------------------
   mean warm speedup: 15.3x across 8 scenarios
   ```
   
   A few observations:
   
   - The cache serves real data volume (full scan ~515 MB cold → ~0 warm) and 
only fetches the blocks
     a query touches (a 1-column projection reads ~67 MB vs ~515 MB for the 
full 16-column scan).
   - Even the tiny compressed file still shows a large speedup — the cold cost 
over object storage is
     dominated by request/round-trip latency, so the cache helps for the 
compressed data real tables
     actually have, not just large files.
   - Aggregation/join gains are lower than pure scans because shuffle/hash 
compute is a fixed cost the
     cache doesn't touch — a useful signal about where scan-side caching pays 
off most.
   
   Happy to share the detailed design document (architecture, injection points, 
config surface,
   metrics, test plan) and to break the work into per-phase child issues as 
suggested above.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] [Proposal] Scan I/O acceleration: node-local fragment cache, asynchronous prefetch, and cache-affinity scheduling [datafusion-comet]

Reply via email to