kosiew commented on code in PR #23201:
URL: https://github.com/apache/datafusion/pull/23201#discussion_r3497939086
##########
datafusion/execution/src/cache/mod.rs:
##########
@@ -165,3 +168,145 @@ impl Display for TableScopedPath {
}
}
}
+
+/// A fingerprint of the `file_schema` used to compute a file's statistics.
+///
+/// Captures exactly the attributes that determine the layout and meaning of
+/// `Statistics::column_statistics`: each column's name, data type and
+/// nullability, in order. It deliberately excludes field/schema metadata,
which
+/// cannot affect statistics — including it would needlessly fragment the
cache.
+#[derive(Clone, Debug)]
+pub struct SchemaFingerprint {
Review Comment:
@mkleen,
> have a look at the benchmarks results clickbench_partitioned and
wide_schema you will find that without this optimization there are real
regressions.
Sorry I missed this.
@Phoenix500526,
I wonder whether it would be viable to keep `TableScopedPath` as the cache
key and store the schema fingerprint in the cached value too? The lookup would
be:
1. look up by the existing `{table, path}` key;
2. validate file metadata as today (`size`, `last_modified`);
3. also validate `cached.schema_fingerprint == current_schema_fingerprint`;
4. if either validation fails, treat it as a miss and overwrite the entry
for that `{table, path}`.
This still fixes the bug: stats computed for one schema cannot be reused for
another schema. It also preserves cache reuse for repeated anonymous
explicit-schema reads with the same schema, which is the improvement over
#22950.
The tradeoff is that we would only keep one schema's stats per `{table,
path}`, so alternating schemas for the same file may recompute. That seems
acceptable to me because the previous fallback was to skip caching for this
problematic case entirely, and a full multi-schema cache feels more complex
than this issue needs.
Benchmark-wise, this may improve Q6 if the regression is primarily from
schema-aware cache keys, because the key hash/probe returns to the old cheap
`{table, path}` shape. We would still need to verify with
`clickbench_partitioned`, since value-side validation still has per-hit cost:
the current schema fingerprint must exist and cache hits still need schema
validation/fingerprint comparison. For wide schemas, the precomputed
fingerprint hash can still keep that validation cheap. Memory/eviction should
also improve because there is at most one entry per `{table, path}` rather than
one per `{table, path, schema}`.
If we take this approach, `FileStatisticsCache` can likely remain keyed by
`TableScopedPath`, so the 55.0.0 upgrade-guide entry for changing the public
cache key type can be removed or reduced to any smaller `CachedFileMetadata`
API change.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]