Hi all,

   I'm working on a feature for Apache Polaris to persist Iceberg metrics 
(ScanReport and CommitReport) to a database for observability and analytics 
purposes.

   While implementing this, I noticed an asymmetry between the two report types:

     • CommitReport includes detailed row-level metrics via 
CommitMetricsResult: addedRecords, deletedRecords, addedPositionalDeletes, 
addedEqualityDeletes, etc.
     • ScanReport captures planning metrics via ScanMetricsResult (result 
data/delete files, scanned/skipped data manifests, etc.) but doesn't include 
the actual number of rows returned by the scan.
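To make the asymmetry concrete, here is a minimal sketch of a reporter that persists both report types. The record classes below are illustrative stand-ins (field names mirror a subset of org.apache.iceberg.metrics.CommitMetricsResult and ScanMetricsResult, but they are not the real API), and the in-memory list stands in for a database table; a real implementation would implement org.apache.iceberg.metrics.MetricsReporter.

```java
import java.util.*;

/** Stand-in for the MetricsReport marker interface. */
interface MetricsReport {}

/** Illustrative subset of CommitReport fields -- row-level detail is available. */
record CommitReportSketch(String tableName, long snapshotId,
                          long addedRecords, long removedRecords) implements MetricsReport {}

/** Illustrative subset of ScanReport fields -- note: no rows-read counter. */
record ScanReportSketch(String tableName, long snapshotId,
                        long resultDataFiles, long skippedDataManifests) implements MetricsReport {}

/** Sketch of a persisting reporter; the list stands in for a database table. */
class PersistingReporter {
    final List<Map<String, Object>> store = new ArrayList<>();

    void report(MetricsReport report) {
        Map<String, Object> row = new LinkedHashMap<>();
        if (report instanceof CommitReportSketch c) {
            row.put("type", "commit");
            row.put("table", c.tableName());
            row.put("added_records", c.addedRecords());   // row counts present
            row.put("removed_records", c.removedRecords());
        } else if (report instanceof ScanReportSketch s) {
            row.put("type", "scan");
            row.put("table", s.tableName());
            row.put("result_data_files", s.resultDataFiles()); // file-level only;
            // there is no rows-read field to persist for a planning-time scan report
            row.put("skipped_data_manifests", s.skippedDataManifests());
        }
        store.add(row);
    }
}
```

The commit row carries record-level counters, while the scan row bottoms out at file and manifest counts, which is exactly the gap described above.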

   I understand that ScanReport is generated during the planning phase, so the 
actual row count isn't known at that point. However, for observability use cases 
(query analytics, capacity planning, detecting expensive scans), knowing how 
many rows were actually read would be valuable.

   A few questions for the community:

     1. Is there a technical reason why row counts weren't included in 
ScanReport? Is it because the metrics reporter doesn't have visibility into 
the execution phase?

     2. Would there be interest in extending the metrics API to support 
post-execution scan metrics that include row counts? This could potentially be:
        • A new ScanCompletionReport sent after execution
        • An optional field in ScanReport that could be populated if available
        • A callback mechanism for engines to report execution-time metrics
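As a strawman for the callback option, something like the following could work. Everything here is hypothetical (ScanCompletionReport, ScanCompletionListener, and the toy engine loop are invented for illustration and are not part of Iceberg today):

```java
/** Hypothetical post-execution report -- not part of Iceberg's metrics API. */
record ScanCompletionReport(String tableName, long snapshotId, long rowsReturned) {}

/** Hypothetical callback an engine would invoke once scan execution finishes. */
interface ScanCompletionListener {
    void onScanComplete(ScanCompletionReport report);
}

/** Toy "engine" side: counts rows as it consumes them, then fires the callback. */
class ScanExecutionSketch {
    static void runScan(String table, long snapshotId, long[] rows,
                        ScanCompletionListener listener) {
        long count = 0;
        for (long ignored : rows) {
            count++; // a real engine would count rows as splits are consumed
        }
        listener.onScanComplete(new ScanCompletionReport(table, snapshotId, count));
    }
}
```

The appeal of a callback (versus a mutable field on ScanReport) is that the planning-time report stays immutable and the execution-time signal arrives as a separate, clearly-labeled event.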

     3. How do others currently track rows-read metrics for Iceberg tables? 
Are there engine-specific solutions (Spark metrics, Trino query stats, etc.) 
that people rely on instead?

   For context, our use case is building a centralized metrics store where 
teams can analyze table access patterns, identify hot tables, and understand 
read/write workloads across their data lake.

   Any insights or suggestions would be appreciated!

   Thanks,

-
Anand K Sankaran
Workday Data Cloud
