sfluor opened a new issue, #16044:
URL: https://github.com/apache/datafusion/issues/16044

   ### Is your feature request related to a problem or challenge?
   
   The MetricValue enum currently exposes only single-value statistics: counts, 
gauges, timers, timestamps, and a few hard-coded variants such as SpillCount or 
OutputRows.
   For many operational questions we really care about the shape of a metric’s 
distribution (e.g. "What is the p99 elapsed-compute time?" or "How skewed is 
memory usage across partitions?").
   
   This is especially true when the ExecutionPlan is dispatched to multiple 
nodes / workers in a distributed system as part of multiple requests.
   
   Because there is no “distribution” metric type right now, we can only track 
simple summary statistics such as avg / min / max.
   
   This makes it hard to pinpoint outliers in latency or memory 
usage.
   
   ### Describe the solution you'd like
   
   Add a new `Distribution` variant to the `MetricValue` enum.
   
   That would look like:
   
   ```rust
       Distribution {
           /// The provided name of this metric
           name: Cow<'static, str>,
           /// The accumulated distribution of observed values.
           value: Arc<Mutex<TDigest>>,
       },
   ```
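
   As a rough illustration of how such a variant might be used, the sketch below has several partitions record elapsed-compute samples into one shared value and then reads quantiles back. `SampleSketch` is a hypothetical stand-in that keeps every sample in a sorted vector. It is not a t-digest and not an existing DataFusion type; it only shows the `Arc<Mutex<..>>` sharing, merge, and quantile operations that a `TDigest`-backed value would presumably need.

   ```rust
   use std::sync::{Arc, Mutex};
   use std::thread;

   /// Hypothetical stand-in for a t-digest: keeps all samples, sorted.
   /// A real sketch would compress samples into centroids instead.
   #[derive(Debug, Default, Clone)]
   struct SampleSketch {
       sorted: Vec<f64>,
   }

   impl SampleSketch {
       /// Insert one observation, keeping the vector sorted.
       fn record(&mut self, v: f64) {
           let idx = self.sorted.partition_point(|x| *x < v);
           self.sorted.insert(idx, v);
       }

       /// Fold another sketch into this one (e.g. one sent by another worker).
       fn merge(&mut self, other: &SampleSketch) {
           for v in &other.sorted {
               self.record(*v);
           }
       }

       /// Nearest-rank quantile for q in [0, 1]; `None` when no samples exist.
       fn quantile(&self, q: f64) -> Option<f64> {
           if self.sorted.is_empty() {
               return None;
           }
           let n = self.sorted.len();
           let rank = (q.clamp(0.0, 1.0) * n as f64).ceil() as usize;
           Some(self.sorted[rank.max(1) - 1])
       }
   }

   fn main() {
       // Shared value, mirroring the `Arc<Mutex<..>>` shape of the proposed variant.
       let dist = Arc::new(Mutex::new(SampleSketch::default()));

       // Four partitions each record 25 elapsed-compute samples (in ms).
       let handles: Vec<_> = (0..4)
           .map(|p| {
               let dist = Arc::clone(&dist);
               thread::spawn(move || {
                   for i in 1..=25 {
                       let ms = (p * 25 + i) as f64;
                       dist.lock().unwrap().record(ms);
                   }
               })
           })
           .collect();
       for h in handles {
           h.join().unwrap();
       }

       let d = dist.lock().unwrap();
       println!("p50 = {:?}, p99 = {:?}", d.quantile(0.5), d.quantile(0.99));
   }
   ```

   The `merge` method is the part that matters for the distributed case: each worker could build its own sketch and the coordinator could fold them together, which is not possible with pre-computed avg / min / max values alone.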
   
   
   ### Describe alternatives you've considered
   
   An alternative would be to expose something more generic to allow everyone 
to define their own ways of accumulating metrics throughout the plan execution:
   
   ```rust
   pub enum MetricValue {
       // ... existing variants elided ...
       Custom {
           /// The provided name of this metric
           name: Cow<'static, str>,
           /// A custom implementation of the metric value.
           value: Arc<dyn CustomMetricValue>,
       },
   }
   
   trait CustomMetricValue: Debug + Send + Sync {
       fn new_empty(self: Arc<Self>) -> Arc<dyn CustomMetricValue>;
   
       fn aggregate(
           self: Arc<Self>,
           other: &dyn CustomMetricValue,
       ) -> Arc<dyn CustomMetricValue>;
   }
   ```
   
   This would allow more complex aggregations of metrics. For instance, 
in the context of an execution plan issuing multiple requests, we could track 
the 5 slowest requests with their metadata.
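
   To make the "5 slowest requests" idea concrete, here is a minimal sketch of one possible implementation. `SlowestRequests` and its helper methods are hypothetical names. The `as_any` method is also an assumption added for this sketch: the trait as proposed above has no way to downcast `other`, so some hook of this kind (or an equivalent) would be needed for `aggregate` to read the other value's data.

   ```rust
   use std::any::Any;
   use std::fmt::Debug;
   use std::sync::Arc;

   /// The trait from the proposal, plus a hypothetical `as_any` hook
   /// (an assumption of this sketch) so `aggregate` can downcast `other`.
   trait CustomMetricValue: Debug + Send + Sync {
       fn new_empty(self: Arc<Self>) -> Arc<dyn CustomMetricValue>;

       fn aggregate(
           self: Arc<Self>,
           other: &dyn CustomMetricValue,
       ) -> Arc<dyn CustomMetricValue>;

       fn as_any(&self) -> &dyn Any;
   }

   /// Hypothetical metric: the `limit` slowest requests, slowest first.
   #[derive(Debug, Clone)]
   struct SlowestRequests {
       limit: usize,
       /// (elapsed milliseconds, request metadata such as a URL or id)
       entries: Vec<(u64, String)>,
   }

   impl SlowestRequests {
       fn new(limit: usize) -> Self {
           Self { limit, entries: Vec::new() }
       }

       /// Consume the value and return it with one more request recorded.
       fn with_request(mut self, elapsed_ms: u64, meta: &str) -> Self {
           self.entries.push((elapsed_ms, meta.to_string()));
           self.truncate_to_limit();
           self
       }

       /// Sort slowest first and keep at most `limit` entries.
       fn truncate_to_limit(&mut self) {
           self.entries.sort_by(|a, b| b.0.cmp(&a.0));
           self.entries.truncate(self.limit);
       }
   }

   impl CustomMetricValue for SlowestRequests {
       fn new_empty(self: Arc<Self>) -> Arc<dyn CustomMetricValue> {
           Arc::new(SlowestRequests::new(self.limit))
       }

       fn aggregate(
           self: Arc<Self>,
           other: &dyn CustomMetricValue,
       ) -> Arc<dyn CustomMetricValue> {
           let mut merged = (*self).clone();
           // Values of a different concrete type are ignored in this sketch.
           if let Some(o) = other.as_any().downcast_ref::<SlowestRequests>() {
               merged.entries.extend(o.entries.iter().cloned());
               merged.truncate_to_limit();
           }
           Arc::new(merged)
       }

       fn as_any(&self) -> &dyn Any {
           self
       }
   }

   fn main() {
       let a = Arc::new(
           SlowestRequests::new(5)
               .with_request(120, "GET /a")
               .with_request(15, "GET /b")
               .with_request(900, "GET /c"),
       );
       let b = SlowestRequests::new(5)
           .with_request(300, "GET /d")
           .with_request(40, "GET /e")
           .with_request(700, "GET /f");

       // Six requests were seen in total; only the five slowest survive.
       let merged = a.aggregate(&b);
       println!("{merged:?}");
   }
   ```

   Silently ignoring a mismatched type in `aggregate` is just one choice; returning an error or panicking would be equally plausible, and the right behaviour would be for the actual design to decide.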
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

