Hello,

I have rewritten most of the proposal, though I think some more research is still needed to get the Metric specification right. I plan to do that research, and would like to ask you all for more help to make this proposal better. In particular, now that the metrics format is designed by default to let metrics pass through to monitoring collection systems such as Dropwizard and Stackdriver, it needs to be complete enough to be compatible with those systems.
I think some changes will be needed to get there, but I wanted to send out this document, which contains the general idea, and continue refining it. Please take a look and let me know what you think.
https://docs.google.com/document/d/1MtBZYV7NAcfbwyy9Op8STeFNBxtljxgy69FkHMvhTMA/edit

Major Revision: April 17, 2018
The design has been reworked to use a metric format which resembles the Dropwizard and Stackdriver formats, allowing metrics to be passed through. The generic bytes-payload style of metrics is still available, but is reserved for complex use cases which do not fit into these typical metrics collection systems.

Note: this document isn't 100% complete; there are a few areas which still need to be improved, and through our discussion and more research I want to complete these details. Please share any thoughts that you have.

1. The metric specification and Metric proto schemas may need revisions:
   1. The distribution format needs to be refined so that it's compatible with Stackdriver and Dropwizard; the current example format may need changes, and a second distribution format may be needed.
   2. Annotations need to be examined in detail, to determine whether there are first-class annotations which should be supported so that they pass through properly to Dropwizard and Stackdriver.
   3. Aggregation functions may need parameters. For example, Top(n) may need to be parameterized. How should this best be supported?

On Tue, Apr 17, 2018 at 11:10 AM Ben Chambers <[email protected]> wrote:

> That sounds like a very reasonable choice -- given the discussion seemed to be focusing on the differences between these two categories, separating them will allow the proposal (and implementation) to address each category in the best way possible without needing to make compromises.
>
> Looking forward to the updated proposal.
>
> On Tue, Apr 17, 2018 at 10:53 AM Alex Amato <[email protected]> wrote:
>
>> Hello,
>>
>> I just wanted to give an update.
>>
>> After some discussion, I've realized that it's best to break up the two concepts, with two separate ways of reporting monitoring data. These two categories are:
>>
>> 1. Metrics - Counters, Gauges, Distributions. These are well-defined concepts for monitoring information and need to integrate with existing metrics collection systems such as Dropwizard and Stackdriver. Most metrics will go through this model, which will allow runners to process new metrics without adding extra code to support them, forwarding them to metric collection systems.
>> 2. Monitoring State - This supports general monitoring data which may not fit into the standard model for Metrics. For example, an I/O source may provide a table of filenames + metadata for files which are old and blocking the system. I will propose a general approach, similar to the URN+payload approach used in the doc right now.
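A rough sketch of what these two reporting shapes might look like, using plain Python stand-ins rather than the actual proto definitions in the doc; the class and field names here (Metric, MonitoringState, payload, and so on) are illustrative assumptions, not the proposed schema:

```python
from dataclasses import dataclass, field
from typing import Dict, Union


# Category 1: a typed metric that maps directly onto systems like
# Dropwizard or Stackdriver. The type comes from a small, fixed set.
@dataclass
class Metric:
    name: str                                   # e.g. "beam:metric:element_count:v1"
    metric_type: str                            # "COUNTER" | "GAUGE" | "DISTRIBUTION"
    labels: Dict[str, str] = field(default_factory=dict)
    value: Union[int, float] = 0


# Category 2: monitoring state that does not fit the typed model.
# A URN identifies the format; the payload is opaque to the runner.
@dataclass
class MonitoringState:
    urn: str                                    # e.g. "myio:monitoring:stuck_files:v1" (made up)
    payload: bytes = b""                        # encoding is defined by whoever owns the URN


# A runner can forward Metric values to a metrics backend generically,
# but has to understand the URN before it can interpret MonitoringState.
element_count = Metric(name="beam:metric:element_count:v1",
                       metric_type="COUNTER",
                       labels={"ptransform": "MyTransform"},
                       value=42)
stuck_files = MonitoringState(urn="myio:monitoring:stuck_files:v1",
                              payload=b"<encoded table of filenames + metadata>")
```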
> One thing to keep in mind -- even though it makes sense to allow each I/O source to define its own monitoring state, this shifts responsibility for collecting that information to each runner, and for displaying that information to every consumer. It would be reasonable to see if there could be a set of 10 or so that covered most of the cases and could become the "standard" set (e.g., watermark information, performance information, etc.).
>
>> I will rewrite most of the doc and propose separating these two very different use cases: one which optimizes for integration with existing monitoring systems, and the other which optimizes for flexibility, allowing more complex and custom metric formats for other debugging scenarios.
>>
>> I just wanted to give a brief update on the direction of this change, before writing it up in full detail.
>>
>> On Mon, Apr 16, 2018 at 10:36 AM Robert Bradshaw <[email protected]> wrote:
>>
>>> I agree that the user/system dichotomy is false; the real question is how counters can be scoped to avoid accidental (or even intentional) interference. A system that entirely controls the interaction between the "user" (from its perspective) and the underlying system can do this by prefixing all requested "user" counters with a prefix it will not use itself. Of course this breaks down whenever the wrapping isn't complete (either on the production or consumption side), but it may be worth doing for some components (like the SDKs that value being able to provide this isolation for better behavior). Actual (human) end users are likely to be much less careful about avoiding conflicts than library authors, who in turn are generally less careful than authors of the system itself.
>>>
>>> We could alternatively allow specifying fully qualified URNs for counter names in the SDK APIs, and let "normal" user counters live in the empty namespace rather than something like beam:metrics:{user,other,...}, perhaps with SDKs prohibiting certain conflicting prefixes (which is less than ideal). A layer above the SDK that has similar absolute control over its "users" would have a similar decision to make.
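One way to picture the scoping described here is an SDK-side helper that qualifies every user-requested counter under a reserved prefix so it cannot collide with system URNs. The prefix strings below follow the examples in this thread, but the helper itself is a hypothetical sketch, not an existing API:

```python
# Hypothetical SDK helper: fully qualify user-requested counters under a
# reserved "user" prefix so they can never collide with system metric URNs.
SYSTEM_PREFIX = "beam:metric:"        # metrics defined by the model / runner / SDK
USER_PREFIX = "beam:metric:user:"     # everything requested through the user-facing API


def user_metric_urn(namespace: str, name: str) -> str:
    """Build a fully qualified URN for a user counter."""
    for part in (namespace, name):
        if ":" in part:
            raise ValueError("namespace and name may not contain ':'")
    return f"{USER_PREFIX}{namespace}:{name}"


assert user_metric_urn("my_namespace", "my_timer") == \
    "beam:metric:user:my_namespace:my_timer"
```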
Or "beam:metric:[some non-user designation]" >>>>>> >>>>> >>>>> Sure. Looks good. >>>>> >>>>> >>>>>> >>>>>> labels: { "ptransform" } >>>>>>> >>>>>> >>>>>> How does an SDK act on this information? >>>>>> >>>>> >>>>> The SDK is obligated to submit any metric values for that spec with a >>>>> "ptransform" -> "transformName" in the labels field. Autogenerating code >>>>> from the spec to avoid typos should be easy. >>>>> >>>>> >>>>>> >>>>>> >>>>>>> type: GAUGE >>>>>>> value_type: int64 >>>>>>> >>>>>> >>>>>> I was lumping type and value_type into the same field, as a urn for >>>>>> possibly extensibility, as they're tightly coupled (e.g. quantiles, >>>>>> distributions). >>>>>> >>>>> >>>>> My inclination is that keeping this set relatively small and fixed to >>>>> a set that can be readily exported to external monitoring systems is more >>>>> useful than the added indirection to support extensibility. Lumping >>>>> together seems reasonable. >>>>> >>>>> >>>>>> >>>>>> >>>>>>> units: SECONDS >>>>>>> description: "Times my stuff" >>>>>>> >>>>>> >>>>>> Are both of these optional metadata, in the form of key-value field, >>>>>> for flattened into the field itself (along with every other kind of >>>>>> metadata you may want to attach)? >>>>>> >>>>> >>>>> Optional metadata in the form of fixed fields. Is there a use case >>>>> for arbitrary metadata? What would you do with it when exporting? >>>>> >>>>> >>>>>> >>>>>> >>>>>>> } >>>>>>> >>>>>>> Then metrics submitted would look like: >>>>>>> { >>>>>>> name: "my_timer" >>>>>>> labels: {"ptransform": "MyTransform"} >>>>>>> int_value: 100 >>>>>>> } >>>>>>> >>>>>> >>>>>> Yes, or value could be a bytes field that is encoded according to >>>>>> [value_]type above, if we want that extensibility (e.g. if we want to >>>>>> bundle the pardo sub-timings together, we'd need a proto for the value, >>>>>> but >>>>>> that seems to specific to hard code into the basic structure). >>>>>> >>>>>> >>>>> The simplicity coming from the fact that there's only one proto format >>>>>>> for the spec and for the value. The only thing that varies are the >>>>>>> entries >>>>>>> in the map and the value field set. It's pretty easy to establish >>>>>>> contracts around this type of spec and even generate protos for use the >>>>>>> in >>>>>>> SDK that make the expectations explicit. >>>>>>> >>>>>>> >>>>>>> On Fri, Apr 13, 2018 at 2:23 PM Robert Bradshaw <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>>> On Fri, Apr 13, 2018 at 1:32 PM Kenneth Knowles <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> >>>>>>>>> Or just "beam:counter:<namespace>:<name>" or even >>>>>>>>> "beam:metric:<namespace>:<name>" since metrics have a type separate >>>>>>>>> from >>>>>>>>> their name. >>>>>>>>> >>>>>>>> >>>>>>>> I proposed keeping the "user" in there to avoid possible clashes >>>>>>>> with the system namespaces. (No preference on counter vs. metric, I >>>>>>>> wasn't >>>>>>>> trying to imply counter = SumInts) >>>>>>>> >>>>>>>> >>>>>>>> On Fri, Apr 13, 2018 at 2:02 PM Andrea Foegler <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> I like the generalization from entity -> labels. I view the >>>>>>>>> purpose of those fields to provide context. And labels feel like they >>>>>>>>> supports a richer set of contexts. >>>>>>>>> >>>>>>>> >>>>>>>> If we think such a generalization provides value, I'm fine with >>>>>>>> doing that now, as sets or key-value maps, if we have good enough >>>>>>>> examples >>>>>>>> to justify this. >>>>>>>> >>>>>>>> >>>>>>>>> The URN concept gets a little tricky. 
>>>>>>> On Fri, Apr 13, 2018 at 2:23 PM Robert Bradshaw <[email protected]> wrote:
>>>>>>>
>>>>>>>> On Fri, Apr 13, 2018 at 1:32 PM Kenneth Knowles <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Or just "beam:counter:<namespace>:<name>" or even "beam:metric:<namespace>:<name>" since metrics have a type separate from their name.
>>>>>>>>
>>>>>>>> I proposed keeping the "user" in there to avoid possible clashes with the system namespaces. (No preference on counter vs. metric; I wasn't trying to imply counter = SumInts.)
>>>>>>>>
>>>>>>>> On Fri, Apr 13, 2018 at 2:02 PM Andrea Foegler <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I like the generalization from entity -> labels. I view the purpose of those fields as providing context, and labels feel like they support a richer set of contexts.
>>>>>>>>
>>>>>>>> If we think such a generalization provides value, I'm fine with doing that now, as sets or key-value maps, if we have good enough examples to justify this.
>>>>>>>>
>>>>>>>>> The URN concept gets a little tricky. I totally agree that the context fields should not be embedded in the name. There's a "name" which is the identifier that can be used to communicate what context values are supported / allowed for metrics with that name (for example, element_count expects a ptransform ID). But then there's the context. In Stackdriver, this context is a map of key-value pairs; the type is considered metadata associated with the name, but not communicated with the value.
>>>>>>>>
>>>>>>>> I'm not quite following you here. If context contains a ptransform id, then it cannot be associated with a single name.
>>>>>>>>
>>>>>>>>> Could the URN be "beam:namespace:name" and every metric have a map of key-value pairs for context?
>>>>>>>>
>>>>>>>> The URN is the name. Something like "beam:metric:ptransform_execution_times:v1".
>>>>>>>>
>>>>>>>>> Not sure where this fits in the discussion or if this is handled somewhere, but allowing for a metric configuration that's provided independently of the value allows for configuring "type", "units", etc. in a uniform way without having to encode them in the metric name / value. Stackdriver expects each metric type to have been configured ahead of time with these annotations / metadata. Then values are reported separately. For system metrics, the definitions can be packaged with the SDK. For user metrics, they'd be defined at runtime.
>>>>>>>>
>>>>>>>> This feels like the metrics spec, which specifies that the metric with name/URN X has this type plus a bunch of other metadata (e.g. units, if they're not implicit in the type). This gets into whether the type should be Duration{Sum,Max,Distribution,...} vs. Int{Sum,Max,Distribution,...} + units metadata.
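As a closing illustration of the "configure the spec ahead of time, report values separately" flow discussed above, here is a small hypothetical sketch; the registry class and its methods are assumptions for illustration, not part of any Beam or Stackdriver API:

```python
from typing import Dict


class MetricSpecRegistry:
    """Hypothetical registry: specs are configured ahead of time, values reported later."""

    def __init__(self) -> None:
        self._specs: Dict[str, dict] = {}

    def register(self, spec: dict) -> None:
        # System metric specs could ship with the SDK; user specs are added at runtime.
        self._specs[spec["name"]] = spec

    def validate(self, value: dict) -> None:
        spec = self._specs[value["name"]]        # unknown metric names raise KeyError
        missing = set(spec["labels"]) - set(value.get("labels", {}))
        if missing:
            raise ValueError(f"{value['name']} is missing labels: {missing}")


registry = MetricSpecRegistry()
registry.register({"name": "beam:metric:user:my_namespace:my_timer",
                   "labels": ["ptransform"], "type": "GAUGE",
                   "value_type": "int64", "units": "SECONDS"})
registry.validate({"name": "beam:metric:user:my_namespace:my_timer",
                   "labels": {"ptransform": "MyTransform"}, "int_value": 100})
```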
