I'm not sure what I would want from DropWizard metrics. Most of the things we want to time happen just a few times in a job and are specific to a table.
For example, we want to know how long a particular query takes to plan. That is dependent on how large the table is and what filters were applied. That's why we've added a way to register listeners that can log those scan events for later analysis.

I think I would continue with this approach rather than adding a metrics library. The events that we want to time have to be grouped by table and need to be gathered from many runs of a job or a query. So it makes more sense to improve the events that are generated and the data those events contain.

rb
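A minimal sketch of that listener approach, assuming Iceberg's Listeners and ScanEvent classes under org.apache.iceberg.events; the exact event accessors used below are assumptions about that API, and the log format is illustrative only:

```java
import org.apache.iceberg.events.Listeners;
import org.apache.iceberg.events.ScanEvent;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ScanEventLogger {
  private static final Logger LOG = LoggerFactory.getLogger(ScanEventLogger.class);

  public static void install() {
    // Notified once per table scan planned in this process; the resulting log lines
    // can be aggregated by table across many job or query runs.
    Listeners.register(ScanEventLogger::logScan, ScanEvent.class);
  }

  private static void logScan(ScanEvent event) {
    // Accessor names (tableName, snapshotId, filter) are assumed for illustration.
    LOG.info("scan table={} snapshot={} filter={}",
        event.tableName(), event.snapshotId(), event.filter());
  }
}
```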
On Mon, Feb 25, 2019 at 1:42 AM filip <filip....@gmail.com> wrote:

> +1 on the distributed tracing, no obvious integration points.
> Dropwizard metrics should suffice wrt functional requirements; after all, it does work for Spark [1], right? Wrt your ask on choosing an established dependency with a reasonable dependency set, I think Dropwizard is the only option, no runner-up afaik.
> While we can rely on metrics that are specific to a particular Iceberg implementation (i.e. Hadoop), there are still some interesting metrics I'd consider more than nice-to-have tbh, like histograms of table operation latencies, since for example an Iceberg file append commit operation may consist of up to a dozen effective Hadoop filesystem operations.
> You have the experience of running Iceberg in production, so I was looking for advice on, say, the top three metrics that you'd strongly consider before running Iceberg in production?
>
> [1] https://spark.apache.org/docs/latest/monitoring.html
>
> On Thu, Feb 21, 2019 at 11:26 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> Sounds like one of the first decision points is whether to use a framework with distributed tracing or not. I think I would opt for not requiring distributed tracing.
>>
>> Most of Iceberg is a self-contained library, so there are few points at which distributed tracing would make sense. Is there much value in tracing the metadata swap that happens in a metastore? I'm not sure there is. I think it would probably be sufficient to use a simpler metrics library.
>>
>> I've used DropWizard before, which I thought was trying to be the SLF4J of metrics. Is that still the case? I'd prefer to go with an established project that is likely to have broad support, and one that has a reasonable dependency set.
>>
>> On Mon, Feb 18, 2019 at 2:33 PM filip <filip....@gmail.com> wrote:
>>
>>> Both these solutions provide support for collecting metrics and distributed tracing independent of the platform of choice. They seem to be overlapping quite a lot, though.
>>>
>>> OpenCensus [1] provides bindings for Go, Java, C++ and more [2], and it also seems to support OOB backends as well as custom ones [3]. Looking over the troubleshooting section [4], I could see reasonable value in collecting performance metrics around operation retries, latencies, error rates, etc., though I guess that distributed tracing is their main selling point. The documentation advertises a low footprint too.
>>>
>>> OpenTracing focuses on providing a standard for distributed tracing at both the service and application level. No backend is provided OOB afaik, but it seems to be covered quite extensively by existing backends such as Zipkin, CNCF Jaeger and more [5]. Their specification documentation [6] is very comprehensive.
>>>
>>> Oh, and there is OpenMetrics [7] too, which aims to standardize how we expose metrics. I am learning a lot of interesting things from their issues page [8].
>>>
>>> Then there is the good old codahale/dropwizard metrics library [9] that we could leverage just as well to expose internal metrics from the library, though with no distributed tracing support. I don't think that DW metrics supports tags, though; reading [10], it seems they're treating it as a breaking change and the engineering team is looking to add tag support in version 5.0.
>>>
>>> I am thinking that distributed tracing might prove very useful for troubleshooting operations that require atomic guarantees. I am thinking/hoping that should any backend we'd use for implementing Iceberg be using either OpenCensus or OpenTracing, we might get distributed tracing support; it'd be really interesting to see spans cross process boundaries.
>>>
>>> I am saying a lot of "hoping" and "thinking" because I haven't used either one in a real-world implementation, but I thought I might get folks interested in the topic and something good would come out of this.
>>>
>>> [1] https://opencensus.io/introduction/ and https://opensource.google.com/projects/opencensus
>>> [2] https://opencensus.io/language-support/
>>> [3] https://opencensus.io/introduction/#backend-support
>>> [4] https://opencensus.io/advanced-concepts/troubleshooting/
>>> [5] https://opentracing.io/docs/supported-tracers/
>>> [6] https://opentracing.io/specification/
>>> [7] https://openmetrics.io/
>>> [8] https://github.com/OpenObservability/OpenMetrics/issues
>>> [9] https://metrics.dropwizard.io/4.0.0/
>>> [10] https://github.com/dropwizard/metrics/issues/1175
>>>
>>> On Mon, Feb 18, 2019 at 11:03 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>
>>>> I don't know. Can you elaborate on what opencensus and opentracing are?
>>>>
>>>> On Mon, Feb 18, 2019 at 12:51 PM filip <filip....@gmail.com> wrote:
>>>>
>>>>> /Filip
>>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>> --
>>> Filip Bocse
>>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
> --
> Filip Bocse
>

--
Ryan Blue
Software Engineer
Netflix
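For reference, a minimal sketch of the Dropwizard-style histogram of commit latencies discussed in the thread; the metric name and the sleep standing in for a commit are illustrative assumptions, not part of any Iceberg API:

```java
import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import java.util.concurrent.TimeUnit;

public class CommitMetrics {
  public static void main(String[] args) throws Exception {
    MetricRegistry registry = new MetricRegistry();
    // A Dropwizard Timer keeps a latency histogram (percentiles) plus a rate meter.
    Timer commitTimer = registry.timer("iceberg.commit");

    ConsoleReporter reporter = ConsoleReporter.forRegistry(registry)
        .convertDurationsTo(TimeUnit.MILLISECONDS)
        .build();

    // Time the whole commit; the individual filesystem operations behind it
    // could each get their own timer in the same registry.
    try (Timer.Context ignored = commitTimer.time()) {
      Thread.sleep(120); // stand-in for table.newAppend().appendFile(...).commit()
    }

    reporter.report(); // prints count, min/max/mean and percentile latencies
  }
}
```

A Timer is used rather than a bare Histogram because it records the latency distribution and a throughput meter under one metric name.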
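And a minimal sketch of what wrapping a commit in an OpenTracing span could look like, assuming the io.opentracing API with a concrete tracer (e.g. Jaeger) registered on GlobalTracer; the operation name, tag, and Runnable parameter are illustrative:

```java
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

public class TracedCommit {
  public static void commitWithTrace(String tableName, Runnable commit) {
    Tracer tracer = GlobalTracer.get(); // no-op unless a real tracer is registered
    Span span = tracer.buildSpan("iceberg-commit")
        .withTag("table", tableName)
        .start();
    try {
      commit.run(); // e.g. a stand-in for the actual table commit
    } finally {
      span.finish(); // reports the span to whatever backend is configured
    }
  }
}
```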