I'm not sure what I would want from DropWizard metrics. Most of the things
we want to time happen just a few times in a job and are specific to a
table.

For example, we want to know how long a particular query takes to plan.
That is dependent on how large the table is and what filters were applied.
That's why we've added a way to register listeners that can log those scan
events for later analysis.
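
Roughly, registering one of those listeners looks like this (written from
memory, so the exact class and method names may be slightly off):

    import org.apache.iceberg.events.Listeners;
    import org.apache.iceberg.events.ScanEvent;

    public class ScanEventLogging {
      // Call once at startup; each scan planned afterwards fires a ScanEvent
      // carrying the table name, snapshot id, and the filter that was applied.
      public static void install() {
        Listeners.register(
            (ScanEvent event) ->
                System.out.printf("table=%s snapshot=%d filter=%s%n",
                    event.tableName(), event.snapshotId(), event.filter()),
            ScanEvent.class);
      }
    }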

I think I would continue with this approach rather than adding a metrics
library. The events that we want to time have to be grouped by table and
need to be gathered from many runs of a job or a query. So it makes more
sense to improve the events that are generated and the data those events
contain.

rb

On Mon, Feb 25, 2019 at 1:42 AM filip <filip....@gmail.com> wrote:

> +1 on the distributed tracing, no obvious integration points.
> Dropwizard metrics should suffice for the functional requirements; after
> all, it works for Spark [1], right? As for your ask about choosing an
> established project with a reasonable dependency set, I think Dropwizard
> is the only option, no runner-up afaik.
> While we can rely on metrics that are specific to a particular Iceberg
> implementation (e.g. Hadoop), there are still some metrics I'd consider
> more than nice-to-have, like histograms of table operation latencies
> (rough sketch below), since for example a single Iceberg file append
> commit may consist of up to a dozen effective Hadoop filesystem
> operations.
> You have experience running Iceberg in production, so I was looking for
> advice on, say, the top three metrics you'd strongly consider before
> running Iceberg in production?
>
> [1] https://spark.apache.org/docs/latest/monitoring.html
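> 
> To make the histogram idea concrete, here is a minimal sketch using the
> Dropwizard Timer API (the metric name and the helper method are made up,
> purely for illustration):
> 
>     import com.codahale.metrics.MetricRegistry;
>     import com.codahale.metrics.Timer;
> 
>     public class CommitTiming {
>       private static final MetricRegistry REGISTRY = new MetricRegistry();
>       // A Timer keeps a latency histogram plus throughput rates; one per
>       // table/operation pair so the numbers stay comparable across tables.
>       private static final Timer COMMIT_TIMER =
>           REGISTRY.timer(MetricRegistry.name("iceberg", "db.table", "append-commit"));
> 
>       static void timedCommit(Runnable commit) {
>         try (Timer.Context ignored = COMMIT_TIMER.time()) {
>           commit.run();  // the actual commit, with its dozen filesystem operations
>         }
>       }
>     }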
>
> On Thu, Feb 21, 2019 at 11:26 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> Sounds like one of the first decision points is whether to use a
>> framework with distributed tracing or not. I think I would opt for not
>> requiring distributed tracing.
>>
>> Most of Iceberg is a self-contained library, so there are few points at
>> which distributed tracing would make sense. Is there much value in tracing
>> the metadata swap that happens in a metastore? I'm not sure there is. I
>> think it would probably be sufficient to use a simpler metrics library.
>>
>> I've used DropWizard before, which I thought was trying to be the SLF4J
>> of metrics. Is that still the case? I'd prefer to go with an established
>> project that is likely to have broad support. And one that has a reasonable
>> dependency set.
>>
>> On Mon, Feb 18, 2019 at 2:33 PM filip <filip....@gmail.com> wrote:
>>
>>> Both of these solutions provide support for collecting metrics and
>>> distributed tracing independent of the platform of choice. They seem to
>>> overlap quite a lot, though.
>>>
>>> OpenCensus [1] provides bindings for Go, Java, C++ and more [2], and it
>>> also seems to support out-of-the-box backends as well as custom ones [3].
>>> Looking over the troubleshooting section [4], I could see reasonable value
>>> in collecting performance metrics for measures around operation retries,
>>> latencies, error rates, etc., though I guess distributed tracing is their
>>> main selling point. The documentation advertises a low footprint too.
>>>
>>> OpenTracing focuses on providing a standard for distributed tracing at
>>> both the service and application level. No backend is provided out of the
>>> box afaik, but it seems to be covered quite extensively by existing
>>> backends such as Zipkin, CNCF Jaeger and more [5]. Their specification
>>> documentation [6] is very comprehensive.
>>>
>>> Oh, and there is OpenMetrics [7] too, which aims to standardize how we
>>> expose metrics. I am learning a lot of interesting things from their
>>> issues page [8].
>>>
>>> Then there is the good old codahale/dropwizard metrics library [9] that
>>> we could leverage just as well to expose internal metrics from the
>>> library, though with no distributed tracing support.
>>> I also don't think that DW metrics supports tags; reading [10], it seems
>>> they consider it a breaking change and the engineering team is looking to
>>> add tags support in version 5.0.
>>>
>>> I am thinking that distributed tracing might prove very useful for
>>> troubleshooting operations that require atomic guarantees.
>>> I am hoping that if whatever backend we end up using to implement Iceberg
>>> already uses either OpenCensus or OpenTracing, we might get distributed
>>> tracing support for free; it would be really interesting to see spans
>>> crossing process boundaries.
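>>>
>>> Purely as a sketch of what that might look like with the OpenTracing Java
>>> API (I haven't wired this into anything; the operation name and tag are
>>> made up):
>>>
>>>     import io.opentracing.Span;
>>>     import io.opentracing.Tracer;
>>>     import io.opentracing.util.GlobalTracer;
>>>
>>>     public class TracedCommit {
>>>       static void commitWithSpan(Runnable commit) {
>>>         Tracer tracer = GlobalTracer.get();   // whichever backend is registered
>>>         Span span = tracer.buildSpan("iceberg.commit").start();
>>>         span.setTag("table", "db.table");     // grouping tag, made up here
>>>         try {
>>>           commit.run();                       // the metadata swap / file append
>>>         } finally {
>>>           span.finish();
>>>         }
>>>       }
>>>     }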
>>>
>>> I am saying a lot of "hoping" and "thinking" because I haven't used
>>> either one in a real-world implementation, but I thought I might get folks
>>> interested in the topic and something good might come out of it.
>>>
>>> [1] https://opencensus.io/introduction/
>>> https://opensource.google.com/projects/opencensus
>>> [2] https://opencensus.io/language-support/
>>> [3] https://opencensus.io/introduction/#backend-support
>>> [4] https://opencensus.io/advanced-concepts/troubleshooting/
>>> [5] https://opentracing.io/docs/supported-tracers/
>>> [6] https://opentracing.io/specification/
>>> [7] https://openmetrics.io/
>>> [8] https://github.com/OpenObservability/OpenMetrics/issues
>>> [9] https://metrics.dropwizard.io/4.0.0/
>>> [10] https://github.com/dropwizard/metrics/issues/1175
>>>
>>>
>>> On Mon, Feb 18, 2019 at 11:03 PM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> I don't know. Can you elaborate on what opencensus and opentracing are?
>>>>
>>>> On Mon, Feb 18, 2019 at 12:51 PM filip <filip....@gmail.com> wrote:
>>>>
>>>>>
>>>>> /Filip
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>>
>>> --
>>> Filip Bocse
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Filip Bocse
>


-- 
Ryan Blue
Software Engineer
Netflix
