Works for me.

(Btw: I did not really vote, just stated my overall support :P )

On 9/5/19 2:50 PM, John Roesler wrote:
> Thanks, all.
> 
> FWIW, the most recent formulation from Guozhang + Bruno's addendum would
> have my support.
> 
> Thanks,
> -John
> 
> On Thu, Sep 5, 2019 at 4:05 PM Bruno Cadonna <br...@confluent.io> wrote:
> 
>> Hi Guozhang,
>>
>> Your summary corresponds to my proposal.
>>
>> A new value would only be added if in future we change the metrics in
>> a backward-incompatible way, i.e., 2.4-<the version before the new
>> breaking change>. "latest" will always stay the default.
>>
>> Best,
>> Bruno
>>
>> On Thu, Sep 5, 2019 at 10:57 PM Guozhang Wang <wangg...@gmail.com> wrote:
>>>
>>> Hello Bruno,
>>>
>>> I think your concern makes sense; let's adopt this suggestion in KIP-444
>>> instead. Just to clarify:
>>>
>>> 1. The default value would be "latest".
>>> 2. The only other valid value is "0.10.0-2.3".
>>>
>>> And moving forward this config may stay without any new values.
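[Editor's illustration] To make the agreed scheme concrete -- default "latest", one legacy value "0.10.0-2.3" -- here is a minimal, self-contained sketch of how such a config value could be validated. The class and method names are hypothetical; this is not the actual Kafka Streams config code.

```java
import java.util.Set;

// Hypothetical sketch of validating `built.in.metrics.version` per the
// scheme summarized in this thread; not the actual Kafka Streams code.
public class MetricsVersionValidator {
    static final String DEFAULT = "latest";
    private static final Set<String> VALID = Set.of("latest", "0.10.0-2.3");

    static String validate(String value) {
        // An unset config falls back to the default "latest".
        String v = (value == null || value.isEmpty()) ? DEFAULT : value;
        if (!VALID.contains(v)) {
            throw new IllegalArgumentException(
                "Invalid built.in.metrics.version: " + v);
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(validate(null));         // default -> "latest"
        System.out.println(validate("0.10.0-2.3")); // opt into legacy metrics
    }
}
```

Keeping "latest" as the permanent default means no new value is needed per release; a new range value would only ever be added alongside a breaking metrics change.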
>>>
>>>
>>> Guozhang
>>>
>>>
>>> On Thu, Sep 5, 2019 at 12:16 PM Bruno Cadonna <br...@confluent.io>
>> wrote:
>>>
>>>> Hi Guozhang,
>>>>
>>>> I think user experience and code maintenance are tightly related. The
>>>> harder the code is to maintain, the worse the user experience will get.
>>>>
>>>> Making the config optional does not solve the issue. Wouldn't users be
>>>> puzzled when we release 2.5 and they cannot set
>>>> built.in.metrics.version to 2.4 to be sure of getting the same metrics
>>>> as in that version? It seems that with that solution we would just push
>>>> the maintenance burden to the next release.
>>>>
>>>> I cannot see how the values `latest` and `0.10.0-2.3` would lead to a
>>>> bad user experience.
>>>>
>>>> Regarding testing, at least at the integration-test level, we
>>>> absolutely need to test all versions. It is too easy to make a mistake
>>>> with so many versions. Remember that at the integration-test level we
>>>> need to start an embedded Kafka for each single test.
>>>>
>>>> Best,
>>>> Bruno
>>>>
>>>> On Thu, Sep 5, 2019 at 8:46 PM Guozhang Wang <wangg...@gmail.com>
>> wrote:
>>>>>
>>>>> Hi Bruno,
>>>>>
>>>>> Thanks for raising this point. I think the main motivation behind this
>>>>> proposal is, like Matthias said, to shift the understanding burden from
>>>>> users onto our own shoulders. Testing-wise, I think we do not
>>>>> necessarily need to explode the testing matrix but can just test the
>>>>> last version before each metrics refactoring (again, hopefully this is
>>>>> the only time), and hence I think it is worth it for the benefit of the
>>>>> user experience. WDYT?
>>>>>
>>>>> Hi Matthias,
>>>>>
>>>>> Thanks for your feedback, I will update the wiki page accordingly.
>>>>>
>>>>> Will also close the other voting thread with your vote.
>>>>>
>>>>> Guozhang
>>>>>
>>>>> On Thu, Sep 5, 2019 at 11:27 AM Matthias J. Sax <
>> matth...@confluent.io>
>>>>> wrote:
>>>>>
>>>>>> I share Bruno's concern about future releases; however, I would make
>>>>>> a slightly different proposal.
>>>>>>
>>>>>> Instead of using "latest", we could just make the config optional and,
>>>>>> if not set, use the new metrics code. This way we don't need to add a
>>>>>> new version number each time we do a new release (note that it would
>>>>>> be weird to keep the default value "2.4" in future releases).
>>>>>>
>>>>>> For enabling backward compatibility: I don't have a strong opinion
>> if
>>>> we
>>>>>> should have a single value "0.10.0-2.3" or list each version
>>>> individually.
>>>>>>
>>>>>> In KIP-268 (fixing metadata upgrade) we decided to list each version
>>>>>> individually, as it seemed simpler for users. Also, we wanted to hide
>>>>>> which release uses which metadata version (v0 in 0.10.0, and v1 in
>>>>>> 0.10.1 to 1.1). We could have collapsed `0.10.1` to `1.1` into a
>>>>>> single value, but it seemed not to give the best user experience.
>>>>>>
>>>>>> I think this KIP is a little different though, and both options seem
>>>>>> to be valid. However, I would like to emphasize that we should
>>>>>> optimize for user experience (and not for whether it's harder/easier
>>>>>> to test etc.; in doubt, we should always take on the burden if it
>>>>>> helps to lift the burden from users).
>>>>>>
>>>>>> Overall, I am +1
>>>>>>
>>>>>> Some nits:
>>>>>>
>>>>>> (1) I think the motivation section for updating the `StreamsMetrics`
>>>>>> interface does not make it clear why we need the change. What is the
>>>>>> issue with the current interface, and how do the new methods address
>>>>>> the issue?
>>>>>>
>>>>>> (2) The metric name `put | put-if-absent .. | get-latency (avg |
>>>>>> max)` is hard to read because it indicates that there is a
>>>>>> `get-latency` method call on stores -- can we update it to
>>>>>>
>>>>>> `(put | put-if-absent .. | get)-latency (avg | max)`
>>>>>>
>>>>>> (3) typo: `When users override it to "2.2" or below,` this should
>> be
>>>>>> "2.3" -- or maybe even different if Bruno's concern gets addressed.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -Matthias
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 9/4/19 12:26 PM, Bruno Cadonna wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am sorry to restart the discussion here, but I came across a
>> small
>>>>>>> issue in the KIP.
>>>>>>>
>>>>>>> I started to implement KIP-444 and I am a bit concerned about the
>>>>>>> values for the config `built.in.metrics.version`. In the KIP the
>>>>>>> possible values are specified as all Kafka Streams versions. I think
>>>>>>> that this set of values is really hard to maintain in the code and
>>>>>>> it also blows up the testing burden unnecessarily because all
>>>>>>> versions need to be tested. My proposal (backed by John) is to use
>>>>>>> the following values:
>>>>>>> - `latest` for the latest version of the metrics
>>>>>>> - `0.10.0-2.3` for the version before `latest`
>>>>>>> If, in the future, say in version 4.1, we need to change the metrics
>>>>>>> again, we would add `2.4-4.0` to the values of the config. With new
>>>>>>> major versions, we could also get rid of some values.
>>>>>>>
>>>>>>> WDYT?
>>>>>>>
>>>>>>> You can also have a look at the PR
>>>>>>> https://github.com/apache/kafka/pull/7279 to see this in code.
>>>>>>>
>>>>>>> Best,
>>>>>>> Bruno
>>>>>>>
>>>>>>> On Tue, Aug 20, 2019 at 8:29 PM Guozhang Wang <
>> wangg...@gmail.com>
>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hello Bruno,
>>>>>>>>
>>>>>>>> I've updated the wiki page again per your comments, here's a
>> brief
>>>>>> summary:
>>>>>>>>
>>>>>>>> 1. added the list of removed metrics.
>>>>>>>> 2. added a task-level INFO metric "dropped-records" that covers
>> all
>>>>>>>> scenarios and merges in the existing "late-records-drop",
>>>>>>>> "skipped-records", and "expired-window-records-drop".
>>>>>>>> 3. renamed the util functions of StreamsMetrics as
>>>> `addLatencyRateTotal`
>>>>>>>> and `addRateTotal` sensors.
>>>>>>>>
>>>>>>>>
>>>>>>>> Since I feel it has incorporated all of your comments I'm going
>> to
>>>> start
>>>>>>>> the vote thread for this KIP now.
>>>>>>>>
>>>>>>>>
>>>>>>>> Guozhang
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Aug 20, 2019 at 9:59 AM Guozhang Wang <
>> wangg...@gmail.com>
>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Bruno,
>>>>>>>>>
>>>>>>>>> No, it was not intentional, and we can definitely add the
>>>>>>>>> total-amount sensor as well -- these are just util functions to
>>>>>>>>> save users some lines of code anyway, and should be
>>>>>>>>> straightforward.
>>>>>>>>>
>>>>>>>>> Guozhang
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Aug 20, 2019 at 1:05 AM Bruno Cadonna <
>> br...@confluent.io>
>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Guozhang,
>>>>>>>>>>
>>>>>>>>>> I totally missed the total invocation count metric in the
>> javadoc.
>>>>>>>>>> Which brings me to a follow-up question. Should the names of
>> the
>>>>>>>>>> methods reflect the included total invocation count? We have
>> to
>>>> rename
>>>>>>>>>> them anyways. One option would be to simply add `Total` to the
>>>> method
>>>>>>>>>> names, i.e., `addLatencyAndRateAndTotalSensor` and
>>>>>>>>>> `addRateAndTotalSensor` (alternatively without the `And`s).
>> Since
>>>>>>>>>> those sensors record exclusively invocations, another option
>>>> would be
>>>>>>>>>> `addInvocationSensor` and `addInvocationSensorWithoutLatency`.
>>>>>>>>>>
>>>>>>>>>> As far as I can see, we have sensors to record invocations but
>>>> none to
>>>>>>>>>> record amounts. Is that intentional? No need to add it to this
>>>> KIP, I
>>>>>>>>>> am just curious.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Bruno
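[Editor's illustration] The naming question above hinges on a sensor that records both a rate and a running total from a single invocation. The following self-contained sketch shows what such a sensor tracks; the class and method names are illustrative only and are not the actual Kafka `Sensor` API.

```java
// Illustrative stand-in for a sensor recording invocations, exposing both
// a running total and a rate; names are hypothetical, not Kafka's API.
class InvocationSensor {
    private long total = 0;
    private final long windowStartMs;

    InvocationSensor(long nowMs) {
        this.windowStartMs = nowMs;
    }

    // One record() call per invocation feeds both metrics at once.
    void record() {
        total++;
    }

    long total() {
        return total;
    }

    double ratePerSecond(long nowMs) {
        long elapsedMs = Math.max(1, nowMs - windowStartMs);
        return total * 1000.0 / elapsedMs;
    }

    public static void main(String[] args) {
        InvocationSensor sensor = new InvocationSensor(0);
        for (int i = 0; i < 5; i++) sensor.record();
        System.out.println(sensor.total());             // 5
        System.out.println(sensor.ratePerSecond(1000)); // 5.0
    }
}
```

Because a single `record()` call drives both the total and the rate, a method name mentioning only the rate understates what the sensor reports, which is the point of Bruno's renaming suggestion.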
>>>>>>>>>>
>>>>>>>>>> On Tue, Aug 20, 2019 at 1:13 AM Guozhang Wang <
>> wangg...@gmail.com
>>>>>
>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Bruno,
>>>>>>>>>>>
>>>>>>>>>>> Just realized that for `addRateSensor` and
>>>> `addLatencyAndRateSensor`
>>>>>>>>>> we've
>>>>>>>>>>> actually added the total invocation metric already.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Guozhang
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 19, 2019 at 4:11 PM Guozhang Wang <
>>>> wangg...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Bruno,
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Aug 6, 2019 at 1:51 AM Bruno Cadonna <
>>>> br...@confluent.io>
>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Guozhang,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I left my comments inline.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Jul 18, 2019 at 8:28 PM Guozhang Wang <
>>>> wangg...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello Bruno,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the feedbacks, replied inline.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Jul 1, 2019 at 7:08 AM Bruno Cadonna <
>>>> br...@confluent.io>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Guozhang,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you for the KIP.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) As far as I understand, the StreamsMetrics interface
>> is
>>>> there
>>>>>>>>>> for
>>>>>>>>>>>>>>> user-defined processors. Would it make sense to also add
>> a
>>>>>>>>>> method to
>>>>>>>>>>>>>>> the interface to specify a sensor that records skipped
>>>> records?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Not sure I follow... if users want to add a specific
>>>>>>>>>>>>>> skipped-records sensor, they can still do that as a
>>>>>>>>>>>>>> "throughput" sensor via "addThroughputSensor" and then
>>>>>>>>>>>>>> "record", right?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As an afterthought, maybe it's better to rename `throughput`
>>>>>>>>>>>>>> to `rate` in the public APIs, since the latter is the intended
>>>>>>>>>>>>>> semantics. I did not change it, just to keep the number of API
>>>>>>>>>>>>>> changes small and to deprecate fewer functions. But if we feel
>>>>>>>>>>>>>> it is important we can change it as well.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see now that a user can record the rate of skipped
>> records.
>>>>>>>>>> However,
>>>>>>>>>>>>> I was referring to the total number of skipped records.
>> Maybe
>>>> my
>>>>>>>>>>>>> question should be more general: should we allow the user
>> to
>>>> also
>>>>>>>>>>>>> specify sensors for totals or combinations of rate and
>> totals?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sounds good to me, I will add it to the wiki page as well
>> for
>>>>>>>>>>>> StreamsMetrics.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Regarding the naming, I like `rate` more than `throughput`,
>>>> but I
>>>>>>>>>>>>> would not fight for it.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2) What are the semantics of active-task-process and
>>>>>>>>>>>>> standby-task-process
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ah, good catch, I think I put it in the wrong column. Just
>>>>>>>>>>>>>> some explanation here: within a thread's looped iterations, it
>>>>>>>>>>>>>> will first try to process some records from the active tasks,
>>>>>>>>>>>>>> and then see if there are any standby tasks that can be
>>>>>>>>>>>>>> processed as well (i.e. just reading from the restore consumer
>>>>>>>>>>>>>> and applying to the local stores). The ratio metrics indicate
>>>>>>>>>>>>>> 1) which tasks (active or standby) this thread owns so far,
>>>>>>>>>>>>>> and 2) what percentage of its time it spends on each of them.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> But this metric should really be a task-level one that
>>>>>>>>>>>>>> includes both the thread-id and the task-id, and upon task
>>>>>>>>>>>>>> migrations these metrics will be dynamically deleted /
>>>>>>>>>>>>>> (re-)created. Each task-id may be owned by multiple threads,
>>>>>>>>>>>>>> one active and the others standby, and hence the separation of
>>>>>>>>>>>>>> active / standby still seems necessary.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Makes sense.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 3) How do dropped-late-records and
>> expired-window-record-drop
>>>>>>>>>> relate
>>>>>>>>>>>>>>> to each other? I guess the former is for records that
>> fall
>>>>>>>>>> outside the
>>>>>>>>>>>>>>> grace period and the latter is for records that are
>> processed
>>>>>>>>>> after
>>>>>>>>>>>>>>> the retention period of the window. Is this correct?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, that's correct. The names are indeed a bit confusing
>>>>>>>>>>>>>> since they were historically added in different releases.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> More precisely, the `grace period` is a notion of the operator
>>>>>>>>>>>>>> (hence the metric is node-level, though it would only be used
>>>>>>>>>>>>>> for DSL operators), while the `retention` is a notion of the
>>>>>>>>>>>>>> store (hence the metric is store-level). Usually the grace
>>>>>>>>>>>>>> period will be smaller than the store retention, though.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The processor node is aware of the `grace period`, and when it
>>>>>>>>>>>>>> receives a record that is older than the grace deadline, the
>>>>>>>>>>>>>> record is dropped immediately; otherwise it is still processed
>>>>>>>>>>>>>> and maybe a new update is "put" into the store. The store is
>>>>>>>>>>>>>> aware of its `retention period`, and upon a "put" call, if it
>>>>>>>>>>>>>> realizes the record is older than the retention deadline, that
>>>>>>>>>>>>>> put call is ignored and the metric is recorded.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We have to separate them here since the window store can be
>>>>>>>>>>>>>> used in both the DSL and the PAPI. In the former case the
>>>>>>>>>>>>>> record would likely already be ignored at the processor-node
>>>>>>>>>>>>>> level due to the grace period, which is usually smaller than
>>>>>>>>>>>>>> the retention; but in the PAPI there's no grace period, and
>>>>>>>>>>>>>> hence the processor would likely still process the record and
>>>>>>>>>>>>>> call "put" on the store.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Alright! Got it!
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 4) Is there an actual difference between skipped and
>> dropped
>>>>>>>>>> records?
>>>>>>>>>>>>>>> If not, shall we unify the terminology?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There is. Dropped records are only due to lateness, whereas
>>>>>>>>>>>>>> skipped records can be due to serde errors (where the user's
>>>>>>>>>>>>>> error handling indicates "skip and continue"), timestamp
>>>>>>>>>>>>>> errors, etc.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've considered that maybe a better (more extensible) way
>>>>>>>>>>>>>> would be defining a single metric name, say skipped-records,
>>>>>>>>>>>>>> but using different tags to indicate the skipping reason
>>>>>>>>>>>>>> (errors, windowing semantics, etc.). But there's still a
>>>>>>>>>>>>>> tricky difference: serde-caused skips, for example, happen at
>>>>>>>>>>>>>> the very beginning, before any effects have been taken at all.
>>>>>>>>>>>>>> Some others, e.g. a null key / value at the reduce operator,
>>>>>>>>>>>>>> are only skipped in the middle of processing, i.e. some
>>>>>>>>>>>>>> effects may have already been taken in upstream
>>>>>>>>>>>>>> sub-topologies. That's why for skipped-records I've defined it
>>>>>>>>>>>>>> on both the task level and the node level, and the aggregate
>>>>>>>>>>>>>> of the latter may still be smaller than the former, whereas
>>>>>>>>>>>>>> dropped-records is node-level only.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So how about an even more significant change then: we broaden
>>>>>>>>>>>>>> `dropped-late-records` to `dropped-records`, which is
>>>>>>>>>>>>>> node-level only but includes reasons from lateness to
>>>>>>>>>>>>>> semantics (like a null key) as well; and then we have a
>>>>>>>>>>>>>> task-level-only `skipped-records`, which only records those
>>>>>>>>>>>>>> dropped at the very beginning that did not make it into the
>>>>>>>>>>>>>> processing topology at all. I feel this is a clearer
>>>>>>>>>>>>>> distinction but also a bigger change for users.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I like the way dropped-records and skipped-records are now
>>>>>>>>>>>>> defined. My follow-up question is whether we should give those
>>>>>>>>>>>>> metrics names that better describe their semantics, like:
>>>>>>>>>>>>>
>>>>>>>>>>>>> dropped-records-at-source and dropped-records-at-processor
>>>>>>>>>>>>>
>>>>>>>>>>>>> or
>>>>>>>>>>>>>
>>>>>>>>>>>>> records-dropped-at-source and records-dropped-at-processor
>>>>>>>>>>>>>
>>>>>>>>>>>>> or
>>>>>>>>>>>>>
>>>>>>>>>>>>> source-dropped-records and processor-dropped-records
>>>>>>>>>>>>>
>>>>>>>>>>>>> or alternatively with "skipped". However, I would use the same
>>>>>>>>>>>>> term as in expired-window-record-drop.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maybe we should also consider renaming
>>>>>>>>>>>>> expired-window-record-drop to expired-window-record-dropped to
>>>>>>>>>>>>> be consistent.
>>>>>>>>>>>>>
>>>>>>>>>>>>> WDYT?
>>>>>>>>>>>>>
>>>>>>>>>>>> I was not considering "expired-window-record-drop" before since
>>>>>>>>>>>> it is a store-level metric; I was only considering the
>>>>>>>>>>>> task-level (skipped-records) and processor-node-level
>>>>>>>>>>>> (dropped-records) metrics, and I'm deliberately using different
>>>>>>>>>>>> terms to hint to users that these metrics are at different
>>>>>>>>>>>> levels.
>>>>>>>>>>>>
>>>>>>>>>>>> I still feel that using `skip` for the task-level metric, to
>>>>>>>>>>>> indicate that the record was not processed at all, and `drop`
>>>>>>>>>>>> for the processor-level metric, to indicate that the record was
>>>>>>>>>>>> only dropped at this stage of the topology, is the better
>>>>>>>>>>>> option; but I'm also okay with some finer-grained metrics so
>>>>>>>>>>>> that we can align the processor level with the store level (they
>>>>>>>>>>>> are on the same granularity anyway), like:
>>>>>>>>>>>>
>>>>>>>>>>>> `dropped-records-null-field`: at processor nodes
>>>>>>>>>>>>
>>>>>>>>>>>> `dropped-records-too-late`: at processor nodes
>>>>>>>>>>>>
>>>>>>>>>>>> `dropped-records-expired-window`: at window-stores
>>>>>>>>>>>>
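[Editor's illustration] The three finer-grained names just listed amount to counting dropped records keyed by the reason for the drop. A tiny self-contained sketch of that idea (purely illustrative; not the actual Kafka metrics implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative counter for dropped records grouped by reason, echoing the
// finer-grained metric names proposed in the thread; hypothetical code.
class DroppedRecordCounters {
    private final Map<String, Long> counts = new HashMap<>();

    void record(String reason) {
        counts.merge(reason, 1L, Long::sum);
    }

    long count(String reason) {
        return counts.getOrDefault(reason, 0L);
    }

    public static void main(String[] args) {
        DroppedRecordCounters counters = new DroppedRecordCounters();
        counters.record("too-late");       // recorded at a processor node
        counters.record("too-late");
        counters.record("expired-window"); // recorded at a window store
        System.out.println(counters.count("too-late"));       // 2
        System.out.println(counters.count("expired-window")); // 1
    }
}
```

Whether the reasons become separate metric names (as in the list above) or tags on a single `dropped-records` metric, the bookkeeping is the same per-reason count.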
>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 5) What happens with removed metrics when the user sets
>> the
>>>>>>>>>> version of
>>>>>>>>>>>>>>> "built.in.metrics.version" to 2.2-
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think for the redundant ones like "forward-rate" and
>>>>>>>>>>>>>> "destroy-rate" we can still remove them with 2.2- as well; for
>>>>>>>>>>>>>> the other ones that are removed / replaced, like the
>>>>>>>>>>>>>> thread-level skipped-records, we should still maintain them.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Could you add this comment about the removal of redundant
>>>>>>>>>>>>> metrics to the KIP such that it is documented somewhere?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, for sure.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Bruno
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I've also decided to remove the rebalance-related metrics from
>>>>>>>>>>>> the instance level and move them to the consumer itself as part
>>>>>>>>>>>> of KIP-429.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> -- Guozhang
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> -- Guozhang
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> -- Guozhang
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> -- Guozhang
>>>>
>>>
>>>
>>> --
>>> -- Guozhang
>>
> 
