Thanks Chris,

The changes look good to me.

1) Regarding the suggestion to reduce the key sizes, the only intent was to make the records easier to read, but I had missed the fact that "org.apache.kafka.connect" isn't always going to be the prefix for these keys. We can live with whatever we have.

2) Hmm, I think it just felt like a useful extension to the current mechanism of changing log levels per worker. One place where it might come in handy, and which can't be handled by any of the options listed in the Future Work section, is if somebody wants to observe rebalance-related activity on a subset of workers using finer-grained logs. I am not sure it's a strong enough motivation, but as I said, it just felt like a useful extension. I will leave it to you whether to add it or not (I am ok either way).
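For concreteness, this is roughly what the existing per-worker workaround for that use case looks like: hit the current endpoint (scope=worker in the KIP's terms) once per targeted worker. Just a sketch; the worker addresses and logging namespace are made-up examples, and error handling is omitted:

import json
import urllib.request

# Hypothetical subset of workers whose rebalance activity we want to observe.
workers = ["http://worker-1:8083", "http://worker-2:8083"]
# Namespace covering the distributed runtime (herder, rebalancing, etc.).
namespace = "org.apache.kafka.connect.runtime.distributed"

for base in workers:
    # Existing per-worker endpoint; only the worker that receives the request is affected.
    request = urllib.request.Request(
        url=f"{base}/admin/loggers/{namespace}",
        data=json.dumps({"level": "DEBUG"}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        print(base, response.status)

Of course this assumes the individual workers are directly addressable, which ties back to the load balancer concern you mentioned.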
Thanks!
Sagar.

On Thu, Sep 7, 2023 at 9:26 PM Chris Egerton <chr...@aiven.io.invalid> wrote: > Hi all, > > Thanks again for the reviews! > > > Sagar: > > > The updated definition of last_modified looks good to me. As a > continuation > to point number 2, could we also mention that this could be used to get > insights into the propagation of the cluster wide log level updates. It is > implicit but probably better to add it I feel. > > Sure, done. Added to the end of the "Config topic records" section: "There > may be some delay between when a REST request with scope=cluster is > received and when all workers have read the corresponding record from the > config topic. The last modified timestamp (details above) can serve as a > rudimentary tool to provide insight into the propagation of a cluster-wide > log level adjustment." > > > Yeah I would lean on the side of calling it out explicitly. Since the > behaviour is similar to the current dynamically set log levels (i.e > resetting to the log4j config files levels) so we can call it out stating > that similarity just for completeness sake. It would be useful info for > new/medium level users reading the KIP considering worker restarts is not > uncommon. > > Alright, did this too. Added near the end of the "Config topic records" > section: "Restarting a worker will cause it to discard all cluster-wide > dynamic log level adjustments, and revert to the levels specified in its > Log4j configuration. This mirrors the current behavior with per-worker > dynamic log level adjustments." > > > I had a nit level suggestion but not sure if it makes sense but would > still > call it out. The entire namespace name in the config records key (along > with logger-cluster prefix) seems to be a bit too verbose. My first thought > was to not have the prefix org.apache.kafka.connect in the keys considering > it is the root namespace. But since logging can be enabled at a root level, > can we just use initials like (o.a.k.c) which is also a standard practice. > Let me know what you think? > > Considering these records aren't meant to be user-visible, there doesn't > seem to be a pressing need to reduce their key sizes (though I'll happily > admit that to human eyes, the format is a bit ugly). IMO the increase in > implementation complexity isn't quite worth it, especially considering > there are plenty of logging namespaces that won't begin with > "org.apache.kafka.connect" (likely including all third-party connector > code), like Yash mentions. Is there a motivation for this suggestion that > I'm missing? > > > Lastly, I was also thinking if we could introduce a new parameter which > takes a subset of worker ids and enables logging for them in one go.
But > this is already achievable by invoking scope=worker endpoint n times to > reflect on n workers so maybe not a necessary change. But this could be > useful on a large cluster. Do you think this is worth listing in the Future > Work section? It's not important so can be ignored as well. > > Hmmm... I think I'd rather leave this out for now because I'm just not > certain enough it'd be useful. The one advantage I can think of is > targeting specific workers that are behind a load balancer, but being able > to identify those workers would be a challenge in that scenario anyways. > Besides that, are there any cases that couldn't be addressed more > holistically by targeting based on connector/connector type, like Yash > asks? > > > Ashwin: > > Glad we're on the same page RE request forwarding and integration vs. > system tests! Let me know if anything else comes up that you'd like to > discuss. > > > Yash: > > Glad that it makes sense to keep these changes ephemeral. I'm not quite > confident enough to put persistent updates in the "Future work" section but > have a sneaking suspicion that this isn't the last we'll see of that > request. Time will tell... > > > Thanks again, all! > > Cheers, > > Chris > > On Wed, Sep 6, 2023 at 8:36 AM Yash Mayya <yash.ma...@gmail.com> wrote: > > > Hi Chris, > > > > Thanks for the clarification on the last modified timestamp tracking here > > and on the KIP, things look good to me now. > > > > On the persistence front, I hadn't considered the interplay between > levels > > set by the log level REST APIs and those set by the log4j configuration > > files, and the user confusion that could arise from it. I think your > > argument for keeping the changes made by these log level admin endpoints > > ephemeral makes sense, thanks! > > > > Hi Sagar, > > > > > The entire namespace name in the config > > > records key (along with logger-cluster prefix) > > > seems to be a bit too verbose. My first > > > thought was to not have the prefix > > > org.apache.kafka.connect in the keys > > > considering it is the root namespace. But > > > since logging can be enabled at a root level, > > > can we just use initials like (o.a.k.c) which is > > > also a standard practice. > > > > We allow modifying the log levels for any namespace - i.e. even packages > > and classes outside of Kafka Connect itself (think connector classes, > > dependencies etc.). I'm not sure I follow how we'd avoid naming clashes > > without using the fully qualified name? > > > > > I was also thinking if we could introduce a > > > new parameter which takes a subset of > > > worker ids and enables logging for them in > > > one go > > > > The future section already talks about potentially introducing new scopes > > such as a specific connector or a specific connector plugin. Are there > any > > use cases apart from these that would be satisfied by allowing changing > the > > log levels on a subset of workers in a Kafka Connect cluster? > > > > Thanks, > > Yash > > > > On Wed, Sep 6, 2023 at 7:41 AM Sagar <sagarmeansoc...@gmail.com> wrote: > > > > > Hey Chris, > > > > > > Thanks for making the updates. > > > > > > The updated definition of last_modified looks good to me. As a > > continuation > > > to point number 2, could we also mention that this could be used to get > > > insights into the propagation of the cluster wide log level updates. It > > is > > > implicit but probably better to add it I feel. 
> > > > > > Regarding > > > > > > It's a little indirect on the front of worker restart behavior, so let > me > > > > know if that especially should be fleshed out a bit more (possibly by > > > > calling it out in the "Config topic records" section). > > > > > > > > > Yeah I would lean on the side of calling it out explicitly. Since the > > > behaviour is similar to the current dynamically set log levels (i.e > > > resetting to the log4j config files levels) so we can call it out > stating > > > that similarity just for completeness sake. It would be useful info for > > > new/medium level users reading the KIP considering worker restarts is > not > > > uncommon. > > > > > > > > > Thanks, I'm glad that this seems reasonable. I've tentatively added > this > > to > > > > the rejected alternatives section, but am happy to continue the > > > > conversation if anyone wants to explore that possibility further. > > > > > > > > > +1 > > > > > > I had a nit level suggestion but not sure if it makes sense but would > > still > > > call it out. The entire namespace name in the config records key (along > > > with logger-cluster prefix) seems to be a bit too verbose. My first > > thought > > > was to not have the prefix org.apache.kafka.connect in the keys > > considering > > > it is the root namespace. But since logging can be enabled at a root > > level, > > > can we just use initials like (o.a.k.c) which is also a standard > > practice. > > > Let me know what you think? > > > > > > Lastly, I was also thinking if we could introduce a new parameter which > > > takes a subset of worker ids and enables logging for them in one go. > But > > > this is already achievable by invoking scope=worker endpoint n times to > > > reflect on n workers so maybe not a necessary change. But this could be > > > useful on a large cluster. Do you think this is worth listing in the > > Future > > > Work section? It's not important so can be ignored as well. > > > > > > Thanks! > > > Sagar. > > > > > > > > > On Wed, Sep 6, 2023 at 12:08 AM Chris Egerton <chr...@aiven.io.invalid > > > > > wrote: > > > > > > > Hi Sagar, > > > > > > > > Thanks for your thoughts! Responses inline: > > > > > > > > > 1) Considering that you have now clarified here that last_modified > > > field > > > > would be stored in-memory, it is not mentioned in the KIP. Although > > > that's > > > > something that's understandable, it wasn't apparent when reading the > > KIP. > > > > Probably, we could mention it? Also, what happens if a worker > restarts? > > > In > > > > that case, since the log level update message would be pre-dating the > > > > startup of the worker, it would be ignored? We should probably > mention > > > that > > > > behaviour as well IMO. > > > > > > > > I've tweaked the second paragraph of the "Last modified timestamp" > > > section > > > > to try to clarify this without getting too verbose: "Modification > times > > > > will be tracked in-memory and determined by when they are applied by > > the > > > > worker, as opposed to when they are requested by the user or > persisted > > to > > > > the config topic (details below). If no modifications to the > namespace > > > have > > > > been made since the worker finished startup, they will be null." Does > > > that > > > > feel sufficient? It's a little indirect on the front of worker > restart > > > > behavior, so let me know if that especially should be fleshed out a > bit > > > > more (possibly by calling it out in the "Config topic records" > > section). 
> > > > > > > > > 2) Staying on the last_modified field, it's utility is also not too > > > clear > > > > to me. Can you add some examples of how it can be useful for > debugging > > > etc? > > > > > > > > The cluster-wide API relaxes the consistency guarantees of the > existing > > > > worker-local API. With the latter, users can be certain that once > they > > > > receive a 2xx response, the logging level on that worker has been > > > updated. > > > > With the former, users know that the logging level will eventually be > > > > updated, but insight into the propagation of that update across the > > > cluster > > > > is limited. Although it's a little primitive, I'm hoping that the > last > > > > modified timestamp will be enough to help shed some light on this > > > process. > > > > We could also explore exposing the provenance of logging levels > (which > > > maps > > > > fairly cleanly to the scope of the request from which they > originated), > > > but > > > > that feels like overkill at the moment. > > > > > > > > > 3) In the test plan, should we also add a test when the scope > > parameter > > > > passed is non-null and neither worker nor cluster? In this case the > > > > behaviour should be similar to the default case. > > > > > > > > Good call! Done. > > > > > > > > > 4) I had the same question as Yash regarding persistent > cluster-wide > > > > logging level. I think you have explained it well and we can skip it > > for > > > > now. > > > > > > > > Thanks, I'm glad that this seems reasonable. I've tentatively added > > this > > > to > > > > the rejected alternatives section, but am happy to continue the > > > > conversation if anyone wants to explore that possibility further. > > > > > > > > Cheers, > > > > > > > > Chris > > > > > > > > On Tue, Sep 5, 2023 at 11:48 AM Sagar <sagarmeansoc...@gmail.com> > > wrote: > > > > > > > > > Hey Chris, > > > > > > > > > > Thanks for the KIP. Seems like a useful feature. I have some more > > > > > questions/comments: > > > > > > > > > > 1) Considering that you have now clarified here that last_modified > > > field > > > > > would be stored in-memory, it is not mentioned in the KIP. Although > > > > that's > > > > > something that's understandable, it wasn't apparent when reading > the > > > KIP. > > > > > Probably, we could mention it? Also, what happens if a worker > > restarts? > > > > In > > > > > that case, since the log level update message would be pre-dating > the > > > > > startup of the worker, it would be ignored? We should probably > > mention > > > > that > > > > > behaviour as well IMO. > > > > > > > > > > 2) Staying on the last_modified field, it's utility is also not too > > > clear > > > > > to me. Can you add some examples of how it can be useful for > > debugging > > > > etc? > > > > > > > > > > 3) In the test plan, should we also add a test when the scope > > parameter > > > > > passed is non-null and neither worker nor cluster? In this case the > > > > > behaviour should be similar to the default case. > > > > > > > > > > 4) I had the same question as Yash regarding persistent > cluster-wide > > > > > logging level. I think you have explained it well and we can skip > it > > > for > > > > > now. > > > > > > > > > > Thanks! > > > > > Sagar. > > > > > > > > > > On Tue, Sep 5, 2023 at 8:49 PM Chris Egerton > <chr...@aiven.io.invalid > > > > > > > > wrote: > > > > > > > > > > > Hi all, > > > > > > > > > > > > Thank you so much for the generous review comments! Happy to see > > > > interest > > > > > > in this feature. 
Inline responses follow. > > > > > > > > > > > > > > > > > > Ashwin: > > > > > > > > > > > > > Don't you foresee a future scope type which may require cluster > > > > > metadata > > > > > > ? > > > > > > In that case, isn't it better to forward the requests to the > leader > > > in > > > > > the > > > > > > initial implementation ? > > > > > > > > > > > > I agree with Yash here: we can cross that bridge when we come to > > it, > > > > > unless > > > > > > there are problems that we'd encounter then that would be better > > > > > addressed > > > > > > by adding request forwarding now. One potential argument I can > > think > > > of > > > > > is > > > > > > that the UX would be a little strange if the semantics for this > > > > endpoint > > > > > > differ depending on the value of the scope parameter (some values > > > would > > > > > be > > > > > > affected by in-progress rebalances, and some would not), but I > > don't > > > > know > > > > > > if making scope=cluster more brittle in the name of consistency > > with, > > > > > e.g., > > > > > > scope=connectorType:foo is really a worthwhile tradeoff. > Thoughts? > > > > > > > > > > > > > I would also recommend an additional system test for Standalone > > > > herder > > > > > to > > > > > > ensure that the new scope parameter is honored and the response > > > > contains > > > > > > the last modified time. > > > > > > > > > > > > Ah, great call! I love the new testing plan section. I also share > > > > Yash's > > > > > > concerns about adding a new system test though (at this point, > > > they're > > > > so > > > > > > painful to write, test, and debug that in most circumstances I > > > consider > > > > > > them a last resort). Do you think it'd be reasonable to add > > > end-to-end > > > > > > verification for this with an integration test instead? > > > > > > > > > > > > > > > > > > Yash: > > > > > > > > > > > > > From the proposed changes section, it isn't very clear to me > how > > > > we'll > > > > > be > > > > > > tracking this last modified timestamp to be returned in responses > > for > > > > the > > > > > > *GET > > > > > > /admin/loggers* and *GET /admin/loggers/{logger}* endpoints. > Could > > > you > > > > > > please elaborate on this? Also, will we track the last modified > > > > timestamp > > > > > > even for worker scoped modifications where we won't write any > > records > > > > to > > > > > > the config topic and the requests will essentially be processed > > > > > > synchronously? > > > > > > > > > > > > RE timestamp tracking: I was thinking we'd store the timestamp > for > > > each > > > > > > level in-memory and, whenever we change the level for a > namespace, > > > > update > > > > > > its timestamp to the current wallclock time. Concretely, this > means > > > > we'd > > > > > > want the timestamp for some logger `logger` to be as soon as > > possible > > > > > after > > > > > > the call to `logger.setLevel(level)` for some level `level`. I'm > > > > honestly > > > > > > unsure how to clarify this further in the KIP; is there anything > in > > > > there > > > > > > that strikes you as particularly ambiguous that we can tweak to > be > > > more > > > > > > clear? > > > > > > > > > > > > RE scope distinction for timestamps: I've updated the KIP to > > clarify > > > > this > > > > > > point, adding this sentence: "Timestamps will be updated > regardless > > > of > > > > > > whether the namespace update was applied using scope=worker or > > > > > > scope=cluster.". 
Let me know what you think > > > > > > > > > > > > > In the current synchronous implementation for the *PUT > > > > > > /admin/loggers/{logger} *endpoint, we return a 404 error if the > > level > > > > is > > > > > > invalid (i.e. not one of TRACE, DEBUG, WARN etc.). Since the new > > > > cluster > > > > > > scoped variant of the endpoint will be asynchronous, can we also > > add > > > a > > > > > > validation to synchronously surface erroneous log levels to > users? > > > > > > > > > > > > Good call! I think we don't have to be explicit about this in the > > > > > proposed > > > > > > changes section, but it's a great fit for the testing plan, where > > > I've > > > > > > added it: "Ensure that cluster-scoped requests with invalid > logging > > > > > levels > > > > > > are rejected with a 404 response" > > > > > > > > > > > > > I'm curious to know what the rationale here is? In KIP-745, the > > > > stated > > > > > > reasoning behind ignoring restart requests during worker startup > > was > > > > that > > > > > > the worker will anyway be starting all connectors and tasks > > assigned > > > to > > > > > it > > > > > > so the restart request is essentially meaningless. With the log > > level > > > > API > > > > > > however, wouldn't it make more sense to apply any cluster scoped > > > > > > modifications to new workers in the cluster too? This also > applies > > to > > > > any > > > > > > workers that are restarted soon after a request is made to *PUT > > > > > > /admin/loggers/{logger}?scope=cluster *on another worker. Maybe > we > > > > could > > > > > > stack up all the cluster scoped log level modification requests > > > during > > > > > the > > > > > > config topic read at worker startup and apply the latest ones per > > > > > namespace > > > > > > (assuming older records haven't already been compacted) after > we've > > > > > > finished reading to the end of the config topic? > > > > > > > > > > > > It's definitely possible to implement persistent cluster-wide > > logging > > > > > level > > > > > > adjustments, with the possible exception that none will be > applied > > > > until > > > > > > the worker has finished reading to the end of the config topic. > The > > > > > > rationale for keeping these updates ephemeral is to continue to > > give > > > > > > priority to workers' Log4j configuration files, with the > underlying > > > > > > philosophy that this endpoint is still only intended for > debugging > > > > > > purposes, as opposed to cluster-wide configuration. Permanent > > changes > > > > can > > > > > > already be made by tweaking the Log4j file for a worker and then > > > > > restarting > > > > > > it. If a restart is too expensive for a permanent change, then > the > > > > change > > > > > > can be applied immediately via the REST API, and staged via the > > Log4j > > > > > > configuration file (which will then be used the next time the > > worker > > > is > > > > > > restarted, whenever that happens). > > > > > > > > > > > > We could potentially explore persistent cluster-wide logging > > > > adjustments > > > > > in > > > > > > the future (or, if people feel strongly, right now), but it'd be > > more > > > > > > complex. How would we handle the case where a user has updated > > their > > > > > Log4j > > > > > > configuration file but continues to see the same logging level > for > > a > > > > > > namespace, in a way that greatly reduces or even eliminates the > > odds > > > of > > > > > > headaches or footguns? 
How would we allow users to reset or undo > > > > > > cluster-wide configuration updates, in order to revert back to > > > > whatever's > > > > > > specified in their Log4j configuration file? Would we want to > give > > > > users > > > > > > control over whether a dynamic update is persistent or ephemeral? > > > What > > > > > > would the implications be for finer-grained scoping (especially > > with > > > > > > regards to resetting dynamic changes)? Overall it doesn't seem > > worth > > > > the > > > > > > work to tackle all of this right now, as long as we're not > painting > > > > > > ourselves into a corner. But as always, happy to hear what you > and > > > > others > > > > > > think! > > > > > > > > > > > > > > > > > > Federico: > > > > > > > > > > > > > Due to the idempotent nature of PUT, I guess that the > > last_modified > > > > > > timestamp won't change if the same request is repeated > > successively. > > > > > > Should we add a unit test for that? > > > > > > > > > > > > Great call! I've added this to the testing plan with unit and > > > > end-to-end > > > > > > tests (seems fairly cheap to do both; LMK if e2e feels like > > overkill > > > > > > though). > > > > > > > > > > > > > > > > > > Thanks again for all of your thoughts! > > > > > > > > > > > > Cheers, > > > > > > > > > > > > Chris > > > > > > > > > > > > On Mon, Sep 4, 2023 at 2:53 AM Federico Valeri < > > fedeval...@gmail.com > > > > > > > > > > wrote: > > > > > > > > > > > > > Hi Chris, thanks. This looks like a useful feature. > > > > > > > > > > > > > > Due to the idempotent nature of PUT, I guess that the > > last_modified > > > > > > > timestamp won't change if the same request is repeated > > > successively. > > > > > > > Should we add a unit test for that? > > > > > > > > > > > > > > On Mon, Sep 4, 2023 at 6:17 AM Ashwin > > <apan...@confluent.io.invalid > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > Hi Chris, > > > > > > > > > > > > > > > > Thanks for thinking about this useful feature ! > > > > > > > > I had a question regarding > > > > > > > > > > > > > > > > > Since cluster metadata is not required to handle these > types > > of > > > > > > > request, > > > > > > > > they will not be forwarded to the leader > > > > > > > > > > > > > > > > And later, we also mention about supporting more scope types > in > > > the > > > > > > > future. > > > > > > > > Don't you foresee a future scope type which may require > cluster > > > > > > metadata > > > > > > > ? > > > > > > > > In that case, isn't it better to forward the requests to the > > > leader > > > > > in > > > > > > > the > > > > > > > > initial implementation ? > > > > > > > > > > > > > > > > I would also recommend an additional system test for > Standalone > > > > > herder > > > > > > to > > > > > > > > ensure that the new scope parameter is honored and the > response > > > > > > contains > > > > > > > > the last modified time. 
> > > > > > > > > > > > > > > > Thanks, > > > > > > > > Ashwin > > > > > > > > > > > > > > > > On Sat, Sep 2, 2023 at 5:12 AM Chris Egerton > > > > <chr...@aiven.io.invalid > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Hi all, > > > > > > > > > > > > > > > > > > Can't imagine a worse time to publish a new KIP (it's late > > on a > > > > > > Friday > > > > > > > and > > > > > > > > > we're in the middle of the 3.6.0 release), but I wanted to > > put > > > > > forth > > > > > > > > > KIP-976 for discussion: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-976%3A+Cluster-wide+dynamic+log+adjustment+for+Kafka+Connect > > > > > > > > > > > > > > > > > > TL;DR: The API to dynamically adjust log levels at runtime > > with > > > > > > > Connect is > > > > > > > > > great, and I'd like to augment it with support to adjust > log > > > > levels > > > > > > for > > > > > > > > > every worker in the cluster (instead of just the worker > that > > > > > receives > > > > > > > the > > > > > > > > > REST request). > > > > > > > > > > > > > > > > > > I look forward to everyone's thoughts, but expect that this > > > will > > > > > > > probably > > > > > > > > > take a bump or two once the dust has settled on 3.6.0. Huge > > > > thanks > > > > > to > > > > > > > > > everyone that's contributed to that release so far, > > especially > > > > our > > > > > > > release > > > > > > > > > manager Satish! > > > > > > > > > > > > > > > > > > Cheers, > > > > > > > > > > > > > > > > > > Chris > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >