Hi Chris,

> I agree with Yash here: we can cross that bridge when we come to it, unless there are problems that we'd encounter then that would be better addressed by adding request forwarding now. One potential argument I can think of is that the UX would be a little strange if the semantics for this endpoint differ depending on the value of the scope parameter (some values would be affected by in-progress rebalances, and some would not), but I don't know if making scope=cluster more brittle in the name of consistency with, e.g., scope=connectorType:foo is really a worthwhile tradeoff. Thoughts?

Yes, the inconsistency in UX was what drove me to comment. I understand that this will add complexity (unnecessary for the current implementation) and am fine with deferring it for later.
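Just to spell out the asymmetry I was reacting to, the two paths could end up looking roughly like the sketch below. This is only an illustration based on my reading of the KIP discussion; the class and method names are made up and are not taken from the KIP or the Connect codebase.

    // Rough illustration only; all names here are placeholders, not real Connect APIs.
    public class ScopedLoggerRequestSketch {

        public void handleSetLevel(String namespace, String level, String scope) {
            if (scope == null || scope.equals("worker")) {
                // scope=worker (and the default): applied synchronously, on this worker
                // only, without writing anything to the config topic
                setLevelLocally(namespace, level);
            } else if (scope.equals("cluster")) {
                // scope=cluster: a record is written to the config topic and every worker
                // applies it asynchronously as it reads that record, so no forwarding to
                // the leader is needed today; a future scope that did need cluster
                // metadata would not have that luxury
                writeLoggerLevelToConfigTopic(namespace, level);
            }
        }

        private void setLevelLocally(String namespace, String level) {
            // Placeholder for the existing single-worker behavior
        }

        private void writeLoggerLevelToConfigTopic(String namespace, String level) {
            // Placeholder for producing a logger-level record to the config topic
        }
    }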
> Ah, great call! I love the new testing plan section. I also share Yash's concerns about adding a new system test though (at this point, they're so painful to write, test, and debug that in most circumstances I consider them a last resort). Do you think it'd be reasonable to add end-to-end verification for this with an integration test instead?

Sounds good to me. Honestly, I did not know that system tests are more "heavy weight" than integration tests - so thanks for letting me know, Yash & Chris.

Cheers,
Ashwin

On Tue, Sep 5, 2023 at 8:51 PM Chris Egerton <chr...@aiven.io.invalid> wrote:

> Hi all,
>
> Thank you so much for the generous review comments! Happy to see interest in this feature. Inline responses follow.
>
> Ashwin:
>
> > Don't you foresee a future scope type which may require cluster metadata? In that case, isn't it better to forward the requests to the leader in the initial implementation?
>
> I agree with Yash here: we can cross that bridge when we come to it, unless there are problems that we'd encounter then that would be better addressed by adding request forwarding now. One potential argument I can think of is that the UX would be a little strange if the semantics for this endpoint differ depending on the value of the scope parameter (some values would be affected by in-progress rebalances, and some would not), but I don't know if making scope=cluster more brittle in the name of consistency with, e.g., scope=connectorType:foo is really a worthwhile tradeoff. Thoughts?
>
> > I would also recommend an additional system test for Standalone herder to ensure that the new scope parameter is honored and the response contains the last modified time.
>
> Ah, great call! I love the new testing plan section. I also share Yash's concerns about adding a new system test though (at this point, they're so painful to write, test, and debug that in most circumstances I consider them a last resort). Do you think it'd be reasonable to add end-to-end verification for this with an integration test instead?
>
> Yash:
>
> > From the proposed changes section, it isn't very clear to me how we'll be tracking this last modified timestamp to be returned in responses for the *GET /admin/loggers* and *GET /admin/loggers/{logger}* endpoints. Could you please elaborate on this? Also, will we track the last modified timestamp even for worker scoped modifications where we won't write any records to the config topic and the requests will essentially be processed synchronously?
>
> RE timestamp tracking: I was thinking we'd store the timestamp for each level in-memory and, whenever we change the level for a namespace, update its timestamp to the current wallclock time. Concretely, this means we'd want the timestamp for some logger `logger` to be captured as soon as possible after the call to `logger.setLevel(level)` for some level `level`. I'm honestly unsure how to clarify this further in the KIP; is there anything in there that strikes you as particularly ambiguous that we can tweak to be more clear?
>
> RE scope distinction for timestamps: I've updated the KIP to clarify this point, adding this sentence: "Timestamps will be updated regardless of whether the namespace update was applied using scope=worker or scope=cluster." Let me know what you think.
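For illustration, the in-memory, per-namespace tracking described above could look roughly like the following sketch. The class, field, and method names are made up for the sketch rather than taken from the KIP or the Connect codebase, and the early return for an unchanged level is just one possible way to keep a repeated identical request from bumping the timestamp (which comes up later in the thread).

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    import org.apache.log4j.Level;
    import org.apache.log4j.LogManager;
    import org.apache.log4j.Logger;

    // Illustrative sketch only: per-namespace last-modified times kept in memory.
    public class LoggerLevelTracker {

        private final Map<String, Long> lastModifiedMs = new ConcurrentHashMap<>();

        public void setLevel(String namespace, Level level) {
            Logger logger = LogManager.getLogger(namespace);
            if (level.equals(logger.getLevel())) {
                // Level is unchanged, so leave the timestamp untouched
                return;
            }
            logger.setLevel(level);
            // Capture the wallclock time as soon as possible after the level change
            lastModifiedMs.put(namespace, System.currentTimeMillis());
        }

        public Long lastModified(String namespace) {
            // Null if the namespace has never been modified dynamically
            return lastModifiedMs.get(namespace);
        }
    }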
> > In the current synchronous implementation for the *PUT /admin/loggers/{logger}* endpoint, we return a 404 error if the level is invalid (i.e. not one of TRACE, DEBUG, WARN etc.). Since the new cluster scoped variant of the endpoint will be asynchronous, can we also add a validation to synchronously surface erroneous log levels to users?
>
> Good call! I think we don't have to be explicit about this in the proposed changes section, but it's a great fit for the testing plan, where I've added it: "Ensure that cluster-scoped requests with invalid logging levels are rejected with a 404 response"
>
> > I'm curious to know what the rationale here is? In KIP-745, the stated reasoning behind ignoring restart requests during worker startup was that the worker will anyway be starting all connectors and tasks assigned to it so the restart request is essentially meaningless. With the log level API however, wouldn't it make more sense to apply any cluster scoped modifications to new workers in the cluster too? This also applies to any workers that are restarted soon after a request is made to *PUT /admin/loggers/{logger}?scope=cluster* on another worker. Maybe we could stack up all the cluster scoped log level modification requests during the config topic read at worker startup and apply the latest ones per namespace (assuming older records haven't already been compacted) after we've finished reading to the end of the config topic?
>
> It's definitely possible to implement persistent cluster-wide logging level adjustments, with the possible exception that none will be applied until the worker has finished reading to the end of the config topic. The rationale for keeping these updates ephemeral is to continue to give priority to workers' Log4j configuration files, with the underlying philosophy that this endpoint is still only intended for debugging purposes, as opposed to cluster-wide configuration. Permanent changes can already be made by tweaking the Log4j file for a worker and then restarting it. If a restart is too expensive for a permanent change, then the change can be applied immediately via the REST API, and staged via the Log4j configuration file (which will then be used the next time the worker is restarted, whenever that happens).
>
> We could potentially explore persistent cluster-wide logging adjustments in the future (or, if people feel strongly, right now), but it'd be more complex. How would we handle the case where a user has updated their Log4j configuration file but continues to see the same logging level for a namespace, in a way that greatly reduces or even eliminates the odds of headaches or footguns? How would we allow users to reset or undo cluster-wide configuration updates, in order to revert back to whatever's specified in their Log4j configuration file? Would we want to give users control over whether a dynamic update is persistent or ephemeral? What would the implications be for finer-grained scoping (especially with regards to resetting dynamic changes)? Overall it doesn't seem worth the work to tackle all of this right now, as long as we're not painting ourselves into a corner. But as always, happy to hear what you and others think!
>
> Federico:
>
> > Due to the idempotent nature of PUT, I guess that the last_modified timestamp won't change if the same request is repeated successively. Should we add a unit test for that?
>
> Great call! I've added this to the testing plan with unit and end-to-end tests (seems fairly cheap to do both; LMK if e2e feels like overkill though).
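A unit test for that behavior could look something like the sketch below, written against the made-up LoggerLevelTracker class from the earlier sketch, so it is purely illustrative rather than a description of the actual test suite.

    import org.apache.log4j.Level;
    import org.junit.jupiter.api.Test;

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import static org.junit.jupiter.api.Assertions.assertNotNull;

    // Illustrative sketch only; LoggerLevelTracker is the made-up class sketched above.
    public class LoggerLevelTrackerTest {

        @Test
        public void repeatedIdenticalPutShouldNotChangeLastModified() {
            LoggerLevelTracker tracker = new LoggerLevelTracker();

            tracker.setLevel("org.apache.kafka.connect", Level.DEBUG);
            Long firstTimestamp = tracker.lastModified("org.apache.kafka.connect");
            assertNotNull(firstTimestamp);

            // Repeating the exact same request should leave the timestamp untouched
            tracker.setLevel("org.apache.kafka.connect", Level.DEBUG);
            Long secondTimestamp = tracker.lastModified("org.apache.kafka.connect");

            assertEquals(firstTimestamp, secondTimestamp);
        }
    }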
> Thanks again for all of your thoughts!
>
> Cheers,
>
> Chris
>
> On Mon, Sep 4, 2023 at 2:53 AM Federico Valeri <fedeval...@gmail.com> wrote:
>
> > Hi Chris, thanks. This looks like a useful feature.
> >
> > Due to the idempotent nature of PUT, I guess that the last_modified timestamp won't change if the same request is repeated successively. Should we add a unit test for that?
> >
> > On Mon, Sep 4, 2023 at 6:17 AM Ashwin <apan...@confluent.io.invalid> wrote:
> > >
> > > Hi Chris,
> > >
> > > Thanks for thinking about this useful feature! I had a question regarding
> > >
> > > > Since cluster metadata is not required to handle these types of request, they will not be forwarded to the leader
> > >
> > > And later, we also mention supporting more scope types in the future. Don't you foresee a future scope type which may require cluster metadata? In that case, isn't it better to forward the requests to the leader in the initial implementation?
> > >
> > > I would also recommend an additional system test for Standalone herder to ensure that the new scope parameter is honored and the response contains the last modified time.
> > >
> > > Thanks,
> > > Ashwin
> > >
> > > On Sat, Sep 2, 2023 at 5:12 AM Chris Egerton <chr...@aiven.io.invalid> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Can't imagine a worse time to publish a new KIP (it's late on a Friday and we're in the middle of the 3.6.0 release), but I wanted to put forth KIP-976 for discussion:
> > > >
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-976%3A+Cluster-wide+dynamic+log+adjustment+for+Kafka+Connect
> > > >
> > > > TL;DR: The API to dynamically adjust log levels at runtime with Connect is great, and I'd like to augment it with support to adjust log levels for every worker in the cluster (instead of just the worker that receives the REST request).
> > > >
> > > > I look forward to everyone's thoughts, but expect that this will probably take a bump or two once the dust has settled on 3.6.0. Huge thanks to everyone that's contributed to that release so far, especially our release manager Satish!
> > > >
> > > > Cheers,
> > > >
> > > > Chris