Hi Chris,

> I agree with Yash here: we can cross that bridge when we come to it, unless there are problems that we'd encounter then that would be better addressed by adding request forwarding now. One potential argument I can think of is that the UX would be a little strange if the semantics for this endpoint differ depending on the value of the scope parameter (some values would be affected by in-progress rebalances, and some would not), but I don't know if making scope=cluster more brittle in the name of consistency with, e.g., scope=connectorType:foo is really a worthwhile tradeoff. Thoughts?

Yes, the inconsistency in UX was what drove me to comment. I understand that this will add complexity (unnecessary for the current implementation) and am fine with deferring it for later.
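Just to spell out the asymmetry I was reacting to, the two paths could end up looking roughly like the sketch below. This is only an illustration based on my reading of the KIP discussion; the class and method names are made up and are not taken from the KIP or the Connect codebase.

    // Rough illustration only; all names here are placeholders, not real Connect APIs.
    public class ScopedLoggerRequestSketch {

        public void handleSetLevel(String namespace, String level, String scope) {
            if (scope == null || scope.equals("worker")) {
                // scope=worker (and the default): applied synchronously, on this worker
                // only, without writing anything to the config topic
                setLevelLocally(namespace, level);
            } else if (scope.equals("cluster")) {
                // scope=cluster: a record is written to the config topic and every worker
                // applies it asynchronously as it reads that record, so no forwarding to
                // the leader is needed today; a future scope that did need cluster
                // metadata would not have that luxury
                writeLoggerLevelToConfigTopic(namespace, level);
            }
        }

        private void setLevelLocally(String namespace, String level) {
            // Placeholder for the existing single-worker behavior
        }

        private void writeLoggerLevelToConfigTopic(String namespace, String level) {
            // Placeholder for producing a logger-level record to the config topic
        }
    }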
> Ah, great call! I love the new testing plan section. I also share Yash's concerns about adding a new system test though (at this point, they're so painful to write, test, and debug that in most circumstances I consider them a last resort). Do you think it'd be reasonable to add end-to-end verification for this with an integration test instead?

Sounds good to me. Honestly, I did not know that system tests are more "heavy weight" than integration tests - so thanks for letting me know, Yash & Chris.

Cheers,
Ashwin

On Tue, Sep 5, 2023 at 8:51 PM Chris Egerton <chr...@aiven.io.invalid> wrote:

> Hi all,
>
> Thank you so much for the generous review comments! Happy to see interest in this feature. Inline responses follow.
>
> Ashwin:
>
> > Don't you foresee a future scope type which may require cluster metadata? In that case, isn't it better to forward the requests to the leader in the initial implementation?
>
> I agree with Yash here: we can cross that bridge when we come to it, unless there are problems that we'd encounter then that would be better addressed by adding request forwarding now. One potential argument I can think of is that the UX would be a little strange if the semantics for this endpoint differ depending on the value of the scope parameter (some values would be affected by in-progress rebalances, and some would not), but I don't know if making scope=cluster more brittle in the name of consistency with, e.g., scope=connectorType:foo is really a worthwhile tradeoff. Thoughts?
>
> > I would also recommend an additional system test for Standalone herder to ensure that the new scope parameter is honored and the response contains the last modified time.
>
> Ah, great call! I love the new testing plan section. I also share Yash's concerns about adding a new system test though (at this point, they're so painful to write, test, and debug that in most circumstances I consider them a last resort). Do you think it'd be reasonable to add end-to-end verification for this with an integration test instead?
>
> Yash:
>
> > From the proposed changes section, it isn't very clear to me how we'll be tracking this last modified timestamp to be returned in responses for the *GET /admin/loggers* and *GET /admin/loggers/{logger}* endpoints. Could you please elaborate on this? Also, will we track the last modified timestamp even for worker scoped modifications where we won't write any records to the config topic and the requests will essentially be processed synchronously?
>
> RE timestamp tracking: I was thinking we'd store the timestamp for each level in-memory and, whenever we change the level for a namespace, update its timestamp to the current wallclock time. Concretely, this means we'd want the timestamp for some logger `logger` to be captured as soon as possible after the call to `logger.setLevel(level)` for some level `level`. I'm honestly unsure how to clarify this further in the KIP; is there anything in there that strikes you as particularly ambiguous that we can tweak to be more clear?
>
> RE scope distinction for timestamps: I've updated the KIP to clarify this point, adding this sentence: "Timestamps will be updated regardless of whether the namespace update was applied using scope=worker or scope=cluster." Let me know what you think.
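For illustration, the in-memory, per-namespace tracking described above could look roughly like the following sketch. The class, field, and method names are made up for the sketch rather than taken from the KIP or the Connect codebase, and the early return for an unchanged level is just one possible way to keep a repeated identical request from bumping the timestamp (which comes up later in the thread).

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    import org.apache.log4j.Level;
    import org.apache.log4j.LogManager;
    import org.apache.log4j.Logger;

    // Illustrative sketch only: per-namespace last-modified times kept in memory.
    public class LoggerLevelTracker {

        private final Map<String, Long> lastModifiedMs = new ConcurrentHashMap<>();

        public void setLevel(String namespace, Level level) {
            Logger logger = LogManager.getLogger(namespace);
            if (level.equals(logger.getLevel())) {
                // Level is unchanged, so leave the timestamp untouched
                return;
            }
            logger.setLevel(level);
            // Capture the wallclock time as soon as possible after the level change
            lastModifiedMs.put(namespace, System.currentTimeMillis());
        }

        public Long lastModified(String namespace) {
            // Null if the namespace has never been modified dynamically
            return lastModifiedMs.get(namespace);
        }
    }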
> > In the current synchronous implementation for the *PUT /admin/loggers/{logger}* endpoint, we return a 404 error if the level is invalid (i.e. not one of TRACE, DEBUG, WARN etc.). Since the new cluster scoped variant of the endpoint will be asynchronous, can we also add a validation to synchronously surface erroneous log levels to users?
>
> Good call! I think we don't have to be explicit about this in the proposed changes section, but it's a great fit for the testing plan, where I've added it: "Ensure that cluster-scoped requests with invalid logging levels are rejected with a 404 response"
>
> > I'm curious to know what the rationale here is? In KIP-745, the stated reasoning behind ignoring restart requests during worker startup was that the worker will anyway be starting all connectors and tasks assigned to it so the restart request is essentially meaningless. With the log level API however, wouldn't it make more sense to apply any cluster scoped modifications to new workers in the cluster too? This also applies to any workers that are restarted soon after a request is made to *PUT /admin/loggers/{logger}?scope=cluster* on another worker. Maybe we could stack up all the cluster scoped log level modification requests during the config topic read at worker startup and apply the latest ones per namespace (assuming older records haven't already been compacted) after we've finished reading to the end of the config topic?
>
> It's definitely possible to implement persistent cluster-wide logging level adjustments, with the possible exception that none will be applied until the worker has finished reading to the end of the config topic. The rationale for keeping these updates ephemeral is to continue to give priority to workers' Log4j configuration files, with the underlying philosophy that this endpoint is still only intended for debugging purposes, as opposed to cluster-wide configuration. Permanent changes can already be made by tweaking the Log4j file for a worker and then restarting it. If a restart is too expensive for a permanent change, then the change can be applied immediately via the REST API, and staged via the Log4j configuration file (which will then be used the next time the worker is restarted, whenever that happens).
>
> We could potentially explore persistent cluster-wide logging adjustments in the future (or, if people feel strongly, right now), but it'd be more complex. How would we handle the case where a user has updated their Log4j configuration file but continues to see the same logging level for a namespace, in a way that greatly reduces or even eliminates the odds of headaches or footguns? How would we allow users to reset or undo cluster-wide configuration updates, in order to revert back to whatever's specified in their Log4j configuration file? Would we want to give users control over whether a dynamic update is persistent or ephemeral? What would the implications be for finer-grained scoping (especially with regards to resetting dynamic changes)? Overall it doesn't seem worth the work to tackle all of this right now, as long as we're not painting ourselves into a corner. But as always, happy to hear what you and others think!
>
> Federico:
>
> > Due to the idempotent nature of PUT, I guess that the last_modified timestamp won't change if the same request is repeated successively. Should we add a unit test for that?
>
> Great call! I've added this to the testing plan with unit and end-to-end tests (seems fairly cheap to do both; LMK if e2e feels like overkill though).
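A unit test for that behavior could look something like the sketch below, written against the made-up LoggerLevelTracker class from the earlier sketch, so it is purely illustrative rather than a description of the actual test suite.

    import org.apache.log4j.Level;
    import org.junit.jupiter.api.Test;

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import static org.junit.jupiter.api.Assertions.assertNotNull;

    // Illustrative sketch only; LoggerLevelTracker is the made-up class sketched above.
    public class LoggerLevelTrackerTest {

        @Test
        public void repeatedIdenticalPutShouldNotChangeLastModified() {
            LoggerLevelTracker tracker = new LoggerLevelTracker();

            tracker.setLevel("org.apache.kafka.connect", Level.DEBUG);
            Long firstTimestamp = tracker.lastModified("org.apache.kafka.connect");
            assertNotNull(firstTimestamp);

            // Repeating the exact same request should leave the timestamp untouched
            tracker.setLevel("org.apache.kafka.connect", Level.DEBUG);
            Long secondTimestamp = tracker.lastModified("org.apache.kafka.connect");

            assertEquals(firstTimestamp, secondTimestamp);
        }
    }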
> Thanks again for all of your thoughts!
>
> Cheers,
>
> Chris
>
> On Mon, Sep 4, 2023 at 2:53 AM Federico Valeri <fedeval...@gmail.com> wrote:
>
> > Hi Chris, thanks. This looks like a useful feature.
> >
> > Due to the idempotent nature of PUT, I guess that the last_modified timestamp won't change if the same request is repeated successively. Should we add a unit test for that?
> >
> > On Mon, Sep 4, 2023 at 6:17 AM Ashwin <apan...@confluent.io.invalid> wrote:
> > >
> > > Hi Chris,
> > >
> > > Thanks for thinking about this useful feature! I had a question regarding
> > >
> > > > Since cluster metadata is not required to handle these types of request, they will not be forwarded to the leader
> > >
> > > And later, we also mention supporting more scope types in the future. Don't you foresee a future scope type which may require cluster metadata? In that case, isn't it better to forward the requests to the leader in the initial implementation?
> > >
> > > I would also recommend an additional system test for Standalone herder to ensure that the new scope parameter is honored and the response contains the last modified time.
> > >
> > > Thanks,
> > > Ashwin
> > >
> > > On Sat, Sep 2, 2023 at 5:12 AM Chris Egerton <chr...@aiven.io.invalid> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Can't imagine a worse time to publish a new KIP (it's late on a Friday and we're in the middle of the 3.6.0 release), but I wanted to put forth KIP-976 for discussion:
> > > >
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-976%3A+Cluster-wide+dynamic+log+adjustment+for+Kafka+Connect
> > > >
> > > > TL;DR: The API to dynamically adjust log levels at runtime with Connect is great, and I'd like to augment it with support to adjust log levels for every worker in the cluster (instead of just the worker that receives the REST request).
> > > >
> > > > I look forward to everyone's thoughts, but expect that this will probably take a bump or two once the dust has settled on 3.6.0. Huge thanks to everyone that's contributed to that release so far, especially our release manager Satish!
> > > >
> > > > Cheers,
> > > >
> > > > Chris