On Mon, Sep 16, 2019 at 3:06 PM Chris Egerton <chr...@confluent.io> wrote:

> Hi Randall,
>
> The new default value for the key size configuration will be "null". I've
> clarified this in the KIP. This will still preserve backwards compatibility
> and should not be an issue.
>

Thanks for clarifying this in the KIP.


>
> I understand your point about key generation vs MAC signing algorithms;
> like I said, I'll need to do more research.
>
> I respectfully disagree that a single algorithm list would be easier to
> understand as opposed to a list of accepted algorithms and a signature
> algorithm. Placing special priority on the first element in that list is
> fairly implicit and leaves room for confusion where users might have the
> same algorithms in their lists for that config but in different orders for
> different workers. The group coordination protocol selection behavior isn't
> really a great example since users don't configure that directly
> themselves.
>
> RE the proposed set of new configs: like I stated in my previous response,
> I object to the use of "cluster." as a configuration prefix for any worker
> configs: "most configurations deal with the worker and, transitively, the
> cluster; there doesn't seem to be good enough cause for including that
> prefix for these configurations." I also think this discussion should
> probably continue a little more before we start proposing new configuration
> names, given that it's still undecided which configurations we're going to
> expose.
>
> > I don't think we have any data about how often a follower will be
> fully caught up, and it's possible that a worker's consumer fails to keep
> the worker caught up quickly enough with the configs topic and the new
> session key. So can we really say that a follower making a request with an
> expired key will be rare?
> It would depend on the key rotation interval, but I can't imagine the
> likelihood of this occurring with an interval as low as even 10 minutes
> would be that high. The config topic is low-volume; the consumer for that
> topic isn't going to be flooded with writes and it seems fine to expect
> fairly low latency for the consumer of this topic.
>

My concern is that we're really *assuming* it's not a problem. All I'm
asking for is more clarity on what happens when a worker doesn't know about
the new session key when it makes a request to this REST resource. The KIP
makes it clear that it will be retried and that the existing error message
will be replaced with a debug message, at least for the time being. Perhaps
the KIP paragraph that talks about this can be reworded to make this
clearer, something akin to:

"The leader will only accept requests signed with the most current key.
However, Connect follower workers may routinely experience small delays
when reading the new key. Rather than always logging such task
configuration failure and retry attempts as errors (the current behavior),
Connect's distributed herder will be modified slightly to handle such HTTP
403 responses for this task configuration request by quietly retrying them
with the latest key for up to 1 minute. If failures persist for more than 1
minute, they will be logged as errors."
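
(To make the intended behavior concrete, here's a rough sketch of the
follower-side logic; all names are illustrative, not the actual herder
internals:

    // Quietly retry HTTP 403 rejections for up to one minute, then fall
    // back to the existing error-level logging. Backoff stays at 250 ms.
    // (Assume InterruptedException is handled by the caller.)
    long firstForbidden = -1;
    while (true) {
        try {
            forwardTaskConfigsToLeader(); // re-signs with the latest session key
            break;
        } catch (ForbiddenException e) { // hypothetical wrapper for HTTP 403
            long now = System.currentTimeMillis();
            if (firstForbidden < 0)
                firstForbidden = now;
            if (now - firstForbidden < 60_000)
                log.debug("Task config request rejected; session key may be stale. Retrying.", e);
            else
                log.error("Failed to reconfigure connector's tasks, retrying after backoff:", e);
            Thread.sleep(250);
        }
    }

)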


> > Can we not retry for longer than 1 second if the request fails with HTTP
> 403? I'm concerned what the UX will be if/when this happens, and that the
> user sees a very obtuse and seemingly unrelated error message they won't
> know how to fix.
> To be clear, the KIP doesn't propose any changes to the infinite retry
> logic that's present in the worker today. All that the KIP proposes is that
> an existing error-level message be replaced with a debug-level message if
> it's suspected that a request has failed due to an out-of-date key. With
> that out of the way, sure, we can bump the grace period before beginning to
> emit error-level log messages. I think going as high as a minute might be
> acceptable; we should try to stay fairly low, however, in case the request
> failures are due to some other reason that should be surfaced as soon as
> possible and with some urgency.
>

Ack. See above.

>
> > The text in the "Failure to relay task configurations to leader due to
> incorrect configuration" section is similarly ambiguous. How will a user
> know that this is occurring, and is this really an unlikely situation
> (e.g., "This scenario is unlikely to occur with any normal usage of
> Connect.")? Seems like users unintentionally misconfiguring some of their
> Connect workers is quite common.
> A user will know that this is occurring based on error-level log messages
> emitted by the worker about a failure to relay task ("Failed to reconfigure
> connector's tasks, retrying after backoff:"), plus a stack trace. Yes, this
> situation is unlikely; most worker files will never contain the
> newly-proposed configurations for this KIP since the default values should
> suffice for most cases. If anyone tries to adjust these values, we will
> have documentation available on what their behavior is and what to do to
> ensure their cluster doesn't break while changing them (including the
> rolling upgrade procedure outlined in the KIP). These configurations make
> it no likelier that a user will accidentally break their Connect cluster
> than the group ID, key/value converter, REST extension, and broker
> authentication configs (to name just a few), and since they will be left
> out of any sample worker configurations included in Kafka, the odds of
> someone accidentally messing with them are low enough that it doesn't seem
> worth investing a lot of effort into making it harder for someone to shoot
> themselves in the foot.
> I'll update the KIP to include possible indicators that the cluster has
> been misconfigured, but I don't think this scenario deserves a ton of
> priority.
>
> > Maybe provide a bit more description of what these error messages will
> be.
> I believe this is more of an implementation detail and should be left to PR
> review. KIPs should only be expected to be so fine-grained; proposing
> actual log messages doesn't seem necessary for the overall
> adoption/rejection of the mechanisms proposed here and the iteration cycle
> on GitHub is significantly faster and more efficient than on a mailing
> list.
>
> > Do we need a JMX metric that shows the protocol that each worker is
> configured to use, and whether the workers are using session keys? This
> would be a great way to monitor the cluster's use of this feature.
> I think exposing the connect protocol in use would be useful, sure. You
> could deduce whether workers are using session keys based on this so
> there'd only be a need to expose one additional metric.
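> (For instance, this could be as simple as a single worker-level attribute
> like "connect-protocol" alongside the existing worker metrics; the metric
> group and name here are illustrative and would be spelled out in the KIP.)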
>

Thanks!

Randall


>
> Thanks again for the feedback!
>
> Cheers,
>
> Chris
>
> On Mon, Sep 16, 2019 at 11:58 AM Randall Hauch <rha...@gmail.com> wrote:
>
> > Thanks, Chris. I still have a number of questions and requests for
> > clarification.
> >
> > If we don't provide a default value for the key size, then the following
> > statement in the "Backwards compatibility" subsection is no longer true:
> > "All of the proposed configurations here have default values, making them
> > backwards compatible." IIUC, a user will not be able to upgrade an
> existing
> > Connect cluster without editing their configurations, and this pretty
> much
> > means it is not backwards compatible, which is a non-starter.
> >
> > Regarding the new configurations and our earlier discussion about whether
> > all are required, I'm not convinced that we need separate configs for
> > signing and key generation algorithms. If this is a common need, then the
> > KIP should justify that with an explanation. But as it stands now,
> exposing
> > multiple algorithm properties now means the UX is more complicated and we
> > can't make things simpler in the future. I also think that a single
> > algorithm list where the first is used for signing would be easier to
> > understand (fewer is better) and would be more in line with the way
> > rebalance protocols are chosen by the broker (see
> > https://www.youtube.com/watch?v=MmLezWRI3Ys for a decent explanation).
> If
> > at some point in the future we do want that extra flexibility, then we
> can
> > expose additional properties.
> >
> > I also asked earlier about renaming the config properties so they can be
> > used in other places within the cluster other than the task configs
> > request, which is something I think we'll want to do. If we minimize the
> > number of configs, then how about `cluster.session.algorithms`,
> > `cluster.session.key.size` and `cluster.session.key.ttl.ms`?
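> >
> > For example, a worker file might then contain something like (the values
> > here are purely illustrative, not proposed defaults):
> >
> >     cluster.session.algorithms=HmacSHA256,HmacSHA1
> >     cluster.session.key.size=256
> >     cluster.session.key.ttl.ms=3600000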
> >
> > The "Proposed Changes" section now mentions:
> >
> > The leader will only accept requests signed with the most current key.
> This
> > > should not cause any major problems; if a follower attempts to make a
> > > request with an expired key (which should be quite rare and only occur
> if
> > > the request is made by a follower that is not fully caught up to the
> end
> > of
> > > the config topic), the initial request will fail, but will be
> > subsequently
> > > retried after a backoff period. This backoff period should leave
> > sufficient
> > > room for the rebalance to complete.
> > >
> >
> > I don't think we have any data about how often a follower will be
> fully
> > caught up, and it's possible that a worker's consumer fails to keep the
> > worker caught up quickly enough with the configs topic and the new
> session
> > key. So can we really say that a follower making a request with an
> expired
> > key will be rare?
> >
> > If the first four requests fail with HTTP 403, it will be assumed that
> this
> > > is due to an out-of-date session key; a debug-level message about the
> > > subsequent retry will be logged in place of the current error-level log
> > > message of "Failed to reconfigure connector's tasks, retrying after
> > > backoff: " followed by a stack trace. Since the backoff period is 250
> > > milliseconds, this should give at least one second of leeway for an
> > > outdated key to be updated. If longer than that is required, the usual
> > > error-level log messages will begin to be generated by the worker.
> > >
> >
> > Can we not retry for longer than 1 second if the request fails with HTTP
> > 403? I'm concerned what the UX will be if/when this happens, and that the
> > user sees a very obtuse and seemingly unrelated error message they won't
> > know how to fix.
> >
> > The text in the "Failure to relay task configurations to leader due to
> > incorrect configuration" section is similarly ambiguous. How will a user
> > know that this is occurring, and is this really an unlikely situation
> > (e.g., "This scenario is unlikely to occur with any normal usage of
> > Connect.")? Seems like users unintentionally misconfiguring some of their
> > Connect workers is quite common.
> >
> > Additionally, it will be vital to design appropriate error messages for
> > > this scenario so that users can (hopefully) dig themselves out of that
> > hole
> > > on their own.
> >
> >
> > Maybe provide a bit more description of what these error messages will
> be.
> >
> > Do we need a JMX metric that shows the protocol that each worker is
> > configured to use, and whether the workers are using session keys? This
> > would be a great way to monitor the cluster's use of this feature.
> >
> > Best regards,
> >
> > Randall
> >
> >
> > On Wed, Sep 11, 2019 at 7:00 PM Chris Egerton <chr...@confluent.io>
> wrote:
> >
> > > Hi all,
> > >
> > > I've updated KIP-507 to reflect the changes inspired by Randall's
> recent
> > > feedback. In addition, after some further research, I've decided to
> > remove
> > > the proposed default value for the internal.request.key.size and
> instead,
> > > should no value be provided, rely on the default key size for the given
> > key
> > > algorithm. More detail can be found in the KIP if anyone's interested.
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Tue, Sep 10, 2019 at 3:18 PM Chris Egerton <chr...@confluent.io>
> > wrote:
> > >
> > > > Hi Randall,
> > > >
> > > > Thanks so much for your comprehensive feedback! To avoid creating an
> > > > extremely long response I'll address your comments/questions by
> > > referencing
> > > > the identifiers you've provided for them as opposed to copying them
> and
> > > > responding inline; hopefully things are still readable :)
> > > >
> > > >
> > > > RE new configs:
> > > >
> > > > a) I believe exposing these configurations right now is the correct
> > > > approach. There are two scenarios that we should aim to support:
> > running
> > > > Connect on a bare-bones JVM that doesn't implement anything beyond
> the
> > > > requirements of the JVM specs in terms of key generation and key
> > signing
> > > > algorithms, and running Connect in an environment with hard security
> > > > requirements for, e.g., FIPS compliance. The default values for these
> > > > configurations are intended to satisfy the first use case; however,
> in
> > > > order to enable users to conform with higher security standards (the
> > > second
> > > > use case), some key generation and key signing algorithms that are
> not
> > > > guaranteed to be present on every implementation of the JVM may be
> > > > required. I don't see a way to satisfy both use cases without adding
> > > > configurability.
> > > > With regards to key size: yes, this needs to be configurable as there
> > are
> > > > different ranges of acceptable key size for different key generation
> > > > algorithms; for example, the "AES" algorithm for key generation
> > requires
> > > a
> > > > key size of either 128, 192 or 256 (at least on my local JVM) whereas
> > the
> > > > "DES" algorithm requires a key size of exactly 56.
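> > > > For example, with plain javax.crypto:
> > > >
> > > >     KeyGenerator aes = KeyGenerator.getInstance("AES");
> > > >     aes.init(256); // 128, 192, and 256 are all accepted
> > > >     KeyGenerator des = KeyGenerator.getInstance("DES");
> > > >     des.init(56);  // any other size throws InvalidParameterException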
> > > > As far as enabling different algorithms for key generation vs.
> request
> > > > signing goes... I'm not sure. Local tests involving keys generated
> with
> > > > different algorithms from the mac algorithms used to sign inputs
> (e.g.,
> > > > combining an HmacSHA256 key with an HmacMD5 mac or using a DES key
> with
> > > an
> > > > HmacSHA256 mac) haven't thrown any exceptions yet but this is a
> little
> > > > beyond the boundaries of my current knowledge, so I'll have to get
> back
> > > to
> > > > you on that point. At the very least, your question (and the research
> > > I've
> > > > done so far to try to address it) has highlighted a bug in my current
> > PR
> > > > that'll definitely need to be fixed :)
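> > > > (For reference, the kind of local test I mean, again with javax.crypto;
> > > > this runs without exceptions on my JVM, though I can't yet say whether
> > > > the spec guarantees it:
> > > >
> > > >     SecretKey key = KeyGenerator.getInstance("HmacSHA256").generateKey();
> > > >     Mac mac = Mac.getInstance("HmacMD5");
> > > >     mac.init(key); // key and mac algorithms intentionally differ
> > > >     byte[] signature = mac.doFinal("some request body".getBytes());
> > > > )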
> > > >
> > > > b) The riskiest scenario is if a follower worker is configured to
> use a
> > > > request signing algorithm that isn't allowed by the leader. In this
> > case,
> > > > any failure will only occur if/when that follower starts up a
> connector
> > > and
> > > > then has to forward tasks for that connector to the leader, which may
> > not
> > > > happen immediately. Once that failure occurs, an endless failure loop
> > > will
> > > > occur wherein the follower endlessly retries to send those task
> > > > configurations to the leader and pauses by the backoff interval in
> > > between
> > > > each failed request.
> > > > There are two ways to rectify this situation; either shut down the
> > > > follower and restart it after editing its configuration to use a
> > request
> > > > signing algorithm permitted by the leader, or shut down all other
> > workers
> > > > in the cluster that do not permit the request signing algorithm used
> by
> > > the
> > > > follower, reconfigure them to permit it, and then restart them.
> > > > This scenario is unlikely to occur with any normal usage of Connect,
> > but
> > > > the circumstances leading to it will need to be called out very
> > carefully
> > > > in order to avoid putting the cluster into a bad state and flooding
> > logs
> > > > with scary-looking error messages. Speaking of, it'll be vital to
> > design
> > > > appropriate error messages for this scenario so that users can
> > > (hopefully)
> > > > dig themselves out of that hole on their own.
> > > > Differing values for any of the other configs shouldn't actually be
> too
> > > > problematic, given that the three remaining configs all deal with key
> > > > generation/rotation, which is only handled by one worker at a time
> (the
> > > > leader).
> > > >
> > > > c) Good call. Yes, these configs are all "low" importance and I'll
> > update
> > > > the KIP accordingly to reflect that.
> > > >
> > > > d) No, there is no window when multiple keys are valid. This is
> > > explicitly
> > > > called out in the KIP at the end of the "Proposed Changes" section:
> > > > > The leader will only accept requests signed with the most current
> > key.
> > > > This should not cause any major problems; if a follower attempts to
> > make
> > > a
> > > > request with an expired key (which should be quite rare and only
> occur
> > if
> > > > the request is made by a follower that is not fully caught up to the
> > end
> > > of
> > > > the config topic), the initial request will fail, but will be
> > > subsequently
> > > > retried after a backoff period.
> > > >
> > > > e) I'll update the KIP to reflect the fact that, barring any need for
> > > > heightened levels of security compliance, none of these
> configurations
> > > > should need to be altered and the defaults should suffice.
> > > >
> > > > f) As mentioned above, a larger key size isn't necessarily possible
> for
> > > > all key generation algorithms. I believe these defaults are the best
> we
> > > can
> > > > work with until/unless other algorithms are added to the JVM spec.
> > > >
> > > > g) I don't believe there's a lot of precedent for configuration
> > > properties
> > > > like this in AK. The closest I could find was the
> "password.encoder.*"
> > > > broker properties, but those deal with storing and encrypting
> > passwords,
> > > as
> > > > opposed to generating keys and signing requests with them. I'm
> leaning
> > > > towards the stance that it's not worth calling out this lack of
> > precedent
> > > > in the KIP itself but if you or others feel differently I'll go ahead
> > and
> > > > add a small section on it.
> > > >
> > > > h) These configs were (are) definitely targeted toward validating
> > > internal
> > > > REST traffic, although in theory I suppose they could be used to
> > validate
> > > > any inter-worker communication that can be reduced to a byte array
> for
> > > > signing. I think that the mechanism proposed here should be limited
> to
> > > use
> > > > for internal communications only, so the "internal." prefix still
> seems
> > > > appropriate. Using "cluster." in conjunction with that
> > > > ("internal.cluster.key.generation.algorithm", for example), doesn't
> > feel
> > > > quite right as it might raise the question of what an "internal
> > cluster"
> > > > is. Reversing the order ("cluster.internal.key.generation.algorithm")
> > > seems
> > > > redundant; most configurations deal with the worker and,
> transitively,
> > > the
> > > > cluster; there doesn't seem to be good enough cause for including
> that
> > > > prefix for these configurations.
> > > > Two alternatives I can think of are "internal.communication." and
> > > > "inter.worker.", but I'm open to suggestions. I'll leave the configs
> > > as-are
> > > > pending further discussion; these things tend to change rapidly as
> new
> > > > minds join the conversation.
> > > >
> > > > i) Yeah, some startup analysis to make sure that the list of
> permitted
> > > > verification algorithms includes the signature algorithm is probably
> > > > warranted. However, an approach of taking the first permitted
> algorithm
> > > and
> > > > using that for verification doesn't really buy us much in terms of
> > > > preventing misconfiguration errors from occurring; the biggest
> problem
> > > > would be if these configurations weren't aligned across workers
> (i.e.,
> > if
> > > > the signature algorithm on a follower wasn't included in the leader's
> > > list
> > > > of permitted algorithms) and that still wouldn't be caught at
> startup.
> > > > I'd advocate for keeping the two configurations separate, carefully
> > > > documenting them, and enabling a startup check that forces at least
> > that
> > > > worker to be consistent with its own configs, but I think morphing
> > those
> > > > two configs into one is a little too implicit and unintuitive to be
> > worth
> > > > it.
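> > > > (The startup check itself would be trivial, something along the lines of:
> > > >
> > > >     if (!verificationAlgorithms.contains(signatureAlgorithm))
> > > >         throw new ConfigException("The request signature algorithm must be "
> > > >             + "present in the list of verification algorithms");
> > > >
> > > > using whatever the final config names turn out to be.)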
> > > >
> > > >
> > > > RE error log messages caused by not-yet-updated workers reading
> session
> > > > keys from the config topic: in the most common upgrade scenario, this
> > > will
> > > > never happen. No session key will be distributed by the leader until
> > > > internal request verification is enabled, which will only happen once
> > all
> > > > workers have indicated that they support the new "sessioned"
> > subprotocol
> > > > during group coordination. The only circumstance that would lead to
> > that
> > > > kind of error message would be if a new worker joins a cluster that
> has
> > > > already been fully upgraded to use the new "sessioned" subprotocol,
> but
> > > for
> > > > some reason this new worker is running an older version of the
> Connect
> > > > framework. In this case, the error message is probably warranted;
> it's
> > > > likely that something's going wrong here.
> > > >
> > > > RE potential error log messages due to stale keys: Mmm, good point.
> My
> > > > first instinct for rectifying this would be to differentiate the
> > various
> > > > causes of rejection for a request to this endpoint via HTTP status;
> 400
> > > > (bad request) for requests that either lack the required signature
> and
> > > > signature algorithm headers, specify an invalid
> (non-base-64-decodable)
> > > > signature, or specify a signature algorithm that isn't permitted by
> the
> > > > leader, and 403 (forbidden) for requests that contain well-formed
> > values
> > > > for the signature and signature algorithm headers, but which fail
> > request
> > > > verification. A follower can catch that 403 and, instead of logging a
> > > > scary-looking ERROR level message, retry a limited number of times
> with
> > > > something INFO level before raising the stakes and making the error
> > > message
> > > > look scarier. I'll add this to the KIP but if you or anyone has other
> > > > thoughts I'm happy to hear them.
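> > > > (Roughly, with the helper names here purely illustrative:
> > > >
> > > >     if (signature == null || algorithm == null
> > > >             || !base64Decodable(signature) || !permitted(algorithm))
> > > >         return 400; // malformed or disallowed signature metadata
> > > >     if (!verify(sessionKey, algorithm, requestBody, signature))
> > > >         return 403; // well-formed, but fails verification
> > > > )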
> > > >
> > > > RE advertising the availability/current usage of internal request
> > > signing:
> > > > I think the primary mechanism that should be used to communicate this
> > to
> > > > the user is the worker log. Theoretically, you can just run `curl -H
> > > > 'Content-Type: application/json' -d '[]' http://<connect
> > > > worker>/connectors/anything/tasks` and if the feature is enabled, you'll get a 400
> 400
> > > and
> > > > if it's disabled you'll either get a 404 or a 200 (if a connector
> with
> > > that
> > > > name actually exists). However, nobody wants to do that :) So
> instead,
> > I
> > > > think the following log messages would be appropriate for the
> following
> > > > scenarios:
> > > > 1) A worker is started up with connect.protocol set to something
> other
> > > > than "sessioned" - a WARN level log message is emitted explaining
> that
> > > the
> > > > Connect cluster will be vulnerable because internal request signing
> is
> > > > disabled, and providing information on how to enable request signing
> if
> > > > desired.
> > > > 2) A worker with connect.protocol set to "sessioned" (or, in the
> > future,
> > > > any protocol that enables internal request signing) joins the cluster
> > > > group, and it is observed that the subprotocol used by the cluster
> does
> > > not
> > > > indicate support for internal request signing - a WARN level log
> > message
> > > is
> > > > emitted explaining that even though this worker supports internal
> > request
> > > > signing, one or more workers in the cluster do not support this
> > behavior
> > > > and therefore it will be disabled; information should also be
> provided
> > on
> > > > how to identify which workers these are and how to enable request
> > signing
> > > > with them if desired.
> > > > I don't think an error is appropriate to log in either of these
> cases,
> > > > since it's possible that an intentional downgrade might be necessary.
> > > I'll
> > > > add this information (including proposed log message scenarios) to
> the
> > > KIP.
> > > >
> > > > RE upgrade UX: I think I've covered the reasoning behind exposing
> these
> > > > configurations even though the defaults are also reasonable. I can
> call
> > > out
> > > > the automatic enabling of request signature verification and
> highlight
> > > the
> > > > (hopefully seamless) standard upgrade path of a single rolling
> upgrade
> > in
> > > > the KIP.
> > > >
> > > > RE rejected alternatives: Yep, good point. Will add to the KIP.
> > > >
> > > >
> > > > Thanks again for all the feedback! Looking forward to hearing more.
> > > >
> > > > Cheers,
> > > >
> > > > Chris
> > > >
> > > > On Mon, Sep 9, 2019 at 4:49 PM Randall Hauch <rha...@gmail.com>
> wrote:
> > > >
> > > >> Thanks for putting this KIP together, Chris. It's thorough and well
> > > >> thought
> > > >> out, and you've done a great job responding to comments. It is
> indeed
> > > >> going
> > > >> to be nice to harden the REST API a bit more.
> > > >>
> > > >> I do have a few questions/concerns/comments, all of which I think
> can
> > be
> > > >> incorporated relatively easily.
> > > >>
> > > >> First, regarding the new Connect worker configs mentioned in the
> > "Public
> > > >> Interfaces" section:
> > > >> a. Do we need to expose all of these, or could Connect just
> internally
> > > use
> > > >> constants and only expose them in the future if necessary? Do users
> > really
> > > >> need to explicitly set any of these, or will most users simply use
> the
> > > >> defaults? Do we need to specify a different algorithm for
> generating a
> > > >> session key versus signing a request? Does the key size need to be
> > > >> configured, or can we use a large enough value and grow this in the
> > > future
> > > >> with patch releases?
> > > >> b. What happens if one worker has different values for any of these
> > new
> > > >> worker configs than the other workers? If there is an error, when
> will
> > > >> that
> > > >> error occur (before startup happens, only when that worker attempts
> to
> > > use
> > > >> the internal REST endpoint, etc.) and will the error be clear enough
> > that the
> > > >> user knows how to fix the problem?
> > > >> c. The section should mention the importance of each new
> configuration
> > > >> property. I suspect these will be "low" importance, since the
> defaults
> > > are
> > > >> likely sufficient for most users.
> > > >> d. Is there a window when both the new session key and the prior
> > session
> > > >> key are still valid? If so, what is that window?
> > > >> e. Can the KIP emphasize more that the default values are likely to
> be
> > > >> sufficient for most users?
> > > >> f. Are all of the defaults sufficient for the foreseeable future?
> For
> > > >> example, what is the cost of using a larger session key size?
> > > >> g. Please mention whether these config properties follow or do not
> > > follow
> > > >> existing precedent within AK.
> > > >> h. Might these settings be used to validate other kinds of
> > intra-cluster
> > > >> communication and messages? If so, the "internal.request" prefix
> might
> > > >> make
> > > >> that difficult. Was there any thought about using a bit more generic
> > > >> prefix, such as "cluster."?
> > > >> i. The `internal.request.verification.algorithms` and
> > > >> `internal.request.signature.algorithm` are dependent upon each
> other,
> > > and
> > > >> it seems like it's possible for the "verification" algorithms to not
> > > >> include the "signature" algorithm. We could catch cases like that
> > > (though
> > > >> not with simple Validator implementations since they are specific
> to a
> > > >> single property value), but would it be better to have a single list
> > and
> > > >> to
> > > >> use the first item in the list as the signature algorithm?
> > > >>
> > > >> Second, the KIP proposes to write the session key to the existing
> > config
> > > >> topic using record(s) with a new key. Konstantine was correct when
> he
> > > said
> > > >> earlier in the thread:
> > > >>
> > > >>
> > > >> > Indeed, broadcasting the session key to the followers is
> necessary.
> > > But
> > > >> > adding it to the configs topic with a new key is compatible with
> > > >> previous
> > > >> > versions (the key will be ignored by workers that don't support
> this
> > > >> > protocol) and works since all workers will need to read this topic
> > to
> > > >> the
> > > >> > end to remain in the group.
> > > >> >
> > > >>
> > > >> It is true that the KafkaConfigBackingStore will not act upon
> config
> > > >> records with keys that do not match the known pattern, but such
> > > >> messages do result in an ERROR log message. I'm concerned that
> > operators
> > > >> will be alarmed (both emotionally and technically) if they begin to
> > > >> upgrade
> > > >> a Connect cluster and start seeing ERROR log messages. We could
> change
> > > >> these error messages to be warnings, but that will take effect only
> > > after
> > > >> a
> > > >> worker is upgraded and restarted, and will not affect the
> > > >> yet-to-be-upgraded workers in the Connect cluster. What can we do to
> > > avoid
> > > >> these ERROR log messages? Should the default be to not change
> > > >> `connect.protocol`, allow the user to upgrade all clusters, and then
> > to
> > > >> change `connect.protocol=sessioned` with a (second) rolling upgrade?
> > Or,
> > > >> if
> > > >> we decide to rely upon an upgrade recipe to avoid these, then we
> > should
> > > >> document that recipe here.
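> > > >> (i.e., something like: first, a rolling upgrade of every worker to the
> > > >> new AK version with connect.protocol left at its old value; then, once
> > > >> all workers are upgraded, a second rolling restart with
> > > >> connect.protocol=sessioned.)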
> > > >>
> > > >> Third, I think the KIP should address the potential for seeing one
> or
> > > more
> > > >> "Failed to reconfigure connector's tasks, retrying after backoff"
> > error
> > > >> message during a key rotation. It would be good to eliminate the
> > > >> condition,
> > > >> maybe by returning a response indicating authorization failure and
> > > >> signifying a key rotation is required, and to prevent an ERROR log
> > > >> message.
> > > >> I don't think separate INFO level log messages saying to disregard
> > > earlier
> > > >> ERROR log messages is sufficient.
> > > >>
> > > >> Fourth, how will an operator of a Connect cluster know whether this
> > > >> internal endpoint is protected by authorization via this feature?
> And
> > > how
> > > >> can an operator know which Connect workers are preventing a cluster
> > from
> > > >> enabling this feature? Should a warning or error be logged if this
> > > feature
> > > >> is disabled after being enabled?
> > > >>
> > > >> Finally, I have a few nits:
> > > >>
> > > >>    1. The "Backwards compatibility" section should be more focused
> on
> > > the
> > > >>    upgrade UX. "All of the new worker configurations have sensible
> > > >> defaults,
> > > >>    and most users can simply upgrade without needing to override
> > them."
> > > >> (BTW,
> > > >>    if this is true, then IMO that adds weight to only exposing a
> > minimum
> > > >>    number of new worker configs.)
> > > >>    2. The rejected alternatives should include the earlier approach
> of
> > > >>    sharing the session keys over the Connect subprotocol, with pros
> > and
> > > >> cons.
> > > >>
> > > >> Overall, very nice work, Chris.
> > > >>
> > > >> Best regards,
> > > >>
> > > >> Randall
> > > >>
> > > >> On Fri, Sep 6, 2019 at 4:49 PM Chris Egerton <chr...@confluent.io>
> > > wrote:
> > > >>
> > > >> > Hi all,
> > > >> >
> > > >> > I've published a draft PR implementing the changes currently
> > proposed
> > > in
> > > >> > KIP-507, which can be found at
> > > >> https://github.com/apache/kafka/pull/7310.
> > > >> > Thanks for all of the review and feedback so far! Given the lack
> of
> > > any
> > > >> > major voiced reservations so far, if I don't hear anything else
> over
> > > the
> > > >> > course of the next few days or so I'll be opening this for a vote.
> > > >> >
> > > >> > Cheers,
> > > >> >
> > > >> > Chris
> > > >> >
> > > >> >
> > > >> > On Wed, Aug 28, 2019 at 12:21 PM Chris Egerton <
> chr...@confluent.io
> > >
> > > >> > wrote:
> > > >> >
> > > >> > > Hi all,
> > > >> > >
> > > >> > > Wow, thanks for all the feedback! Happy to see this proposal
> > getting
> > > >> some
> > > >> > > love :)
> > > >> > >
> > > >> > >
> > > >> > > RE Konstantine's comments:
> > > >> > >
> > > >> > > > I've read the discussions regarding why the rebalancing
> protocol
> > > is
> > > >> > used
> > > >> > > here and your intention to follow the approach which was
> recently
> > > >> used in
> > > >> > > order to elegantly support upgrades without requiring rolling
> > > restarts
> > > >> > > makes sense.
> > > >> > > > However, I'd like to revisit the proposed extension of the
> > connect
> > > >> > > protocol
> > > >> > > going a little bit more in detail.
> > > >> > > Indeed, broadcasting the session key to the followers is
> > necessary.
> > > >> But
> > > >> > > adding it to the configs topic with a new key is compatible with
> > > >> previous
> > > >> > > versions (the key will be ignored by workers that don't support
> > this
> > > >> > > protocol) and works since all workers will need to read this
> topic
> > > to
> > > >> the
> > > >> > > end to remain in the group.
> > > >> > > > Additionally, to your point about consensus, meaning how the
> > > workers
> > > >> > all
> > > >> > > know during the same generation that the "sessioned" version of
> > the
> > > >> > > protocol was chosen by the leader, it seems to me that an
> explicit
> > > >> > upgrade
> > > >> > > of the protocol on the wire, on its name and on the metadata
> that
> > > are
> > > >> > sent
> > > >> > > with the join request by each worker might not be required.
> What I
> > > >> > believe
> > > >> > > would suffice is merely an increment of the version of the
> > protocol,
> > > >> > > because all the other information remains the same (for example,
> > the
> > > >> list
> > > >> > > of assigned and revoked tasks, etc). This would allow us not to
> > > extend
> > > >> > the
> > > >> > > metadata even more with yet another JoinGroupRequest and could
> > > achieve
> > > >> > what
> > > >> > > you want in terms of an orchestrated upgrade.
> > > >> > >
> > > >> > > I really like this idea, thanks for suggesting it. A simple bump
> > of
> > > >> the
> > > >> > > protocol version does seem sufficient to achieve consensus among
> > > >> workers
> > > >> > > and would significantly reduce the implementation complexity of
> an
> > > >> > entirely
> > > >> > > separate protocol just for the sake of distributing keys, and
> > using
> > > >> the
> > > >> > > config topic to relay keys instead of the group coordination
> > > protocol
> > > >> > would
> > > >> > > prevent rebalances solely for the sake of key rotation (instead,
> > the
> > > >> > leader
> > > >> > > can just periodically write a new key to the config topic). I'll
> > > >> update
> > > >> > the
> > > >> > > KIP by EOD today with these changes.
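> > > >> > > (Concretely, I'm picturing the leader periodically publishing a config
> > > >> > > topic record along the lines of:
> > > >> > >
> > > >> > >     key:   "session-key"   (record key name TBD)
> > > >> > >     value: { "key": <base64-encoded secret>, "algorithm": "HmacSHA256" }
> > > >> > >
> > > >> > > with the exact record format to be spelled out in the KIP.)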
> > > >> > >
> > > >> > > > All the leader worker would have to check is whether all the
> > > >> > assignments
> > > >> > > that it took were in "sessioned" version (possibly
> > > >> CONNECT_PROTOCOL_V2=2
> > > >> > if
> > > >> > > you'd choose to go this route).
> > > >> > >
> > > >> > > I like this idea for dictating whether the leader requires
> request
> > > >> > > signing, SGTM. I think we can have all workers use the latest
> > > session
> > > >> key
> > > >> > > present in the config topic (if there is one) to sign their
> > > requests;
> > > >> > since
> > > >> > > the signatures are passed via header, this shouldn't cause any
> > > >> problems
> > > >> > if
> > > >> > > a follower signs a request that gets sent to a leader that isn't
> > > >> > performing
> > > >> > > request validation.
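> > > >> > > (For illustration, the signature could travel as HTTP headers along
> > > >> > > the lines of:
> > > >> > >
> > > >> > >     X-Connect-Authorization: <base64-encoded HMAC of the request body>
> > > >> > >     X-Connect-Request-Signature-Algorithm: HmacSHA256
> > > >> > >
> > > >> > > with the exact header names to be pinned down in the KIP.)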
> > > >> > >
> > > >> > >
> > > >> > > RE Magesh's comments:
> > > >> > >
> > > >> > > > Thanks a lot for the KIP. This will certainly be a useful
> > > feature. I
> > > >> > > would
> > > >> > > have preferred to use the topic approach as well but I also
> > > understand
> > > >> > your
> > > >> > > point of view about the operational complexity for upgrades. If
> > not
> > > >> with
> > > >> > > this KIP, I would certainly want to go that route at some point
> in
> > > the
> > > >> > > future.
> > > >> > >
> > > >> > > I'm actually a little conflicted on this front. If this KIP
> passes
> > > >> and we
> > > >> > > have a secure way to perform quick, synchronous,
> > follower-to-leader
> > > >> > > communication, it may be better to hold onto and even perhaps
> > expand
> > > >> on
> > > >> > in
> > > >> > > case this functionality comes in handy later. Additionally,
> there
> > > >> would
> > > >> > > have to be a lot of thought put into the semantics of using
> > > >> topic-based
> > > >> > > communication for relaying task configs to the leader due to its
> > > >> > > asynchronous nature; I'm not convinced it's not worth it, but
> I'm
> > > also
> > > >> > not
> > > >> > > entirely convinced it is.
> > > >> > >
> > > >> > > > As far as using the rebalance protocol goes, it would be great
> > if
> > > >> you
> > > >> > > could
> > > >> > > elaborate on what exactly would be the rebalance impact when a
> key
> > > >> > expires.
> > > >> > > I see that you have called out saying that there should be no
> > > >> significant
> > > >> > > impact but it will be great to explicitly state as what is to be
> > > >> > expected.
> > > >> > > I would prefer to not have any sorts of rebalancing when this
> > > happens
> > > >> > since
> > > >> > > the connector and task assignments should not change with this
> > > event.
> > > >> It
> > > >> > > will be useful to explain this a bit more.
> > > >> > >
> > > >> > > Given Konstantine's comments, this question isn't strictly
> > relevant
> > > >> > > anymore (unless we decide to return to the approach of
> > distributing
> > > >> > session
> > > >> > > keys during rebalance), but just for completeness it does seem
> > worth
> > > >> > > addressing. The "rebalances" proposed for key rotation would be
> > > >> achieved
> > > >> > by
> > > >> > > a request from the leader to rejoin the group. This would
> > > immediately
> > > >> > > result in the leader being asked to assign connectors and tasks
> to
> > > the
> > > >> > > group, which would exactly match the existing assignments, so
> with
> > > >> that
> > > >> > and
> > > >> > > the help of the new incremental cooperative rebalance semantics
> > > there
> > > >> > would
> > > >> > > be zero downtime for connectors or tasks. Truly, "rebalance" is
> a
> > > bit
> > > >> of
> > > >> > a
> > > >> > > misnomer here; we're really talking about performing group
> > > assignment
> > > >> > > (which doesn't necessarily imply adding or removing
> > connectors/tasks
> > > >> > > to/from any workers), but as far as I've seen the two are
> > basically
> > > >> used
> > > >> > > synonymously w/r/t Connect so I elected to use the more common,
> > > albeit
> > > >> > > slightly less accurate, term.
> > > >> > >
> > > >> > >
> > > >> > > RE Greg's comments:
> > > >> > >
> > > >> > > > Does this roll-out ratchet the protocol version irreversibly?
> > > >> > >
> > > >> > > Nope, the rollout will be achieved by bumping the protocol (or,
> > > after
> > > >> > > adjusting for Konstantine's suggestion, just the protocol
> version)
> > > for
> > > >> > each
> > > >> > > worker, but the Kafka group coordinator will only begin using
> that
> > > new
> > > >> > > protocol if every worker indicates that that protocol is
> supported
> > > >> when
> > > >> > it
> > > >> > > joins the group. So, if a cluster existed with workers that all
> > > >> supported
> > > >> > > protocols 1 and 2 and indicated a preference for protocol 2, the
> > > >> cluster
> > > >> > > would use protocol 2. However, if a new worker that only
> supported
> > > >> > protocol
> > > >> > > 1 joined the cluster, the entire cluster would be bumped back
> down
> > > to
> > > >> > > protocol 1. If/when that worker leaves or indicates it supports
> > > >> protocol
> > > >> > 2,
> > > >> > > the cluster would bump back up to protocol 2.
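> > > >> > > (This is standard Kafka group coordination: each member advertises the
> > > >> > > protocols it supports when it joins, and the coordinator only selects a
> > > >> > > protocol that every member supports, so a single older worker pins the
> > > >> > > whole group to the older protocol.)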
> > > >> > >
> > > >> > > >What is the expected behavior when the feature has been
> enabled,
> > > and
> > > >> a
> > > >> > > worker joins the
> > > >> > > group while advertising an old protocol?
> > > >> > >
> > > >> > > I partially answered this above; the entire cluster would
> > downgrade
> > > >> its
> > > >> > > protocol version. The practical effect of this would be that
> > request
> > > >> > > signature verification by the leader would immediately cease.
> > > >> > >
> > > >> > > > I think it's reasonable to prevent single-worker downgrades,
> > since
> > > >> this
> > > >> > > is an attack vector.
> > > >> > > Actually, this is fine. Groups are a protected resource in
> Kafka;
> > if
> > > >> you
> > > >> > > want to ensure this doesn't happen, just secure your Kafka
> broker
> > > and
> > > >> use
> > > >> > > something like ACLs to make sure only authorized entities can
> join
> > > >> your
> > > >> > > group.
> > > >> > >
> > > >> > > > But what happens if every worker in a cluster goes down, and
> at
> > > >> least
> > > >> > > the first worker comes
> > > >> > > back with an old protocol? This is a potential attack that
> > requires
> > > >> > access
> > > >> > > to the workers, but could also be used by customers to
> > intentionally
> > > >> > > downgrade their cluster.
> > > >> > >
> > > >> > > This kind of attack actually doesn't require access to any of
> the
> > > >> Connect
> > > >> > > workers; group coordination is managed by the Kafka cluster. So,
> > > even
> > > >> if
> > > >> > > your entire Connect cluster crashes and then a malicious user
> > tries
> > > to
> > > >> > join
> > > >> > > the group, as long as your Kafka broker has restricted access to
> > the
> > > >> > group
> > > >> > > used by your Connect cluster, that attack would still fail.
> > > >> > >
> > > >> > > > You mentioned that this security improvement only adds
> security
> > > when
> > > >> > > there
> > > >> > > are a number of existing security changes in place, one of these
> > > being
> > > >> > ACLs
> > > >> > > on the Kafka broker. Do you plan to gate this feature roll-out
> on
> > > >> > detecting
> > > >> > > that those prerequisite features are enabled?
> > > >> > >
> > > >> > > I wasn't planning on it. With the pluggable nature of Kafka and
> > > Kafka
> > > >> > > Connect authorization, I imagine it'd be pretty difficult to
> know
> > if
> > > >> all
> > > >> > of
> > > >> > > the important resources (worker group, internal topics, and
> Kafka
> > > >> Connect
> > > >> > > REST API are a few that come to mind) are actually secured.
> > > >> Additionally,
> > > >> > > the presence of other attack vectors shouldn't be justification
> > for
> > > >> > opening
> > > >> > > up a new one; it seems particularly dangerous in the event that
> > > someone
> > > >> > > accidentally misconfigures their Connect cluster and this
> endpoint
> > > >> ends
> > > >> > up
> > > >> > > being exposed even though the user believes it to be protected.
> > > >> > >
> > > >> > > > If not, I think this feature adds no additional security to
> > > >> unsecured
> > > >> > > clusters, while still incurring
> > > >> > > the overhead on signing requests.
> > > >> > >
> > > >> > > This seems like a bit of an overstatement. The changes wouldn't
> > add
> > > no
> > > >> > > additional security; I think restricting even accidental access
> to
> > > the
> > > >> > > internal REST endpoint is a good thing since it's not part of
> the
> > > >> public
> > > >> > > API and it's all but guaranteed that if someone makes a request
> to
> > > >> that
> > > >> > > endpoint they're either confused or malicious. Adding another
> > hurdle
> > > >> in
> > > >> > the
> > > >> > > way for malicious users is also beneficial, even if it's not a
> > > >> > > comprehensive solution.
> > > >> > >
> > > >> > > > Do you know how significant the overhead will be, as a rough
> > > >> estimate?
> > > >> > >
> > > >> > > Not very significant. The overhead would only be incurred when
> > > >> followers
> > > >> > > need to relay task configurations to the leader, which shouldn't
> > > >> happen
> > > >> > > very frequently in any stable connector. If that is happening
> > > >> frequently,
> > > >> > > the overhead for starting new tasks or stopping/reconfiguring
> > > existing
> > > >> > > tasks would probably be orders of magnitude higher anyways.
> > > >> > >
> > > >> > >
> > > >> > > Thanks again to Magesh, Konstantine, and Greg for your feedback.
> > > I'll
> > > >> be
> > > >> > > updating the KIP to reflect changes mentioned here and to
> include
> > > >> answers
> > > >> > > to some of the questions raised; looking forward to the next
> round
> > > of
> > > >> > > comments.
> > > >> > >
> > > >> > > Cheers,
> > > >> > >
> > > >> > > Chris
> > > >> > >
> > > >> > > On Wed, Aug 28, 2019 at 9:02 AM Greg Harris <gr...@confluent.io
> >
> > > >> wrote:
> > > >> > >
> > > >> > >> Hey Chris,
> > > >> > >>
> > > >> > >> The KIP makes sense to me, and I think it's a very natural way
> to
> > > >> solve
> > > >> > >> the
> > > >> > >> issue at hand. I had two questions about the automatic rollout
> > > logic.
> > > >> > >>
> > > >> > >> Does this roll-out ratchet the protocol version irreversibly?
> > What
> > > is
> > > >> > the
> > > >> > >> expected behavior when the feature has been enabled, and a
> worker
> > > >> joins
> > > >> > >> the
> > > >> > >> group while advertising an old protocol? I think it's
> reasonable
> > to
> > > >> > >> prevent
> > > >> > >> single-worker downgrades, since this is an attack vector. But
> > what
> > > >> > happens
> > > >> > >> if every worker in a cluster goes down, and at least the first
> > > worker
> > > >> > >> comes
> > > >> > >> back with an old protocol? This is a potential attack that
> > requires
> > > >> > access
> > > >> > >> to the workers, but could also be used by customers to
> > > intentionally
> > > >> > >> downgrade their cluster.
> > > >> > >>
> > > >> > >> You mentioned that this security improvement only adds security
> > > when
> > > >> > there
> > > >> > >> are a number of existing security changes in place, one of
> these
> > > >> being
> > > >> > >> ACLs
> > > >> > >> on the Kafka broker. Do you plan to gate this feature roll-out
> on
> > > >> > >> detecting
> > > >> > >> that those prerequisite features are enabled? If not, I think
> > this
> > > >> > feature
> > > >> > >> adds no additional security to unsecured clusters, while still
> > > >> incurring
> > > >> > >> the overhead on signing requests. Do you know how significant
> the
> > > >> > overhead
> > > >> > >> will be, as a rough estimate?
> > > >> > >>
> > > >> > >> Looks great!
> > > >> > >>
> > > >> > >> Thanks,
> > > >> > >> Greg
> > > >> > >>
> > > >> > >>
> > > >> > >> On Tue, Aug 27, 2019 at 8:38 PM Konstantine Karantasis <
> > > >> > >> konstant...@confluent.io> wrote:
> > > >> > >>
> > > >> > >> > Thanks for the KIP Chris!
> > > >> > >> >
> > > >> > >> > This proposal will fill a gap in Kafka Connect deployments
> and
> > > will
> > > >> > >> > strengthen their production readiness in even more diverse
> > > >> > >> environments, in
> > > >> > >> > which current workarounds are less practical.
> > > >> > >> >
> > > >> > >> > I've read the discussions regarding why the rebalancing
> > protocol
> > > is
> > > >> > used
> > > >> > >> > here and your intention to follow the approach which was
> > recently
> > > >> used
> > > >> > >> in
> > > >> > >> > order to elegantly support upgrades without requiring rolling
> > > >> restarts
> > > >> > >> > makes sense.
> > > >> > >> >
> > > >> > >> > However, I'd like to revisit the proposed extension of the
> > > connect
> > > >> > >> protocol
> > > >> > >> > going a little bit more in detail.
> > > >> > >> > Indeed, broadcasting the session key to the followers is
> > > necessary.
> > > >> > But
> > > >> > >> > adding it to the configs topic with a new key is compatible
> > with
> > > >> > >> previous
> > > >> > >> > versions (the key will be ignored by workers that don't
> support
> > > >> this
> > > >> > >> > protocol) and works since all workers will need to read this
> > > topic
> > > >> to
> > > >> > >> the
> > > >> > >> > end to remain in the group.
> > > >> > >> >
> > > >> > >> > Additionally, to your point about consensus, meaning how the
> > > >> workers
> > > >> > all
> > > >> > >> > know during the same generation that the "sessioned" version
> of
> > > the
> > > >> > >> > protocol was chosen by the leader, it seems to me that an
> > > explicit
> > > >> > >> upgrade
> > > >> > >> > of the protocol on the wire, on its name and on the metadata
> > that
> > > >> are
> > > >> > >> sent
> > > >> > >> > with the join request by each worker might not be required.
> > What
> > > I
> > > >> > >> believe
> > > >> > >> > would suffice is merely an increment of the version of the
> > > >> protocol,
> > > >> > >> > because all the other information remains the same (for
> > example,
> > > >> the
> > > >> > >> list
> > > >> > >> > of assigned and revoked tasks, etc). This would allow us not
> to
> > > >> extend
> > > >> > >> the
> > > >> > >> > metadata even more with yet another JoinGroupRequest and
> could
> > > >> achieve
> > > >> > >> what
> > > >> > >> > you want in terms of an orchestrated upgrade.
> > > >> > >> >
> > > >> > >> > Without having exhaustively checked the various code paths, I
> > > feel
> > > >> > that
> > > >> > >> > such an approach would be less heavy-handed in terms of
> > extending
> > > >> the
> > > >> > >> > format of the connect protocol and would achieve a certain
> > > >> separation
> > > >> > of
> > > >> > >> > concerns between the information that is required for
> > rebalancing
> > > >> and
> > > >> > is
> > > >> > >> > included in the protocol metadata and the information that is
> > > >> required
> > > >> > >> for
> > > >> > >> > securing the internal Connect REST endpoint. In theory, this
> > > method
> > > >> > >> could
> > > >> > >> > be used to even secure the eager version of the protocol, but
> > I'd
> > > >> > agree
> > > >> > >> > that this out of the scope of the current proposal.
> > > >> > >> >
> > > >> > >> > All the leader worker would have to check is whether all the
> > > >> > assignments
> > > >> > >> > that it took were in "sessioned" version (possibly
> > > >> > >> CONNECT_PROTOCOL_V2=2 if
> > > >> > >> > you'd choose to go this route).
> > > >> > >> >
> > > >> > >> > Overall, this does not differ a lot from your current
> proposal,
> > > >> which
> > > >> > I
> > > >> > >> > think is already in the right direction.
> > > >> > >> > Let me know what you think.
> > > >> > >> >
> > > >> > >> > Cheers,
> > > >> > >> > Konstantine
> > > >> > >> >
> > > >> > >> >
> > > >> > >> > On Tue, Aug 27, 2019 at 4:19 PM Magesh Nandakumar <
> > > >> > mage...@confluent.io
> > > >> > >> >
> > > >> > >> > wrote:
> > > >> > >> >
> > > >> > >> > > Hi Chris,
> > > >> > >> > >
> > > >> > >> > > Thanks a lot for the KIP. This will certainly be a useful
> > > >> feature. I
> > > >> > >> > would
> > > >> > >> > > have preferred to use the topic approach as well but I also
> > > >> > understand
> > > >> > >> > your
> > > >> > >> > > point of view about the operational complexity for
> upgrades.
> > If
> > > >> not
> > > >> > >> with
> > > >> > >> > > this KIP, I would certainly want to go that route at some
> > point
> > > >> in
> > > >> > the
> > > >> > >> > > future.
> > > >> > >> > >
> > > >> > >> > > As far as using the rebalance protocol goes, it would be
> > great
> > > if
> > > >> > you
> > > >> > >> > could
> > > >> > >> > > elaborate on what exactly would be the rebalance impact
> when
> > a
> > > >> key
> > > >> > >> > expires.
> > > >> > >> > > I see that you have called out saying that there should be
> no
> > > >> > >> significant
> > > >> > >> > > impact but it will be great to explicitly state as what is
> to
> > > be
> > > >> > >> > expected.
> > > >> > >> > > I would prefer to not have any sorts of rebalancing when
> this
> > > >> > happens
> > > >> > >> > since
> > > >> > >> > > the connector and task assignments should not change with
> > this
> > > >> > event.
> > > >> > >> It
> > > >> > >> > > will be useful to explain this a bit more.
> > > >> > >> > >
> > > >> > >> > > Thanks,
> > > >> > >> > > Magesh
> > > >> > >> > >
> > > >> > >> > > On Wed, Aug 21, 2019 at 4:40 PM Chris Egerton <
> > > >> chr...@confluent.io>
> > > >> > >> > wrote:
> > > >> > >> > >
> > > >> > >> > > > Hi all,
> > > >> > >> > > >
> > > >> > >> > > > I've made some tweaks to the KIP that I believe are
> > > >> improvements.
> > > >> > >> More
> > > >> > >> > > > detail can be found on the KIP page itself, but as a
> brief
> > > >> > summary,
> > > >> > >> the
> > > >> > >> > > > three changes are:
> > > >> > >> > > >
> > > >> > >> > > > - The removal of the internal.request.verification
> property
> > > in
> > > >> > >> favor of
> > > >> > >> > > > modifying the default value for the connect.protocol
> > property
> > > >> from
> > > >> > >> > > > "compatible" to "sessioned"
> > > >> > >> > > > - The renaming of some configurations to use better
> > > terminology
> > > >> > >> (mostly
> > > >> > >> > > > just "request" instead of "key" where appropriate)
> > > >> > >> > > > - The addition of two new configurations that dictate how
> > > >> session
> > > >> > >> keys
> > > >> > >> > > are
> > > >> > >> > > > to be generated
> > > >> > >> > > >
> > > >> > >> > > > Thanks Ryanne for the feedback so far, hope to hear from
> > some
> > > >> more
> > > >> > >> of
> > > >> > >> > you
> > > >> > >> > > > soon :)
> > > >> > >> > > >
> > > >> > >> > > > Cheers,
> > > >> > >> > > >
> > > >> > >> > > > Chris
> > > >> > >> > > >
> > > >> > >> > > > On Mon, Aug 19, 2019 at 12:02 PM Chris Egerton <
> > > >> > chr...@confluent.io
> > > >> > >> >
> > > >> > >> > > > wrote:
> > > >> > >> > > >
> > > >> > >> > > > > Hi Ryanne,
> > > >> > >> > > > >
> > > >> > >> > > > > The reasoning for this is provided in the KIP: "There
> > would
> > > >> be
> > > >> > no
> > > >> > >> > clear
> > > >> > >> > > > > way to achieve consensus amongst the workers in a
> cluster
> > > on
> > > >> > >> whether
> > > >> > >> > to
> > > >> > >> > > > > switch to this new behavior." To elaborate on this--it
> > > would
> > > >> be
> > > >> > >> bad
> > > >> > >> > if
> > > >> > >> > > a
> > > >> > >> > > > > follower worker began writing task configurations to
> the
> > > >> topic
> > > >> > >> while
> > > >> > >> > > the
> > > >> > >> > > > > leader continued to only listen on the internal REST
> > > >> endpoint.
> > > >> > >> It's
> > > >> > >> > > > > necessary to ensure that every worker in a cluster
> > supports
> > > >> this
> > > >> > >> new
> > > >> > >> > > > > behavior before switching it on.
> > > >> > >> > > > >
> > > >> > >> > > > > To be fair, it is technically possible to achieve
> > consensus
> > > >> > >> without
> > > >> > >> > > using
> > > >> > >> > > > > the group coordination protocol by adding new
> > > configurations
> > > >> and
> > > >> > >> > using
> > > >> > >> > > a
> > > >> > >> > > > > multi-phase rolling upgrade, but the operational
> > complexity
> > > >> of
> > > >> > >> this
> > > >> > >> > > > > approach would significantly complicate things for the
> > > >> default
> > > >> > >> case
> > > >> > >> > of
> > > >> > >> > > > just
> > > >> > >> > > > > wanting to bump your Connect cluster to the next
> version
> > > >> without
> > > >> > >> > having
> > > >> > >> > > > to
> > > >> > >> > > > > alter your existing worker config files and with only a
> > > >> single
> > > >> > >> > restart
> > > >> > >> > > > per
> > > >> > >> > > > > worker. The proposed approach provides a much simpler
> > user
> > > >> > >> experience
> > > >> > >> > > for
> > > >> > >> > > > > the most common use case and doesn't impose much
> > additional
> > > >> > >> > complexity
> > > >> > >> > > > for
> > > >> > >> > > > > other use cases.
> > > >> > >> > > > >
> > > >> > >> > > > > I've updated the KIP to expand on points from your last
> > > >> emails;
> > > >> > >> let
> > > >> > >> > me
> > > >> > >> > > > > know what you think.
> > > >> > >> > > > >
> > > >> > >> > > > > Cheers,
> > > >> > >> > > > >
> > > >> > >> > > > > Chris
> > > >> > >> > > > >
> > > >> > >> > > > > On Thu, Aug 15, 2019 at 4:53 PM Ryanne Dolan <
> > > >> > >> ryannedo...@gmail.com>
> > > >> > >> > > > > wrote:
> > > >> > >> > > > >
> > > >> > >> > > > >> Thanks Chris, that makes sense.
> > > >> > >> > > > >>
> > > >> > >> > > > >> I know you have already considered this, but I'm not
> > > >> convinced
> > > >> > we
> > > >> > >> > > should
> > > >> > >> > > > >> rule out using Kafka topics for this purpose. That
> would
> > > >> enable
> > > >> > >> the
> > > >> > >> > > same
> > > >> > >> > > > >> level of security without introducing any new
> > > >> authentication or
> > > >> > >> > > > >> authorization mechanisms (your keys). And as you say,
> > > you'd
> > > >> > need
> > > >> > >> to
> > > >> > >> > > lock
> > > >> > >> > > > >> down Connect's topics and groups anyway.
> > > >> > >> > > > >>
> > > >> > >> > > > >> Can you explain what you mean when you say using Kafka
> > > >> topics
> > > >> > >> would
> > > >> > >> > > > >> require
> > > >> > >> > > > >> "reworking the group coordination protocol"? I don't
> see
> > > how
> > > >> > >> these
> > > >> > >> > are
> > > >> > >> > > > >> related. Why would it matter if the workers sent Kafka
> > > >> messages
> > > >> > >> vs
> > > >> > >> > > POST
> > > >> > >> > > > >> requests among themselves?
> > > >> > >> > > > >>
> > > >> > >> > > > >> Ryanne
> > > >> > >> > > > >>
> > > >> > >> > > > >> On Thu, Aug 15, 2019 at 3:57 PM Chris Egerton <
> > > >> > >> chr...@confluent.io>
> > > >> > >> > > > >> wrote:
> > > >> > >> > > > >>
> > > >> > >> > > > >> > Hi Ryanne,
> > > >> > >> > > > >> >
> > > >> > >> > > > >> > Yes, if the Connect group is left unsecured then
> that
> > > is a
> > > >> > >> > potential
> > > >> > >> > > > >> > vulnerability. However, in a truly secure Connect
> > > cluster,
> > > >> > the
> > > >> > >> > group
> > > >> > >> > > > >> would
> > > >> > >> > > > >> > need to be secured anyways to prevent attackers from
> > > >> joining
> > > >> > >> the
> > > >> > >> > > group
> > > >> > >> > > > >> with
> > > >> > >> > > > >> > the intent to either snoop on connector/task
> > > >> configurations
> > > >> > or
> > > >> > >> > bring
> > > >> > >> > > > the
> > > >> > >> > > > >> > cluster to a halt by spamming the group with
> > membership
> > > >> > >> requests
> > > >> > >> > and
> > > >> > >> > > > >> then
> > > >> > >> > > > >> > not running the assigned connectors/tasks.
> > Additionally,
> > > >> for
> > > >> > a
> > > >> > >> > > Connect
> > > >> > >> > > > >> > cluster to be secure, access to internal topics (for
> > > >> configs,
> > > >> > >> > > offsets,
> > > >> > >> > > > >> and
> > > >> > >> > > > >> > statuses) would also need to be restricted so that
> > > >> attackers
> > > >> > >> could
> > > >> > >> > > > not,
> > > >> > >> > > > >> > e.g., write arbitrary connector/task configurations
> to
> > > the
> > > >> > >> configs
> > > >> > >> > > > >> topic.
> > > >> > >> > > > >> > This is all currently possible in Kafka with the use
> > of
> > > >> ACLs.
> > > >> > >> > > > >> >
> > > >> > >> > > > >> > I think the bottom line here is that there's a
> number
> > of
> > > >> > steps
> > > >> > >> > that
> > > >> > >> > > > >> need to
> > > >> > >> > > > >> > be taken to effectively lock down a Connect cluster;
> > the
> > > >> > point
> > > >> > >> of
> > > >> > >> > > this
> > > >> > >> > > > >> KIP
> > > >> > >> > > > >> > is to close a loophole that exists even after all of
> > > those
> > > >> > >> steps
> > > >> > >> > are
> > > >> > >> > > > >> taken,
> > > >> > >> > > > >> > not to completely secure this one vulnerability even
> > > when
> > > >> no
> > > >> > >> other
> > > >> > >> > > > >> security
> > > >> > >> > > > >> > measures are taken.
> > > >> > >> > > > >> >
> > > >> > >> > > > >> > Cheers,
> > > >> > >> > > > >> >
> > > >> > >> > > > >> > Chris
> > > >> > >> > > > >> >
> > > >> > >> > > > >> > On Wed, Aug 14, 2019 at 10:56 PM Ryanne Dolan <
> > > >> > >> > > ryannedo...@gmail.com>
> > > >> > >> > > > >> > wrote:
> > > >> > >> > > > >> >
> > > >> > >> > > > >> > > Chris, I don't understand how the rebalance
> protocol
> > > >> can be
> > > >> > >> used
> > > >> > >> > > to
> > > >> > >> > > > >> give
> > > >> > >> > > > >> > > out session tokens in a secure way. It seems that
> > any
> > > >> > >> attacker
> > > >> > >> > > could
> > > >> > >> > > > >> just
> > > >> > >> > > > >> > > join the group and sign requests with the provided
> > > >> token.
> > > >> > Am
> > > >> > >> I
> > > >> > >> > > > missing
> > > >> > >> > > > >> > > something?
> > > >> > >> > > > >> > >
> > > >> > >> > > > >> > > Ryanne
> > > >> > >> > > > >> > >
> > > >> > >> > > > >> > > On Wed, Aug 14, 2019, 2:31 PM Chris Egerton <
> > > >> > >> > chr...@confluent.io>
> > > >> > >> > > > >> wrote:
> > > >> > >> > > > >> > >
> > > >> > >> > > > >> > > > The KIP page can be found at
> > > >> > > > >> > > >
> > > >> > > > >> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-507%3A+Securing+Internal+Connect+REST+Endpoints
> > > >> > > > >> > > > ,
> > > >> > >> > > > >> > > > by the way. Apologies for neglecting to include
> it
> > > in
> > > >> my
> > > >> > >> > initial
> > > >> > >> > > > >> email!
> > > >> > >> > > > >> > > >
> > > >> > >> > > > >> > > > On Wed, Aug 14, 2019 at 12:29 PM Chris Egerton <
> > > >> > >> > > > chr...@confluent.io
> > > >> > >> > > > >> >
> > > >> > >> > > > >> > > > wrote:
> > > >> > >> > > > >> > > >
> > > >> > >> > > > >> > > > > Hi all,
> > > >> > >> > > > >> > > > >
> > > >> > >> > > > >> > > > > I'd like to start discussion on a KIP to
> secure
> > > the
> > > >> > >> internal
> > > >> > >> > > > "POST
> > > >> > >> > > > >> > > > > /connectors/<name>/tasks" endpoint for the
> > Connect
> > > >> > >> > framework.
> > > >> > >> > > > The
> > > >> > >> > > > >> > > > proposed
> > > >> > >> > > > >> > > > > changes address a vulnerability in the
> framework
> > > in
> > > >> its
> > > >> > >> > > current
> > > >> > >> > > > >> state
> > > >> > >> > > > >> > > > that
> > > >> > >> > > > >> > > > > allows malicious users to write arbitrary task
> > > >> > >> > configurations
> > > >> > >> > > > for
> > > >> > >> > > > >> > > > > connectors; it is vital that this issue be
> > > >> addressed in
> > > >> > >> > order
> > > >> > >> > > > for
> > > >> > >> > > > >> any
> > > >> > >> > > > >> > > > > Connect cluster to be secure.
> > > >> > >> > > > >> > > > >
> > > >> > >> > > > >> > > > > Looking forward to your thoughts,
> > > >> > >> > > > >> > > > >
> > > >> > >> > > > >> > > > > Chris
> > > >> > >> > > > >> > > > >
> > > >> > >> > > > >> > > >
> > > >> > >> > > > >> > >
> > > >> > >> > > > >> >
> > > >> > >> > > > >>
> > > >> > >> > > > >
> > > >> > >> > > >
> > > >> > >> > >
> > > >> > >> >
> > > >> > >>
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>
