Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints

Chris Egerton Fri, 06 Sep 2019 14:49:44 -0700

Hi all,

I've published a draft PR implementing the changes currently proposed in
KIP-507, which can be found at https://github.com/apache/kafka/pull/7310.
Thanks for all of the review and feedback so far! Given the lack of any
major voiced reservations so far, if I don't hear anything else over the
course of the next few days or so I'll be opening this for a vote.


Cheers,

Chris


On Wed, Aug 28, 2019 at 12:21 PM Chris Egerton <chr...@confluent.io> wrote:

> HI all,
>
> Wow, thanks for all the feedback! Happy to see this proposal getting some
> love :)
>
>
> RE Konstantine's comments:
>
> > I've read the discussions regarding why the rebalancing protocol is used
> here and your intention to follow the approach which was recently used in
> order to elegantly support upgrades without requiring rolling restarts
> makes sense.
> > However, I'd like to revisit the proposed extension of the connect
> protocol
> going a little bit more in detail.
> Indeed, broadcasting the session key to the followers is necessary. But
> adding it to the configs topic with a new key is compatible with previous
> versions (the key will be ignored by workers that don't support this
> protocol) and works since all workers will need to read this topic to the
> end to remain in the group.
> > Additionally, to your point about consensus, meaning how the workers all
> know during the same generation that the "sessioned" version of the
> protocol was chosen by the leader, it seems to me that an explicit upgrade
> of the protocol on the wire, on its name and on the metadata that are sent
> with the join request by each worker might not be required. What I believe
> would suffice is merely an increment of the version of the protocol,
> because all the other information remains the same (for example, the list
> of assigned and revoked tasks, etc). This would allow us not to extend the
> metadata even more with yet another JoinGroupRequest and could achieve what
> you want in terms of an orchestrated upgrade.
>
> I really like this idea, thanks for suggesting it. A simple bump of the
> protocol version does seem sufficient to achieve consensus among workers
> and would significantly reduce the implementation complexity of an entirely
> separate protocol just for the sake of distributing keys, and using the
> config topic to relay keys instead of the group coordination protocol would
> prevent rebalances solely for the sake of key rotation (instead, the leader
> can just periodically write a new key to the config topic). I'll update the
> KIP by EOD today with these changes.
>
> > All the leader worker would have to check is whether all the assignments
> that it took were in "sessioned" version (possibly CONNECT_PROTOCOL_V2=2 if
> you'd choose to go this route).
>
> I like this idea for dictating whether the leader requires request
> signing, SGTM. I think we can have all workers use the latest session key
> present in the config topic (if there is one) to sign their requests; since
> the signatures are passed via header, this shouldn't cause any problems if
> a follower signs a request that gets sent to a leader that isn't performing
> request validation.
>
>
> RE Magesh's comments:
>
> > Thanks a lot for the KIP. This will certainly be a useful feature. I
> would
> have preferred to use the topic approach as well but I also understand your
> point of view about the operational complexity for upgrades. If not with
> this KIP, I would certainly want to go that route at some point in the
> future.
>
> I'm actually a little conflicted on this front. If this KIP passes and we
> have a secure way to perform quick, synchronous, follower-to-leader
> communication, it may be better to hold onto and even perhaps expand on in
> case this functionality comes in handy later. Additionally, there would
> have to be a lot of thought put into the semantics of using topic-based
> communication for relaying task configs to the leader due to its
> asynchronous nature; I'm not convinced it's not worth it, but I'm also not
> entirely convinced it is.
>
> > As far as using the rebalance protocol goes, it would be great if you
> could
> elaborate on what exactly would be the rebalance impact when a key expires.
> I see that you have called out saying that there should be no significant
> impact but it will be great to explicitly state as what is to be expected.
> I would prefer to not have any sorts of rebalancing when this happens since
> the connector and task assignments should not change with this event. It
> will be useful to explain this a bit more.
>
> Given Konstantine's comments, this question isn't strictly relevant
> anymore (unless we decide to return to the approach of distributing session
> keys during rebalance), but just for completeness it does seem worth
> addressing. The "rebalances" proposed for key rotation would be achieved by
> a request from the leader the rejoin the group. This would immediately
> result in the leader being asked to assign connectors and tasks to the
> group, which would exactly match the existing assignments, so with that and
> the help of the new incremental cooperative rebalance semantics there would
> be zero downtime for connectors or tasks. Truly, "rebalance" is a bit of a
> misnomer here; we're really talking about performing group assignment
> (which doesn't necessarily imply adding or removing connectors/tasks
> to/from any workers), but as far as I've seen the two are basically used
> synonymously w/r/t Connect so I elected to use the more common, albeit
> slightly less accurate, term.
>
>
> RE Greg's comments:
>
> > Does this roll-out ratchet the protocol version irreversibly?
>
> Nope, the rollout will be achieved by bumping the protocol (or, after
> adjusting for Konstantine's suggestion, just the protocol version) for each
> worker, but the Kafka group coordinator will only begin using that new
> protocol if every worker indicates that that protocol is supported when it
> joins the group. So, if a cluster existed with workers that all supported
> protocols 1 and 2 and indicated a preference for protocol 2, the cluster
> would use protocol 2. However, if a new worker that only supported protocol
> 1 joined the cluster, the entire cluster would be bumped back down to
> protocol 1. If/when that worker leaves or indicates it supports protocol 2,
> the cluster would bump back up to protocol 2.
>
> >What is the expected behavior when the feature has been enabled, and a
> worker joins the
> group while advertising an old protocol?
>
> I partially answered this above; the entire cluster would downgrade its
> protocol version. The practical effect of this would be that request
> signature verification by the leader would immediately cease.
>
> > I think it's reasonable to prevent single-worker downgrades, since this
> is an attack vector.
> Actually, this is fine. Groups are a protected resource in Kafka; if you
> want to ensure this doesn't happen, just secure your Kafka broker and use
> something like ACLs to make sure only authorized entities can join your
> group.
>
> > But what happens if every worker in a cluster goes down, and at least
> the first worker comes
> back with an old protocol? This is a potential attack that requires access
> to the workers, but could also be used by customers to intentionally
> downgrade their cluster.
>
> This kind of attack actually doesn't require access to any of the Connect
> workers; group coordination is managed by the Kafka cluster. So, even if
> your entire Connect cluster crashes and then a malicious user tries to join
> the group, as long as your Kafka broker has restricted access to the group
> used by your Connect cluster, that attack would still fail.
>
> > You mentioned that this security improvement only adds security when
> there
> are a number of existing security changes in place, one of these being ACLs
> on the Kafka broker. Do you plan to gate this feature roll-out on detecting
> that those prerequisite features are enabled?
>
> I wasn't planning on it. With the pluggable nature of Kafka and Kafka
> Connect authorization, I imagine it'd be pretty difficult to know if all of
> the important resources (worker group, internal topics, and Kafka Connect
> REST API are a few that come to mind) are actually secured. Additionally,
> the presence of other attack vectors shouldn't be justification for opening
> up a new one; it seems particularly dangers in the event that someone
> accidentally misconfigures their Connect cluster and this endpoint ends up
> being exposed even though the user believes it to be protected.
>
> > If not, I think this feature adds no additional security to unsecured
> clusters, while still incurring
> the overhead on signing requests.
>
> This seems like a bit of an overstatement. The changes wouldn't add no
> additional security; I think restricting even accidental access to the
> internal REST endpoint is a good thing since it's not part of the public
> API and it's all but guaranteed that if someone makes a request to that
> endpoint they're either confused or malicious. Adding another hurdle in the
> way for malicious users is also beneficial, even if it's not a
> comprehensive solution.
>
> > Do you know how significant the overhead will be, as a rough estimate?
>
> Not very significant. The overhead would only be incurred when followers
> need to relay task configurations to the leader, which shouldn't happen
> very frequently in any stable connector. If that is happening frequently,
> the overhead for starting new tasks or stopping/reconfiguring existing
> tasks would probably be orders of magnitude higher anyways.
>
>
> Thanks again to Magesh, Konstantine, and Greg for your feedback. I'll be
> updating the KIP to reflect changes mentioned here and to include answers
> to some of the questions raised; looking forward to the next round of
> comments.
>
> Cheers,
>
> Chris
>
> On Wed, Aug 28, 2019 at 9:02 AM Greg Harris <gr...@confluent.io> wrote:
>
>> Hey Chris,
>>
>> The KIP makes sense to me, and I think it's a very natural way to solve
>> the
>> issue at hand. I had two questions about the automatic rollout logic.
>>
>> Does this roll-out ratchet the protocol version irreversibly? What is the
>> expected behavior when the feature has been enabled, and a worker joins
>> the
>> group while advertising an old protocol? I think it's reasonable to
>> prevent
>> single-worker downgrades, since this is an attack vector. But what happens
>> if every worker in a cluster goes down, and at least the first worker
>> comes
>> back with an old protocol? This is a potential attack that requires access
>> to the workers, but could also be used by customers to intentionally
>> downgrade their cluster.
>>
>> You mentioned that this security improvement only adds security when there
>> are a number of existing security changes in place, one of these being
>> ACLs
>> on the Kafka broker. Do you plan to gate this feature roll-out on
>> detecting
>> that those prerequisite features are enabled? If not, I think this feature
>> adds no additional security to unsecured clusters, while still incurring
>> the overhead on signing requests. Do you know how significant the overhead
>> will be, as a rough estimate?
>>
>> Looks great!
>>
>> Thanks,
>> Greg
>>
>>
>> On Tue, Aug 27, 2019 at 8:38 PM Konstantine Karantasis <
>> konstant...@confluent.io> wrote:
>>
>> > Thanks for the KIP Chris!
>> >
>> > This proposal will fill a gap in Kafka Connect deployments and will
>> > strengthen their production readiness in even more diverse
>> environments, in
>> > which current workarounds are less practical.
>> >
>> > I've read the discussions regarding why the rebalancing protocol is used
>> > here and your intention to follow the approach which was recently used
>> in
>> > order to elegantly support upgrades without requiring rolling restarts
>> > makes sense.
>> >
>> > However, I'd like to revisit the proposed extension of the connect
>> protocol
>> > going a little bit more in detail.
>> > Indeed, broadcasting the session key to the followers is necessary. But
>> > adding it to the configs topic with a new key is compatible with
>> previous
>> > versions (the key will be ignored by workers that don't support this
>> > protocol) and works since all workers will need to read this topic to
>> the
>> > end to remain in the group.
>> >
>> > Additionally, to your point about consensus, meaning how the workers all
>> > know during the same generation that the "sessioned" version of the
>> > protocol was chosen by the leader, it seems to me that an explicit
>> upgrade
>> > of the protocol on the wire, on its name and on the metadata that are
>> sent
>> > with the join request by each worker might not be required. What I
>> believe
>> > would suffice is merely an increment of the version of the protocol,
>> > because all the other information remains the same (for example, the
>> list
>> > of assigned and revoked tasks, etc). This would allow us not to extend
>> the
>> > metadata even more with yet another JoinGroupRequest and could achieve
>> what
>> > you want in terms of an orchestrated upgrade.
>> >
>> > Without having exhaustively checked the various code paths, I feel that
>> > such an approach would be less heavy-handed in terms of extending the
>> > format of the connect protocol and would achieve a certain separation of
>> > concerns between the information that is required for rebalancing and is
>> > included in the protocol metadata and the information that is required
>> for
>> > securing the internal Connect REST endpoint. In theory, this method
>> could
>> > be used to even secure the eager version of the protocol, but I'd agree
>> > that this out of the scope of the current proposal.
>> >
>> > All the leader worker would have to check is whether all the assignments
>> > that it took were in "sessioned" version (possibly
>> CONNECT_PROTOCOL_V2=2 if
>> > you'd choose to go this route).
>> >
>> > Overall, this does not differ a lot from your current proposal, which I
>> > think is already in the right direction.
>> > Let me know what you think.
>> >
>> > Cheers,
>> > Konstantine
>> >
>> >
>> > On Tue, Aug 27, 2019 at 4:19 PM Magesh Nandakumar <mage...@confluent.io
>> >
>> > wrote:
>> >
>> > > Hi Chris,
>> > >
>> > > Thanks a lot for the KIP. This will certainly be a useful feature. I
>> > would
>> > > have preferred to use the topic approach as well but I also understand
>> > your
>> > > point of view about the operational complexity for upgrades. If not
>> with
>> > > this KIP, I would certainly want to go that route at some point in the
>> > > future.
>> > >
>> > > As far as using the rebalance protocol goes, it would be great if you
>> > could
>> > > elaborate on what exactly would be the rebalance impact when a key
>> > expires.
>> > > I see that you have called out saying that there should be no
>> significant
>> > > impact but it will be great to explicitly state as what is to be
>> > expected.
>> > > I would prefer to not have any sorts of rebalancing when this happens
>> > since
>> > > the connector and task assignments should not change with this event.
>> It
>> > > will be useful to explain this a bit more.
>> > >
>> > > Thanks,
>> > > Magesh
>> > >
>> > > On Wed, Aug 21, 2019 at 4:40 PM Chris Egerton <chr...@confluent.io>
>> > wrote:
>> > >
>> > > > Hi all,
>> > > >
>> > > > I've made some tweaks to the KIP that I believe are improvements.
>> More
>> > > > detail can be found on the KIP page itself, but as a brief summary,
>> the
>> > > > three changes are:
>> > > >
>> > > > - The removal of the internal.request.verification property in
>> favor of
>> > > > modifying the default value for the connect.protocol property from
>> > > > "compatible" to "sessioned"
>> > > > - The renaming of some configurations to use better terminology
>> (mostly
>> > > > just "request" instead of "key" where appropriate)
>> > > > - The addition of two new configurations that dictate how session
>> keys
>> > > are
>> > > > to be generated
>> > > >
>> > > > Thanks Ryanne for the feedback so far, hope to hear from some more
>> of
>> > you
>> > > > soon :)
>> > > >
>> > > > Cheers,
>> > > >
>> > > > Chris
>> > > >
>> > > > On Mon, Aug 19, 2019 at 12:02 PM Chris Egerton <chr...@confluent.io
>> >
>> > > > wrote:
>> > > >
>> > > > > Hi Ryanne,
>> > > > >
>> > > > > The reasoning for this is provided in the KIP: "There would be no
>> > clear
>> > > > > way to achieve consensus amongst the workers in a cluster on
>> whether
>> > to
>> > > > > switch to this new behavior." To elaborate on this--it would be
>> bad
>> > if
>> > > a
>> > > > > follower worker began writing task configurations to the topic
>> while
>> > > the
>> > > > > leader continued to only listen on the internal REST endpoint.
>> It's
>> > > > > necessary to ensure that every worker in a cluster supports this
>> new
>> > > > > behavior before switching it on.
>> > > > >
>> > > > > To be fair, it is technically possible to achieve consensus
>> without
>> > > using
>> > > > > the group coordination protocol by adding new configurations and
>> > using
>> > > a
>> > > > > multi-phase rolling upgrade, but the operational complexity of
>> this
>> > > > > approach would significantly complicate things for the default
>> case
>> > of
>> > > > just
>> > > > > wanting to bump your Connect cluster to the next version without
>> > having
>> > > > to
>> > > > > alter your existing worker config files and with only a single
>> > restart
>> > > > per
>> > > > > worker. The proposed approach provides a much simpler user
>> experience
>> > > for
>> > > > > the most common use case and doesn't impose much additional
>> > complexity
>> > > > for
>> > > > > other use cases.
>> > > > >
>> > > > > I've updated the KIP to expand on points from your last emails;
>> let
>> > me
>> > > > > know what you think.
>> > > > >
>> > > > > Cheers,
>> > > > >
>> > > > > Chris
>> > > > >
>> > > > > On Thu, Aug 15, 2019 at 4:53 PM Ryanne Dolan <
>> ryannedo...@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > >> Thanks Chris, that makes sense.
>> > > > >>
>> > > > >> I know you have already considered this, but I'm not convinced we
>> > > should
>> > > > >> rule out using Kafka topics for this purpose. That would enable
>> the
>> > > same
>> > > > >> level of security without introducing any new authentication or
>> > > > >> authorization mechanisms (your keys). And as you say, you'd need
>> to
>> > > lock
>> > > > >> down Connect's topics and groups anyway.
>> > > > >>
>> > > > >> Can you explain what you mean when you say using Kafka topics
>> would
>> > > > >> require
>> > > > >> "reworking the group coordination protocol"? I don't see how
>> these
>> > are
>> > > > >> related. Why would it matter if the workers sent Kafka messages
>> vs
>> > > POST
>> > > > >> requests among themselves?
>> > > > >>
>> > > > >> Ryanne
>> > > > >>
>> > > > >> On Thu, Aug 15, 2019 at 3:57 PM Chris Egerton <
>> chr...@confluent.io>
>> > > > >> wrote:
>> > > > >>
>> > > > >> > Hi Ryanne,
>> > > > >> >
>> > > > >> > Yes, if the Connect group is left unsecured then that is a
>> > potential
>> > > > >> > vulnerability. However, in a truly secure Connect cluster, the
>> > group
>> > > > >> would
>> > > > >> > need to be secured anyways to prevent attackers from joining
>> the
>> > > group
>> > > > >> with
>> > > > >> > the intent to either snoop on connector/task configurations or
>> > bring
>> > > > the
>> > > > >> > cluster to a halt by spamming the group with membership
>> requests
>> > and
>> > > > >> then
>> > > > >> > not running the assigned connectors/tasks. Additionally, for a
>> > > Connect
>> > > > >> > cluster to be secure, access to internal topics (for configs,
>> > > offsets,
>> > > > >> and
>> > > > >> > statuses) would also need to be restricted so that attackers
>> could
>> > > > not,
>> > > > >> > e.g., write arbitrary connector/task configurations to the
>> configs
>> > > > >> topic.
>> > > > >> > This is all currently possible in Kafka with the use of ACLs.
>> > > > >> >
>> > > > >> > I think the bottom line here is that there's a number of steps
>> > that
>> > > > >> need to
>> > > > >> > be taken to effectively lock down a Connect cluster; the point
>> of
>> > > this
>> > > > >> KIP
>> > > > >> > is to close a loophole that exists even after all of those
>> steps
>> > are
>> > > > >> taken,
>> > > > >> > not to completely secure this one vulnerability even when no
>> other
>> > > > >> security
>> > > > >> > measures are taken.
>> > > > >> >
>> > > > >> > Cheers,
>> > > > >> >
>> > > > >> > Chris
>> > > > >> >
>> > > > >> > On Wed, Aug 14, 2019 at 10:56 PM Ryanne Dolan <
>> > > ryannedo...@gmail.com>
>> > > > >> > wrote:
>> > > > >> >
>> > > > >> > > Chris, I don't understand how the rebalance protocol can be
>> used
>> > > to
>> > > > >> give
>> > > > >> > > out session tokens in a secure way. It seems that any
>> attacker
>> > > could
>> > > > >> just
>> > > > >> > > join the group and sign requests with the provided token. Am
>> I
>> > > > missing
>> > > > >> > > something?
>> > > > >> > >
>> > > > >> > > Ryanne
>> > > > >> > >
>> > > > >> > > On Wed, Aug 14, 2019, 2:31 PM Chris Egerton <
>> > chr...@confluent.io>
>> > > > >> wrote:
>> > > > >> > >
>> > > > >> > > > The KIP page can be found at
>> > > > >> > > >
>> > > > >> > > >
>> > > > >> > >
>> > > > >> >
>> > > > >>
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-507%3A+Securing+Internal+Connect+REST+Endpoints
>> > > > >> > > > ,
>> > > > >> > > > by the way. Apologies for neglecting to include it in my
>> > initial
>> > > > >> email!
>> > > > >> > > >
>> > > > >> > > > On Wed, Aug 14, 2019 at 12:29 PM Chris Egerton <
>> > > > chr...@confluent.io
>> > > > >> >
>> > > > >> > > > wrote:
>> > > > >> > > >
>> > > > >> > > > > Hi all,
>> > > > >> > > > >
>> > > > >> > > > > I'd like to start discussion on a KIP to secure the
>> internal
>> > > > "POST
>> > > > >> > > > > /connectors/<name>/tasks" endpoint for the Connect
>> > framework.
>> > > > The
>> > > > >> > > > proposed
>> > > > >> > > > > changes address a vulnerability in the framework in its
>> > > current
>> > > > >> state
>> > > > >> > > > that
>> > > > >> > > > > allows malicious users to write arbitrary task
>> > configurations
>> > > > for
>> > > > >> > > > > connectors; it is vital that this issue be addressed in
>> > order
>> > > > for
>> > > > >> any
>> > > > >> > > > > Connect cluster to be secure.
>> > > > >> > > > >
>> > > > >> > > > > Looking forward to your thoughts,
>> > > > >> > > > >
>> > > > >> > > > > Chris
>> > > > >> > > > >
>> > > > >> > > >
>> > > > >> > >
>> > > > >> >
>> > > > >>
>> > > > >
>> > > >
>> > >
>> >
>>
>

Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints

Reply via email to