HI all,

Wow, thanks for all the feedback! Happy to see this proposal getting some
love :)


RE Konstantine's comments:

> I've read the discussions regarding why the rebalancing protocol is used
here and your intention to follow the approach which was recently used in
order to elegantly support upgrades without requiring rolling restarts
makes sense.
> However, I'd like to revisit the proposed extension of the connect
protocol
going a little bit more in detail.
Indeed, broadcasting the session key to the followers is necessary. But
adding it to the configs topic with a new key is compatible with previous
versions (the key will be ignored by workers that don't support this
protocol) and works since all workers will need to read this topic to the
end to remain in the group.
> Additionally, to your point about consensus, meaning how the workers all
know during the same generation that the "sessioned" version of the
protocol was chosen by the leader, it seems to me that an explicit upgrade
of the protocol on the wire, on its name and on the metadata that are sent
with the join request by each worker might not be required. What I believe
would suffice is merely an increment of the version of the protocol,
because all the other information remains the same (for example, the list
of assigned and revoked tasks, etc). This would allow us not to extend the
metadata even more with yet another JoinGroupRequest and could achieve what
you want in terms of an orchestrated upgrade.

I really like this idea, thanks for suggesting it. A simple bump of the
protocol version does seem sufficient to achieve consensus among workers
and would significantly reduce the implementation complexity of an entirely
separate protocol just for the sake of distributing keys, and using the
config topic to relay keys instead of the group coordination protocol would
prevent rebalances solely for the sake of key rotation (instead, the leader
can just periodically write a new key to the config topic). I'll update the
KIP by EOD today with these changes.

> All the leader worker would have to check is whether all the assignments
that it took were in "sessioned" version (possibly CONNECT_PROTOCOL_V2=2 if
you'd choose to go this route).

I like this idea for dictating whether the leader requires request signing,
SGTM. I think we can have all workers use the latest session key present in
the config topic (if there is one) to sign their requests; since the
signatures are passed via header, this shouldn't cause any problems if a
follower signs a request that gets sent to a leader that isn't performing
request validation.


RE Magesh's comments:

> Thanks a lot for the KIP. This will certainly be a useful feature. I would
have preferred to use the topic approach as well but I also understand your
point of view about the operational complexity for upgrades. If not with
this KIP, I would certainly want to go that route at some point in the
future.

I'm actually a little conflicted on this front. If this KIP passes and we
have a secure way to perform quick, synchronous, follower-to-leader
communication, it may be better to hold onto and even perhaps expand on in
case this functionality comes in handy later. Additionally, there would
have to be a lot of thought put into the semantics of using topic-based
communication for relaying task configs to the leader due to its
asynchronous nature; I'm not convinced it's not worth it, but I'm also not
entirely convinced it is.

> As far as using the rebalance protocol goes, it would be great if you
could
elaborate on what exactly would be the rebalance impact when a key expires.
I see that you have called out saying that there should be no significant
impact but it will be great to explicitly state as what is to be expected.
I would prefer to not have any sorts of rebalancing when this happens since
the connector and task assignments should not change with this event. It
will be useful to explain this a bit more.

Given Konstantine's comments, this question isn't strictly relevant anymore
(unless we decide to return to the approach of distributing session keys
during rebalance), but just for completeness it does seem worth addressing.
The "rebalances" proposed for key rotation would be achieved by a request
from the leader the rejoin the group. This would immediately result in the
leader being asked to assign connectors and tasks to the group, which would
exactly match the existing assignments, so with that and the help of the
new incremental cooperative rebalance semantics there would be zero
downtime for connectors or tasks. Truly, "rebalance" is a bit of a misnomer
here; we're really talking about performing group assignment (which doesn't
necessarily imply adding or removing connectors/tasks to/from any workers),
but as far as I've seen the two are basically used synonymously w/r/t
Connect so I elected to use the more common, albeit slightly less accurate,
term.


RE Greg's comments:

> Does this roll-out ratchet the protocol version irreversibly?

Nope, the rollout will be achieved by bumping the protocol (or, after
adjusting for Konstantine's suggestion, just the protocol version) for each
worker, but the Kafka group coordinator will only begin using that new
protocol if every worker indicates that that protocol is supported when it
joins the group. So, if a cluster existed with workers that all supported
protocols 1 and 2 and indicated a preference for protocol 2, the cluster
would use protocol 2. However, if a new worker that only supported protocol
1 joined the cluster, the entire cluster would be bumped back down to
protocol 1. If/when that worker leaves or indicates it supports protocol 2,
the cluster would bump back up to protocol 2.

>What is the expected behavior when the feature has been enabled, and a
worker joins the
group while advertising an old protocol?

I partially answered this above; the entire cluster would downgrade its
protocol version. The practical effect of this would be that request
signature verification by the leader would immediately cease.

> I think it's reasonable to prevent single-worker downgrades, since this
is an attack vector.
Actually, this is fine. Groups are a protected resource in Kafka; if you
want to ensure this doesn't happen, just secure your Kafka broker and use
something like ACLs to make sure only authorized entities can join your
group.

> But what happens if every worker in a cluster goes down, and at least the
first worker comes
back with an old protocol? This is a potential attack that requires access
to the workers, but could also be used by customers to intentionally
downgrade their cluster.

This kind of attack actually doesn't require access to any of the Connect
workers; group coordination is managed by the Kafka cluster. So, even if
your entire Connect cluster crashes and then a malicious user tries to join
the group, as long as your Kafka broker has restricted access to the group
used by your Connect cluster, that attack would still fail.

> You mentioned that this security improvement only adds security when there
are a number of existing security changes in place, one of these being ACLs
on the Kafka broker. Do you plan to gate this feature roll-out on detecting
that those prerequisite features are enabled?

I wasn't planning on it. With the pluggable nature of Kafka and Kafka
Connect authorization, I imagine it'd be pretty difficult to know if all of
the important resources (worker group, internal topics, and Kafka Connect
REST API are a few that come to mind) are actually secured. Additionally,
the presence of other attack vectors shouldn't be justification for opening
up a new one; it seems particularly dangers in the event that someone
accidentally misconfigures their Connect cluster and this endpoint ends up
being exposed even though the user believes it to be protected.

> If not, I think this feature adds no additional security to unsecured
clusters, while still incurring
the overhead on signing requests.

This seems like a bit of an overstatement. The changes wouldn't add no
additional security; I think restricting even accidental access to the
internal REST endpoint is a good thing since it's not part of the public
API and it's all but guaranteed that if someone makes a request to that
endpoint they're either confused or malicious. Adding another hurdle in the
way for malicious users is also beneficial, even if it's not a
comprehensive solution.

> Do you know how significant the overhead will be, as a rough estimate?

Not very significant. The overhead would only be incurred when followers
need to relay task configurations to the leader, which shouldn't happen
very frequently in any stable connector. If that is happening frequently,
the overhead for starting new tasks or stopping/reconfiguring existing
tasks would probably be orders of magnitude higher anyways.


Thanks again to Magesh, Konstantine, and Greg for your feedback. I'll be
updating the KIP to reflect changes mentioned here and to include answers
to some of the questions raised; looking forward to the next round of
comments.

Cheers,

Chris

On Wed, Aug 28, 2019 at 9:02 AM Greg Harris <gr...@confluent.io> wrote:

> Hey Chris,
>
> The KIP makes sense to me, and I think it's a very natural way to solve the
> issue at hand. I had two questions about the automatic rollout logic.
>
> Does this roll-out ratchet the protocol version irreversibly? What is the
> expected behavior when the feature has been enabled, and a worker joins the
> group while advertising an old protocol? I think it's reasonable to prevent
> single-worker downgrades, since this is an attack vector. But what happens
> if every worker in a cluster goes down, and at least the first worker comes
> back with an old protocol? This is a potential attack that requires access
> to the workers, but could also be used by customers to intentionally
> downgrade their cluster.
>
> You mentioned that this security improvement only adds security when there
> are a number of existing security changes in place, one of these being ACLs
> on the Kafka broker. Do you plan to gate this feature roll-out on detecting
> that those prerequisite features are enabled? If not, I think this feature
> adds no additional security to unsecured clusters, while still incurring
> the overhead on signing requests. Do you know how significant the overhead
> will be, as a rough estimate?
>
> Looks great!
>
> Thanks,
> Greg
>
>
> On Tue, Aug 27, 2019 at 8:38 PM Konstantine Karantasis <
> konstant...@confluent.io> wrote:
>
> > Thanks for the KIP Chris!
> >
> > This proposal will fill a gap in Kafka Connect deployments and will
> > strengthen their production readiness in even more diverse environments,
> in
> > which current workarounds are less practical.
> >
> > I've read the discussions regarding why the rebalancing protocol is used
> > here and your intention to follow the approach which was recently used in
> > order to elegantly support upgrades without requiring rolling restarts
> > makes sense.
> >
> > However, I'd like to revisit the proposed extension of the connect
> protocol
> > going a little bit more in detail.
> > Indeed, broadcasting the session key to the followers is necessary. But
> > adding it to the configs topic with a new key is compatible with previous
> > versions (the key will be ignored by workers that don't support this
> > protocol) and works since all workers will need to read this topic to the
> > end to remain in the group.
> >
> > Additionally, to your point about consensus, meaning how the workers all
> > know during the same generation that the "sessioned" version of the
> > protocol was chosen by the leader, it seems to me that an explicit
> upgrade
> > of the protocol on the wire, on its name and on the metadata that are
> sent
> > with the join request by each worker might not be required. What I
> believe
> > would suffice is merely an increment of the version of the protocol,
> > because all the other information remains the same (for example, the list
> > of assigned and revoked tasks, etc). This would allow us not to extend
> the
> > metadata even more with yet another JoinGroupRequest and could achieve
> what
> > you want in terms of an orchestrated upgrade.
> >
> > Without having exhaustively checked the various code paths, I feel that
> > such an approach would be less heavy-handed in terms of extending the
> > format of the connect protocol and would achieve a certain separation of
> > concerns between the information that is required for rebalancing and is
> > included in the protocol metadata and the information that is required
> for
> > securing the internal Connect REST endpoint. In theory, this method could
> > be used to even secure the eager version of the protocol, but I'd agree
> > that this out of the scope of the current proposal.
> >
> > All the leader worker would have to check is whether all the assignments
> > that it took were in "sessioned" version (possibly CONNECT_PROTOCOL_V2=2
> if
> > you'd choose to go this route).
> >
> > Overall, this does not differ a lot from your current proposal, which I
> > think is already in the right direction.
> > Let me know what you think.
> >
> > Cheers,
> > Konstantine
> >
> >
> > On Tue, Aug 27, 2019 at 4:19 PM Magesh Nandakumar <mage...@confluent.io>
> > wrote:
> >
> > > Hi Chris,
> > >
> > > Thanks a lot for the KIP. This will certainly be a useful feature. I
> > would
> > > have preferred to use the topic approach as well but I also understand
> > your
> > > point of view about the operational complexity for upgrades. If not
> with
> > > this KIP, I would certainly want to go that route at some point in the
> > > future.
> > >
> > > As far as using the rebalance protocol goes, it would be great if you
> > could
> > > elaborate on what exactly would be the rebalance impact when a key
> > expires.
> > > I see that you have called out saying that there should be no
> significant
> > > impact but it will be great to explicitly state as what is to be
> > expected.
> > > I would prefer to not have any sorts of rebalancing when this happens
> > since
> > > the connector and task assignments should not change with this event.
> It
> > > will be useful to explain this a bit more.
> > >
> > > Thanks,
> > > Magesh
> > >
> > > On Wed, Aug 21, 2019 at 4:40 PM Chris Egerton <chr...@confluent.io>
> > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I've made some tweaks to the KIP that I believe are improvements.
> More
> > > > detail can be found on the KIP page itself, but as a brief summary,
> the
> > > > three changes are:
> > > >
> > > > - The removal of the internal.request.verification property in favor
> of
> > > > modifying the default value for the connect.protocol property from
> > > > "compatible" to "sessioned"
> > > > - The renaming of some configurations to use better terminology
> (mostly
> > > > just "request" instead of "key" where appropriate)
> > > > - The addition of two new configurations that dictate how session
> keys
> > > are
> > > > to be generated
> > > >
> > > > Thanks Ryanne for the feedback so far, hope to hear from some more of
> > you
> > > > soon :)
> > > >
> > > > Cheers,
> > > >
> > > > Chris
> > > >
> > > > On Mon, Aug 19, 2019 at 12:02 PM Chris Egerton <chr...@confluent.io>
> > > > wrote:
> > > >
> > > > > Hi Ryanne,
> > > > >
> > > > > The reasoning for this is provided in the KIP: "There would be no
> > clear
> > > > > way to achieve consensus amongst the workers in a cluster on
> whether
> > to
> > > > > switch to this new behavior." To elaborate on this--it would be bad
> > if
> > > a
> > > > > follower worker began writing task configurations to the topic
> while
> > > the
> > > > > leader continued to only listen on the internal REST endpoint. It's
> > > > > necessary to ensure that every worker in a cluster supports this
> new
> > > > > behavior before switching it on.
> > > > >
> > > > > To be fair, it is technically possible to achieve consensus without
> > > using
> > > > > the group coordination protocol by adding new configurations and
> > using
> > > a
> > > > > multi-phase rolling upgrade, but the operational complexity of this
> > > > > approach would significantly complicate things for the default case
> > of
> > > > just
> > > > > wanting to bump your Connect cluster to the next version without
> > having
> > > > to
> > > > > alter your existing worker config files and with only a single
> > restart
> > > > per
> > > > > worker. The proposed approach provides a much simpler user
> experience
> > > for
> > > > > the most common use case and doesn't impose much additional
> > complexity
> > > > for
> > > > > other use cases.
> > > > >
> > > > > I've updated the KIP to expand on points from your last emails; let
> > me
> > > > > know what you think.
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Chris
> > > > >
> > > > > On Thu, Aug 15, 2019 at 4:53 PM Ryanne Dolan <
> ryannedo...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Thanks Chris, that makes sense.
> > > > >>
> > > > >> I know you have already considered this, but I'm not convinced we
> > > should
> > > > >> rule out using Kafka topics for this purpose. That would enable
> the
> > > same
> > > > >> level of security without introducing any new authentication or
> > > > >> authorization mechanisms (your keys). And as you say, you'd need
> to
> > > lock
> > > > >> down Connect's topics and groups anyway.
> > > > >>
> > > > >> Can you explain what you mean when you say using Kafka topics
> would
> > > > >> require
> > > > >> "reworking the group coordination protocol"? I don't see how these
> > are
> > > > >> related. Why would it matter if the workers sent Kafka messages vs
> > > POST
> > > > >> requests among themselves?
> > > > >>
> > > > >> Ryanne
> > > > >>
> > > > >> On Thu, Aug 15, 2019 at 3:57 PM Chris Egerton <
> chr...@confluent.io>
> > > > >> wrote:
> > > > >>
> > > > >> > Hi Ryanne,
> > > > >> >
> > > > >> > Yes, if the Connect group is left unsecured then that is a
> > potential
> > > > >> > vulnerability. However, in a truly secure Connect cluster, the
> > group
> > > > >> would
> > > > >> > need to be secured anyways to prevent attackers from joining the
> > > group
> > > > >> with
> > > > >> > the intent to either snoop on connector/task configurations or
> > bring
> > > > the
> > > > >> > cluster to a halt by spamming the group with membership requests
> > and
> > > > >> then
> > > > >> > not running the assigned connectors/tasks. Additionally, for a
> > > Connect
> > > > >> > cluster to be secure, access to internal topics (for configs,
> > > offsets,
> > > > >> and
> > > > >> > statuses) would also need to be restricted so that attackers
> could
> > > > not,
> > > > >> > e.g., write arbitrary connector/task configurations to the
> configs
> > > > >> topic.
> > > > >> > This is all currently possible in Kafka with the use of ACLs.
> > > > >> >
> > > > >> > I think the bottom line here is that there's a number of steps
> > that
> > > > >> need to
> > > > >> > be taken to effectively lock down a Connect cluster; the point
> of
> > > this
> > > > >> KIP
> > > > >> > is to close a loophole that exists even after all of those steps
> > are
> > > > >> taken,
> > > > >> > not to completely secure this one vulnerability even when no
> other
> > > > >> security
> > > > >> > measures are taken.
> > > > >> >
> > > > >> > Cheers,
> > > > >> >
> > > > >> > Chris
> > > > >> >
> > > > >> > On Wed, Aug 14, 2019 at 10:56 PM Ryanne Dolan <
> > > ryannedo...@gmail.com>
> > > > >> > wrote:
> > > > >> >
> > > > >> > > Chris, I don't understand how the rebalance protocol can be
> used
> > > to
> > > > >> give
> > > > >> > > out session tokens in a secure way. It seems that any attacker
> > > could
> > > > >> just
> > > > >> > > join the group and sign requests with the provided token. Am I
> > > > missing
> > > > >> > > something?
> > > > >> > >
> > > > >> > > Ryanne
> > > > >> > >
> > > > >> > > On Wed, Aug 14, 2019, 2:31 PM Chris Egerton <
> > chr...@confluent.io>
> > > > >> wrote:
> > > > >> > >
> > > > >> > > > The KIP page can be found at
> > > > >> > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-507%3A+Securing+Internal+Connect+REST+Endpoints
> > > > >> > > > ,
> > > > >> > > > by the way. Apologies for neglecting to include it in my
> > initial
> > > > >> email!
> > > > >> > > >
> > > > >> > > > On Wed, Aug 14, 2019 at 12:29 PM Chris Egerton <
> > > > chr...@confluent.io
> > > > >> >
> > > > >> > > > wrote:
> > > > >> > > >
> > > > >> > > > > Hi all,
> > > > >> > > > >
> > > > >> > > > > I'd like to start discussion on a KIP to secure the
> internal
> > > > "POST
> > > > >> > > > > /connectors/<name>/tasks" endpoint for the Connect
> > framework.
> > > > The
> > > > >> > > > proposed
> > > > >> > > > > changes address a vulnerability in the framework in its
> > > current
> > > > >> state
> > > > >> > > > that
> > > > >> > > > > allows malicious users to write arbitrary task
> > configurations
> > > > for
> > > > >> > > > > connectors; it is vital that this issue be addressed in
> > order
> > > > for
> > > > >> any
> > > > >> > > > > Connect cluster to be secure.
> > > > >> > > > >
> > > > >> > > > > Looking forward to your thoughts,
> > > > >> > > > >
> > > > >> > > > > Chris
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>

Reply via email to