Hi all, I've published a draft PR implementing the changes currently proposed in KIP-507, which can be found at https://github.com/apache/kafka/pull/7310. Thanks for all of the review and feedback so far! Given the lack of any major voiced reservations so far, if I don't hear anything else over the course of the next few days or so I'll be opening this for a vote.
Cheers, Chris On Wed, Aug 28, 2019 at 12:21 PM Chris Egerton <chr...@confluent.io> wrote: > HI all, > > Wow, thanks for all the feedback! Happy to see this proposal getting some > love :) > > > RE Konstantine's comments: > > > I've read the discussions regarding why the rebalancing protocol is used > here and your intention to follow the approach which was recently used in > order to elegantly support upgrades without requiring rolling restarts > makes sense. > > However, I'd like to revisit the proposed extension of the connect > protocol > going a little bit more in detail. > Indeed, broadcasting the session key to the followers is necessary. But > adding it to the configs topic with a new key is compatible with previous > versions (the key will be ignored by workers that don't support this > protocol) and works since all workers will need to read this topic to the > end to remain in the group. > > Additionally, to your point about consensus, meaning how the workers all > know during the same generation that the "sessioned" version of the > protocol was chosen by the leader, it seems to me that an explicit upgrade > of the protocol on the wire, on its name and on the metadata that are sent > with the join request by each worker might not be required. What I believe > would suffice is merely an increment of the version of the protocol, > because all the other information remains the same (for example, the list > of assigned and revoked tasks, etc). This would allow us not to extend the > metadata even more with yet another JoinGroupRequest and could achieve what > you want in terms of an orchestrated upgrade. > > I really like this idea, thanks for suggesting it. A simple bump of the > protocol version does seem sufficient to achieve consensus among workers > and would significantly reduce the implementation complexity of an entirely > separate protocol just for the sake of distributing keys, and using the > config topic to relay keys instead of the group coordination protocol would > prevent rebalances solely for the sake of key rotation (instead, the leader > can just periodically write a new key to the config topic). I'll update the > KIP by EOD today with these changes. > > > All the leader worker would have to check is whether all the assignments > that it took were in "sessioned" version (possibly CONNECT_PROTOCOL_V2=2 if > you'd choose to go this route). > > I like this idea for dictating whether the leader requires request > signing, SGTM. I think we can have all workers use the latest session key > present in the config topic (if there is one) to sign their requests; since > the signatures are passed via header, this shouldn't cause any problems if > a follower signs a request that gets sent to a leader that isn't performing > request validation. > > > RE Magesh's comments: > > > Thanks a lot for the KIP. This will certainly be a useful feature. I > would > have preferred to use the topic approach as well but I also understand your > point of view about the operational complexity for upgrades. If not with > this KIP, I would certainly want to go that route at some point in the > future. > > I'm actually a little conflicted on this front. If this KIP passes and we > have a secure way to perform quick, synchronous, follower-to-leader > communication, it may be better to hold onto and even perhaps expand on in > case this functionality comes in handy later. Additionally, there would > have to be a lot of thought put into the semantics of using topic-based > communication for relaying task configs to the leader due to its > asynchronous nature; I'm not convinced it's not worth it, but I'm also not > entirely convinced it is. > > > As far as using the rebalance protocol goes, it would be great if you > could > elaborate on what exactly would be the rebalance impact when a key expires. > I see that you have called out saying that there should be no significant > impact but it will be great to explicitly state as what is to be expected. > I would prefer to not have any sorts of rebalancing when this happens since > the connector and task assignments should not change with this event. It > will be useful to explain this a bit more. > > Given Konstantine's comments, this question isn't strictly relevant > anymore (unless we decide to return to the approach of distributing session > keys during rebalance), but just for completeness it does seem worth > addressing. The "rebalances" proposed for key rotation would be achieved by > a request from the leader the rejoin the group. This would immediately > result in the leader being asked to assign connectors and tasks to the > group, which would exactly match the existing assignments, so with that and > the help of the new incremental cooperative rebalance semantics there would > be zero downtime for connectors or tasks. Truly, "rebalance" is a bit of a > misnomer here; we're really talking about performing group assignment > (which doesn't necessarily imply adding or removing connectors/tasks > to/from any workers), but as far as I've seen the two are basically used > synonymously w/r/t Connect so I elected to use the more common, albeit > slightly less accurate, term. > > > RE Greg's comments: > > > Does this roll-out ratchet the protocol version irreversibly? > > Nope, the rollout will be achieved by bumping the protocol (or, after > adjusting for Konstantine's suggestion, just the protocol version) for each > worker, but the Kafka group coordinator will only begin using that new > protocol if every worker indicates that that protocol is supported when it > joins the group. So, if a cluster existed with workers that all supported > protocols 1 and 2 and indicated a preference for protocol 2, the cluster > would use protocol 2. However, if a new worker that only supported protocol > 1 joined the cluster, the entire cluster would be bumped back down to > protocol 1. If/when that worker leaves or indicates it supports protocol 2, > the cluster would bump back up to protocol 2. > > >What is the expected behavior when the feature has been enabled, and a > worker joins the > group while advertising an old protocol? > > I partially answered this above; the entire cluster would downgrade its > protocol version. The practical effect of this would be that request > signature verification by the leader would immediately cease. > > > I think it's reasonable to prevent single-worker downgrades, since this > is an attack vector. > Actually, this is fine. Groups are a protected resource in Kafka; if you > want to ensure this doesn't happen, just secure your Kafka broker and use > something like ACLs to make sure only authorized entities can join your > group. > > > But what happens if every worker in a cluster goes down, and at least > the first worker comes > back with an old protocol? This is a potential attack that requires access > to the workers, but could also be used by customers to intentionally > downgrade their cluster. > > This kind of attack actually doesn't require access to any of the Connect > workers; group coordination is managed by the Kafka cluster. So, even if > your entire Connect cluster crashes and then a malicious user tries to join > the group, as long as your Kafka broker has restricted access to the group > used by your Connect cluster, that attack would still fail. > > > You mentioned that this security improvement only adds security when > there > are a number of existing security changes in place, one of these being ACLs > on the Kafka broker. Do you plan to gate this feature roll-out on detecting > that those prerequisite features are enabled? > > I wasn't planning on it. With the pluggable nature of Kafka and Kafka > Connect authorization, I imagine it'd be pretty difficult to know if all of > the important resources (worker group, internal topics, and Kafka Connect > REST API are a few that come to mind) are actually secured. Additionally, > the presence of other attack vectors shouldn't be justification for opening > up a new one; it seems particularly dangers in the event that someone > accidentally misconfigures their Connect cluster and this endpoint ends up > being exposed even though the user believes it to be protected. > > > If not, I think this feature adds no additional security to unsecured > clusters, while still incurring > the overhead on signing requests. > > This seems like a bit of an overstatement. The changes wouldn't add no > additional security; I think restricting even accidental access to the > internal REST endpoint is a good thing since it's not part of the public > API and it's all but guaranteed that if someone makes a request to that > endpoint they're either confused or malicious. Adding another hurdle in the > way for malicious users is also beneficial, even if it's not a > comprehensive solution. > > > Do you know how significant the overhead will be, as a rough estimate? > > Not very significant. The overhead would only be incurred when followers > need to relay task configurations to the leader, which shouldn't happen > very frequently in any stable connector. If that is happening frequently, > the overhead for starting new tasks or stopping/reconfiguring existing > tasks would probably be orders of magnitude higher anyways. > > > Thanks again to Magesh, Konstantine, and Greg for your feedback. I'll be > updating the KIP to reflect changes mentioned here and to include answers > to some of the questions raised; looking forward to the next round of > comments. > > Cheers, > > Chris > > On Wed, Aug 28, 2019 at 9:02 AM Greg Harris <gr...@confluent.io> wrote: > >> Hey Chris, >> >> The KIP makes sense to me, and I think it's a very natural way to solve >> the >> issue at hand. I had two questions about the automatic rollout logic. >> >> Does this roll-out ratchet the protocol version irreversibly? What is the >> expected behavior when the feature has been enabled, and a worker joins >> the >> group while advertising an old protocol? I think it's reasonable to >> prevent >> single-worker downgrades, since this is an attack vector. But what happens >> if every worker in a cluster goes down, and at least the first worker >> comes >> back with an old protocol? This is a potential attack that requires access >> to the workers, but could also be used by customers to intentionally >> downgrade their cluster. >> >> You mentioned that this security improvement only adds security when there >> are a number of existing security changes in place, one of these being >> ACLs >> on the Kafka broker. Do you plan to gate this feature roll-out on >> detecting >> that those prerequisite features are enabled? If not, I think this feature >> adds no additional security to unsecured clusters, while still incurring >> the overhead on signing requests. Do you know how significant the overhead >> will be, as a rough estimate? >> >> Looks great! >> >> Thanks, >> Greg >> >> >> On Tue, Aug 27, 2019 at 8:38 PM Konstantine Karantasis < >> konstant...@confluent.io> wrote: >> >> > Thanks for the KIP Chris! >> > >> > This proposal will fill a gap in Kafka Connect deployments and will >> > strengthen their production readiness in even more diverse >> environments, in >> > which current workarounds are less practical. >> > >> > I've read the discussions regarding why the rebalancing protocol is used >> > here and your intention to follow the approach which was recently used >> in >> > order to elegantly support upgrades without requiring rolling restarts >> > makes sense. >> > >> > However, I'd like to revisit the proposed extension of the connect >> protocol >> > going a little bit more in detail. >> > Indeed, broadcasting the session key to the followers is necessary. But >> > adding it to the configs topic with a new key is compatible with >> previous >> > versions (the key will be ignored by workers that don't support this >> > protocol) and works since all workers will need to read this topic to >> the >> > end to remain in the group. >> > >> > Additionally, to your point about consensus, meaning how the workers all >> > know during the same generation that the "sessioned" version of the >> > protocol was chosen by the leader, it seems to me that an explicit >> upgrade >> > of the protocol on the wire, on its name and on the metadata that are >> sent >> > with the join request by each worker might not be required. What I >> believe >> > would suffice is merely an increment of the version of the protocol, >> > because all the other information remains the same (for example, the >> list >> > of assigned and revoked tasks, etc). This would allow us not to extend >> the >> > metadata even more with yet another JoinGroupRequest and could achieve >> what >> > you want in terms of an orchestrated upgrade. >> > >> > Without having exhaustively checked the various code paths, I feel that >> > such an approach would be less heavy-handed in terms of extending the >> > format of the connect protocol and would achieve a certain separation of >> > concerns between the information that is required for rebalancing and is >> > included in the protocol metadata and the information that is required >> for >> > securing the internal Connect REST endpoint. In theory, this method >> could >> > be used to even secure the eager version of the protocol, but I'd agree >> > that this out of the scope of the current proposal. >> > >> > All the leader worker would have to check is whether all the assignments >> > that it took were in "sessioned" version (possibly >> CONNECT_PROTOCOL_V2=2 if >> > you'd choose to go this route). >> > >> > Overall, this does not differ a lot from your current proposal, which I >> > think is already in the right direction. >> > Let me know what you think. >> > >> > Cheers, >> > Konstantine >> > >> > >> > On Tue, Aug 27, 2019 at 4:19 PM Magesh Nandakumar <mage...@confluent.io >> > >> > wrote: >> > >> > > Hi Chris, >> > > >> > > Thanks a lot for the KIP. This will certainly be a useful feature. I >> > would >> > > have preferred to use the topic approach as well but I also understand >> > your >> > > point of view about the operational complexity for upgrades. If not >> with >> > > this KIP, I would certainly want to go that route at some point in the >> > > future. >> > > >> > > As far as using the rebalance protocol goes, it would be great if you >> > could >> > > elaborate on what exactly would be the rebalance impact when a key >> > expires. >> > > I see that you have called out saying that there should be no >> significant >> > > impact but it will be great to explicitly state as what is to be >> > expected. >> > > I would prefer to not have any sorts of rebalancing when this happens >> > since >> > > the connector and task assignments should not change with this event. >> It >> > > will be useful to explain this a bit more. >> > > >> > > Thanks, >> > > Magesh >> > > >> > > On Wed, Aug 21, 2019 at 4:40 PM Chris Egerton <chr...@confluent.io> >> > wrote: >> > > >> > > > Hi all, >> > > > >> > > > I've made some tweaks to the KIP that I believe are improvements. >> More >> > > > detail can be found on the KIP page itself, but as a brief summary, >> the >> > > > three changes are: >> > > > >> > > > - The removal of the internal.request.verification property in >> favor of >> > > > modifying the default value for the connect.protocol property from >> > > > "compatible" to "sessioned" >> > > > - The renaming of some configurations to use better terminology >> (mostly >> > > > just "request" instead of "key" where appropriate) >> > > > - The addition of two new configurations that dictate how session >> keys >> > > are >> > > > to be generated >> > > > >> > > > Thanks Ryanne for the feedback so far, hope to hear from some more >> of >> > you >> > > > soon :) >> > > > >> > > > Cheers, >> > > > >> > > > Chris >> > > > >> > > > On Mon, Aug 19, 2019 at 12:02 PM Chris Egerton <chr...@confluent.io >> > >> > > > wrote: >> > > > >> > > > > Hi Ryanne, >> > > > > >> > > > > The reasoning for this is provided in the KIP: "There would be no >> > clear >> > > > > way to achieve consensus amongst the workers in a cluster on >> whether >> > to >> > > > > switch to this new behavior." To elaborate on this--it would be >> bad >> > if >> > > a >> > > > > follower worker began writing task configurations to the topic >> while >> > > the >> > > > > leader continued to only listen on the internal REST endpoint. >> It's >> > > > > necessary to ensure that every worker in a cluster supports this >> new >> > > > > behavior before switching it on. >> > > > > >> > > > > To be fair, it is technically possible to achieve consensus >> without >> > > using >> > > > > the group coordination protocol by adding new configurations and >> > using >> > > a >> > > > > multi-phase rolling upgrade, but the operational complexity of >> this >> > > > > approach would significantly complicate things for the default >> case >> > of >> > > > just >> > > > > wanting to bump your Connect cluster to the next version without >> > having >> > > > to >> > > > > alter your existing worker config files and with only a single >> > restart >> > > > per >> > > > > worker. The proposed approach provides a much simpler user >> experience >> > > for >> > > > > the most common use case and doesn't impose much additional >> > complexity >> > > > for >> > > > > other use cases. >> > > > > >> > > > > I've updated the KIP to expand on points from your last emails; >> let >> > me >> > > > > know what you think. >> > > > > >> > > > > Cheers, >> > > > > >> > > > > Chris >> > > > > >> > > > > On Thu, Aug 15, 2019 at 4:53 PM Ryanne Dolan < >> ryannedo...@gmail.com> >> > > > > wrote: >> > > > > >> > > > >> Thanks Chris, that makes sense. >> > > > >> >> > > > >> I know you have already considered this, but I'm not convinced we >> > > should >> > > > >> rule out using Kafka topics for this purpose. That would enable >> the >> > > same >> > > > >> level of security without introducing any new authentication or >> > > > >> authorization mechanisms (your keys). And as you say, you'd need >> to >> > > lock >> > > > >> down Connect's topics and groups anyway. >> > > > >> >> > > > >> Can you explain what you mean when you say using Kafka topics >> would >> > > > >> require >> > > > >> "reworking the group coordination protocol"? I don't see how >> these >> > are >> > > > >> related. Why would it matter if the workers sent Kafka messages >> vs >> > > POST >> > > > >> requests among themselves? >> > > > >> >> > > > >> Ryanne >> > > > >> >> > > > >> On Thu, Aug 15, 2019 at 3:57 PM Chris Egerton < >> chr...@confluent.io> >> > > > >> wrote: >> > > > >> >> > > > >> > Hi Ryanne, >> > > > >> > >> > > > >> > Yes, if the Connect group is left unsecured then that is a >> > potential >> > > > >> > vulnerability. However, in a truly secure Connect cluster, the >> > group >> > > > >> would >> > > > >> > need to be secured anyways to prevent attackers from joining >> the >> > > group >> > > > >> with >> > > > >> > the intent to either snoop on connector/task configurations or >> > bring >> > > > the >> > > > >> > cluster to a halt by spamming the group with membership >> requests >> > and >> > > > >> then >> > > > >> > not running the assigned connectors/tasks. Additionally, for a >> > > Connect >> > > > >> > cluster to be secure, access to internal topics (for configs, >> > > offsets, >> > > > >> and >> > > > >> > statuses) would also need to be restricted so that attackers >> could >> > > > not, >> > > > >> > e.g., write arbitrary connector/task configurations to the >> configs >> > > > >> topic. >> > > > >> > This is all currently possible in Kafka with the use of ACLs. >> > > > >> > >> > > > >> > I think the bottom line here is that there's a number of steps >> > that >> > > > >> need to >> > > > >> > be taken to effectively lock down a Connect cluster; the point >> of >> > > this >> > > > >> KIP >> > > > >> > is to close a loophole that exists even after all of those >> steps >> > are >> > > > >> taken, >> > > > >> > not to completely secure this one vulnerability even when no >> other >> > > > >> security >> > > > >> > measures are taken. >> > > > >> > >> > > > >> > Cheers, >> > > > >> > >> > > > >> > Chris >> > > > >> > >> > > > >> > On Wed, Aug 14, 2019 at 10:56 PM Ryanne Dolan < >> > > ryannedo...@gmail.com> >> > > > >> > wrote: >> > > > >> > >> > > > >> > > Chris, I don't understand how the rebalance protocol can be >> used >> > > to >> > > > >> give >> > > > >> > > out session tokens in a secure way. It seems that any >> attacker >> > > could >> > > > >> just >> > > > >> > > join the group and sign requests with the provided token. Am >> I >> > > > missing >> > > > >> > > something? >> > > > >> > > >> > > > >> > > Ryanne >> > > > >> > > >> > > > >> > > On Wed, Aug 14, 2019, 2:31 PM Chris Egerton < >> > chr...@confluent.io> >> > > > >> wrote: >> > > > >> > > >> > > > >> > > > The KIP page can be found at >> > > > >> > > > >> > > > >> > > > >> > > > >> > > >> > > > >> > >> > > > >> >> > > > >> > > >> > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-507%3A+Securing+Internal+Connect+REST+Endpoints >> > > > >> > > > , >> > > > >> > > > by the way. Apologies for neglecting to include it in my >> > initial >> > > > >> email! >> > > > >> > > > >> > > > >> > > > On Wed, Aug 14, 2019 at 12:29 PM Chris Egerton < >> > > > chr...@confluent.io >> > > > >> > >> > > > >> > > > wrote: >> > > > >> > > > >> > > > >> > > > > Hi all, >> > > > >> > > > > >> > > > >> > > > > I'd like to start discussion on a KIP to secure the >> internal >> > > > "POST >> > > > >> > > > > /connectors/<name>/tasks" endpoint for the Connect >> > framework. >> > > > The >> > > > >> > > > proposed >> > > > >> > > > > changes address a vulnerability in the framework in its >> > > current >> > > > >> state >> > > > >> > > > that >> > > > >> > > > > allows malicious users to write arbitrary task >> > configurations >> > > > for >> > > > >> > > > > connectors; it is vital that this issue be addressed in >> > order >> > > > for >> > > > >> any >> > > > >> > > > > Connect cluster to be secure. >> > > > >> > > > > >> > > > >> > > > > Looking forward to your thoughts, >> > > > >> > > > > >> > > > >> > > > > Chris >> > > > >> > > > > >> > > > >> > > > >> > > > >> > > >> > > > >> > >> > > > >> >> > > > > >> > > > >> > > >> > >> >