Hey Chris, The KIP makes sense to me, and I think it's a very natural way to solve the issue at hand. I had two questions about the automatic rollout logic.
Does this roll-out ratchet the protocol version irreversibly? What is the expected behavior when the feature has been enabled, and a worker joins the group while advertising an old protocol? I think it's reasonable to prevent single-worker downgrades, since this is an attack vector. But what happens if every worker in a cluster goes down, and at least the first worker comes back with an old protocol? This is a potential attack that requires access to the workers, but could also be used by customers to intentionally downgrade their cluster. You mentioned that this security improvement only adds security when there are a number of existing security changes in place, one of these being ACLs on the Kafka broker. Do you plan to gate this feature roll-out on detecting that those prerequisite features are enabled? If not, I think this feature adds no additional security to unsecured clusters, while still incurring the overhead on signing requests. Do you know how significant the overhead will be, as a rough estimate? Looks great! Thanks, Greg On Tue, Aug 27, 2019 at 8:38 PM Konstantine Karantasis < konstant...@confluent.io> wrote: > Thanks for the KIP Chris! > > This proposal will fill a gap in Kafka Connect deployments and will > strengthen their production readiness in even more diverse environments, in > which current workarounds are less practical. > > I've read the discussions regarding why the rebalancing protocol is used > here and your intention to follow the approach which was recently used in > order to elegantly support upgrades without requiring rolling restarts > makes sense. > > However, I'd like to revisit the proposed extension of the connect protocol > going a little bit more in detail. > Indeed, broadcasting the session key to the followers is necessary. But > adding it to the configs topic with a new key is compatible with previous > versions (the key will be ignored by workers that don't support this > protocol) and works since all workers will need to read this topic to the > end to remain in the group. > > Additionally, to your point about consensus, meaning how the workers all > know during the same generation that the "sessioned" version of the > protocol was chosen by the leader, it seems to me that an explicit upgrade > of the protocol on the wire, on its name and on the metadata that are sent > with the join request by each worker might not be required. What I believe > would suffice is merely an increment of the version of the protocol, > because all the other information remains the same (for example, the list > of assigned and revoked tasks, etc). This would allow us not to extend the > metadata even more with yet another JoinGroupRequest and could achieve what > you want in terms of an orchestrated upgrade. > > Without having exhaustively checked the various code paths, I feel that > such an approach would be less heavy-handed in terms of extending the > format of the connect protocol and would achieve a certain separation of > concerns between the information that is required for rebalancing and is > included in the protocol metadata and the information that is required for > securing the internal Connect REST endpoint. In theory, this method could > be used to even secure the eager version of the protocol, but I'd agree > that this out of the scope of the current proposal. > > All the leader worker would have to check is whether all the assignments > that it took were in "sessioned" version (possibly CONNECT_PROTOCOL_V2=2 if > you'd choose to go this route). > > Overall, this does not differ a lot from your current proposal, which I > think is already in the right direction. > Let me know what you think. > > Cheers, > Konstantine > > > On Tue, Aug 27, 2019 at 4:19 PM Magesh Nandakumar <mage...@confluent.io> > wrote: > > > Hi Chris, > > > > Thanks a lot for the KIP. This will certainly be a useful feature. I > would > > have preferred to use the topic approach as well but I also understand > your > > point of view about the operational complexity for upgrades. If not with > > this KIP, I would certainly want to go that route at some point in the > > future. > > > > As far as using the rebalance protocol goes, it would be great if you > could > > elaborate on what exactly would be the rebalance impact when a key > expires. > > I see that you have called out saying that there should be no significant > > impact but it will be great to explicitly state as what is to be > expected. > > I would prefer to not have any sorts of rebalancing when this happens > since > > the connector and task assignments should not change with this event. It > > will be useful to explain this a bit more. > > > > Thanks, > > Magesh > > > > On Wed, Aug 21, 2019 at 4:40 PM Chris Egerton <chr...@confluent.io> > wrote: > > > > > Hi all, > > > > > > I've made some tweaks to the KIP that I believe are improvements. More > > > detail can be found on the KIP page itself, but as a brief summary, the > > > three changes are: > > > > > > - The removal of the internal.request.verification property in favor of > > > modifying the default value for the connect.protocol property from > > > "compatible" to "sessioned" > > > - The renaming of some configurations to use better terminology (mostly > > > just "request" instead of "key" where appropriate) > > > - The addition of two new configurations that dictate how session keys > > are > > > to be generated > > > > > > Thanks Ryanne for the feedback so far, hope to hear from some more of > you > > > soon :) > > > > > > Cheers, > > > > > > Chris > > > > > > On Mon, Aug 19, 2019 at 12:02 PM Chris Egerton <chr...@confluent.io> > > > wrote: > > > > > > > Hi Ryanne, > > > > > > > > The reasoning for this is provided in the KIP: "There would be no > clear > > > > way to achieve consensus amongst the workers in a cluster on whether > to > > > > switch to this new behavior." To elaborate on this--it would be bad > if > > a > > > > follower worker began writing task configurations to the topic while > > the > > > > leader continued to only listen on the internal REST endpoint. It's > > > > necessary to ensure that every worker in a cluster supports this new > > > > behavior before switching it on. > > > > > > > > To be fair, it is technically possible to achieve consensus without > > using > > > > the group coordination protocol by adding new configurations and > using > > a > > > > multi-phase rolling upgrade, but the operational complexity of this > > > > approach would significantly complicate things for the default case > of > > > just > > > > wanting to bump your Connect cluster to the next version without > having > > > to > > > > alter your existing worker config files and with only a single > restart > > > per > > > > worker. The proposed approach provides a much simpler user experience > > for > > > > the most common use case and doesn't impose much additional > complexity > > > for > > > > other use cases. > > > > > > > > I've updated the KIP to expand on points from your last emails; let > me > > > > know what you think. > > > > > > > > Cheers, > > > > > > > > Chris > > > > > > > > On Thu, Aug 15, 2019 at 4:53 PM Ryanne Dolan <ryannedo...@gmail.com> > > > > wrote: > > > > > > > >> Thanks Chris, that makes sense. > > > >> > > > >> I know you have already considered this, but I'm not convinced we > > should > > > >> rule out using Kafka topics for this purpose. That would enable the > > same > > > >> level of security without introducing any new authentication or > > > >> authorization mechanisms (your keys). And as you say, you'd need to > > lock > > > >> down Connect's topics and groups anyway. > > > >> > > > >> Can you explain what you mean when you say using Kafka topics would > > > >> require > > > >> "reworking the group coordination protocol"? I don't see how these > are > > > >> related. Why would it matter if the workers sent Kafka messages vs > > POST > > > >> requests among themselves? > > > >> > > > >> Ryanne > > > >> > > > >> On Thu, Aug 15, 2019 at 3:57 PM Chris Egerton <chr...@confluent.io> > > > >> wrote: > > > >> > > > >> > Hi Ryanne, > > > >> > > > > >> > Yes, if the Connect group is left unsecured then that is a > potential > > > >> > vulnerability. However, in a truly secure Connect cluster, the > group > > > >> would > > > >> > need to be secured anyways to prevent attackers from joining the > > group > > > >> with > > > >> > the intent to either snoop on connector/task configurations or > bring > > > the > > > >> > cluster to a halt by spamming the group with membership requests > and > > > >> then > > > >> > not running the assigned connectors/tasks. Additionally, for a > > Connect > > > >> > cluster to be secure, access to internal topics (for configs, > > offsets, > > > >> and > > > >> > statuses) would also need to be restricted so that attackers could > > > not, > > > >> > e.g., write arbitrary connector/task configurations to the configs > > > >> topic. > > > >> > This is all currently possible in Kafka with the use of ACLs. > > > >> > > > > >> > I think the bottom line here is that there's a number of steps > that > > > >> need to > > > >> > be taken to effectively lock down a Connect cluster; the point of > > this > > > >> KIP > > > >> > is to close a loophole that exists even after all of those steps > are > > > >> taken, > > > >> > not to completely secure this one vulnerability even when no other > > > >> security > > > >> > measures are taken. > > > >> > > > > >> > Cheers, > > > >> > > > > >> > Chris > > > >> > > > > >> > On Wed, Aug 14, 2019 at 10:56 PM Ryanne Dolan < > > ryannedo...@gmail.com> > > > >> > wrote: > > > >> > > > > >> > > Chris, I don't understand how the rebalance protocol can be used > > to > > > >> give > > > >> > > out session tokens in a secure way. It seems that any attacker > > could > > > >> just > > > >> > > join the group and sign requests with the provided token. Am I > > > missing > > > >> > > something? > > > >> > > > > > >> > > Ryanne > > > >> > > > > > >> > > On Wed, Aug 14, 2019, 2:31 PM Chris Egerton < > chr...@confluent.io> > > > >> wrote: > > > >> > > > > > >> > > > The KIP page can be found at > > > >> > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-507%3A+Securing+Internal+Connect+REST+Endpoints > > > >> > > > , > > > >> > > > by the way. Apologies for neglecting to include it in my > initial > > > >> email! > > > >> > > > > > > >> > > > On Wed, Aug 14, 2019 at 12:29 PM Chris Egerton < > > > chr...@confluent.io > > > >> > > > > >> > > > wrote: > > > >> > > > > > > >> > > > > Hi all, > > > >> > > > > > > > >> > > > > I'd like to start discussion on a KIP to secure the internal > > > "POST > > > >> > > > > /connectors/<name>/tasks" endpoint for the Connect > framework. > > > The > > > >> > > > proposed > > > >> > > > > changes address a vulnerability in the framework in its > > current > > > >> state > > > >> > > > that > > > >> > > > > allows malicious users to write arbitrary task > configurations > > > for > > > >> > > > > connectors; it is vital that this issue be addressed in > order > > > for > > > >> any > > > >> > > > > Connect cluster to be secure. > > > >> > > > > > > > >> > > > > Looking forward to your thoughts, > > > >> > > > > > > > >> > > > > Chris > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > > > > > > > >