mjsax commented on PR #18106: URL: https://github.com/apache/kafka/pull/18106#issuecomment-2646612800
No worries. Just highlighting this so we can (maybe?) do better going forward. > I am sorry to hear that people had to spend multiple hours trying to fix this. and > Separately, do you have any suggestions on how could we have reduced the time to mitigation from hours to minutes here? Where was the most time spent on? The problem was not the fix, but to actually identify the commit that introduced the system test failure. We had to basically do a "binary search" over the commit history, and re-run the system test over and over again until we found the commit. After we had the commit, it was rather quick to understand the root cause. > I will look at why assumption 1 above was not correct in this scenario and improve the gap in my thought process. Well, this PR was ok (even if not ideal) for the integration test, but we don't run system tests on regular PRs, so it was easy to miss. No mistake was made on this front IMHO. > Also, I would explore how can we make it easier to for non-experts to change KS code as well, since, I think it would be beneficial for the community if we are able to scale without having to rely on experts for changes to a component. Overall I agree to the sentiment, but based on past experience, the Kafka code base is very complex and it's difficult to no rely on experts IMHO. -- Would it have slowed down getting this PR merged significantly (especially that it's a very small change), if you would have asked an expert to take a look? -- In the end, it's always your personal judgment I guess, and I don't think you did anything wrong. I can just share my personal approach, that I would never merge any code changing stuff that is not Kafka Streams, w/o an "component expert" to sign it off the non-KS changes (but maybe that's just me). To be frank: for this PR in particular, even if you would have got a review from an expert, it might have been missed easily -- I would not want to claim, that I would have seen the issue with the system test, if I would have reviewed this PR (contrary, I would say I would have approved this PR with 90+% probability, w/o catching the issue for the system test...) However, if we were aware of this PR, and we see the "system smoke test" which used the same code failing later, we could have connected the dots w/o the need to "find for the commit" that broke it. Just want to share my POV. Not sure if we can do better or not. Just want to trigger some discussion, in case somebody has a good idea. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org