mjsax commented on PR #18106:
URL: https://github.com/apache/kafka/pull/18106#issuecomment-2646612800

   No worries. Just highlighting this so we can (maybe?) do better going 
forward.
   
   > I am sorry to hear that people had to spend multiple hours trying to fix 
this.
   
   and 
   
   > Separately, do you have any suggestions on how could we have reduced the 
time to mitigation from hours to minutes here? Where was the most time spent on?
   
   The problem was not the fix, but to actually identify the commit that 
introduced the system test failure. We had to basically do a "binary search" 
over the commit history, and re-run the system test over and over again until 
we found the commit. After we had the commit, it was rather quick to understand 
the root cause.
   
   > I will look at why assumption 1 above was not correct in this scenario and 
improve the gap in my thought process.
   
   Well, this PR was ok (even if not ideal) for the integration test, but we 
don't run system tests on regular PRs, so it was easy to miss. No mistake was 
made on this front IMHO.
   
   > Also, I would explore how can we make it easier to for non-experts to 
change KS code as well, since, I think it would be beneficial for the community 
if we are able to scale without having to rely on experts for changes to a 
component.
   
   Overall I agree to the sentiment, but based on past experience, the Kafka 
code base is very complex and it's difficult to no rely on experts IMHO. -- 
Would it have slowed down getting this PR merged significantly (especially that 
it's a very small change), if you would have asked an expert to take a look? -- 
In the end, it's always your personal judgment I guess, and I don't think you 
did anything wrong. I can just share my personal approach, that I would never 
merge any code changing stuff that is not Kafka Streams, w/o an "component 
expert" to sign it off the non-KS changes (but maybe that's just me).
   
   To be frank: for this PR in particular, even if you would have got a review 
from an expert, it might have been missed easily -- I would not want to claim, 
that I would have seen the issue with the system test, if I would have reviewed 
this PR (contrary, I would say I would have approved this PR with 90+% 
probability, w/o catching the issue for the system test...) However, if we were 
aware of this PR, and we see the "system smoke test" which used the same code 
failing later, we could have connected the dots w/o the need to "find for the 
commit" that broke it.
   
   Just want to share my POV. Not sure if we can do better or not. Just want to 
trigger some discussion, in case somebody has a good idea.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to