Hey Colin,

For context on this specific issue, we have opened a JIRA to consider thread safety in the future. Other options are better documentation or making it thread-local. I don't want to detract too much from this conversation, but I did want to say there is a JIRA to discuss the buffer-specific problem.
https://issues.apache.org/jira/browse/KAFKA-15674

Thanks all,
Justine

On Tue, Oct 24, 2023 at 12:16 PM Colin McCabe <cmcc...@apache.org> wrote:

> Hi Divij,
>
> I've worked on several projects that had a "debug mode." It was something
> that a lot of old-fashioned C and C++ projects would do. Usually
> implemented through an ASSERT macro or similar that was defined away when
> in "production mode"
>
> I didn't like this back then, and still don't like it. If the assertion
> isn't expensive, you should just do it all the time. If the assertion is
> expensive, then you should do it in a test rather than when running.
> Because an expensive operation will change the timings of a distributed
> system, and make your "debug mode server" perform quite differently than
> the "real production server."
>
> Another issue is that, based on my experience, people often did stuff in
> the assert blocks that would change other things in the system. Since code
> in C/C++ (and also Java) can have side effects, it's easy to accidentally
> change things with your verification code.
>
> It sounds like concretely you hit a race condition with the
> non-thread-safe buffer pool code. It would be good to think about how we
> could avoid this in the future, but I don't think "debug mode" is the
> answer. Instead, it might be better to take another look at how we're doing
> buffer pooling to see if we can simplify. Why are we passing a
> non-thread-safe object between threads in the first place? Should this be
> documented better, or better yet, avoided? Why not use a thread-local
> instead to make this all so much simpler? etc.
>
> best,
> Colin
>
> On Tue, Oct 24, 2023, at 02:32, Divij Vaidya wrote:
> > Hey folks
> >
> > We recently came across a bug [1] which was very hard to detect during
> > testing and easy to introduce during development. I would like to kick
> > start a discussion on potential ways which could avoid this category of
> > bugs in Apache Kafka.
> >
> > I think we might want to start working towards a "debug" mode in the broker
> > which will enable assertions for different invariants in Kafka. Invariants
> > could be derived from formal verification that Jack [2] and others have
> > shared with the community earlier AND from tribal knowledge in the
> > community such as network threads should not perform any storage IO, files
> > should not fsync in critical product path, metric gauges should not acquire
> > a lock etc. The release qualification process (system tests + integration
> > tests) will run the broker in "debug" mode and will validate these
> > assertions while testing the system in different scenarios. The inspiration
> > for this idea is derived from Marc Brooker's post at
> > https://brooker.co.za/blog/2023/07/28/ds-testing.html
> >
> > Your thoughts on this topic are welcome! Also, please feel free to take
> > this idea forward and draft a KIP for a more formal discussion.
> >
> > [1] https://issues.apache.org/jira/browse/KAFKA-15653
> > [2] https://lists.apache.org/thread/pfrkk0yb394l5qp8h5mv9vwthx15084j
> >
> > --
> > Divij Vaidya