Hi Pulsar Community,

Here are the meeting notes from yesterday's community meeting. There were many relevant comments, so the notes are particularly long. Thanks to all who participated!
Disclaimer: If something is misattributed or misrepresented, please send a correction to this list.

Source google doc: https://docs.google.com/document/d/19dXkVXeU2q_nHmkG8zURjKnYlvD96TbKf5KjYyASsOE

Thanks,
Michael

2022/05/12, (8:30 AM PST)

- Attendees:
  - Matteo Merli
  - Lari Hotari
  - Dave Fisher
  - Aaron Williams
  - Michael Marshall
  - Andrey Yegorov
  - Massimiliano Mirelli
  - Nicolò Boschi
  - Enrico Olivelli

- Discussions
  - Enrico: discuss PIP 117. Matteo: can we pass a CLI flag? Enrico: that won't work for the old version, but we can make it work with an env var. They both agreed on that solution.
  - Matteo: asks about the JDK discussion from the mailing list.
    Enrico: we are trying to understand the impact on customers if we have a strict requirement for JDK 17. Pretty sure it is not the right time to force users to upgrade to 17. Jumping from 8 to 17 is a really big jump. A middle ground is to require 11.
    Matteo: JDK 11 doesn't give us any big source changes. There are some runtime improvements, though. Switching to 11 is a hassle for users because they have to upgrade from 8 to 11. OTOH, there are many new features from 8 to 17 that we can start to use. There are also a lot of runtime improvements from 11 to 17. Yes, there are still people using 8. There are also people that will delay adoption of Pulsar 2.11. We can give these users a 2.10 LTS, which brings back the other discussion on LTS and how to make that work.
    Dave: one question to ask is whether 2.10 is a feature-complete version. Will 2.11 have features we wish were in 2.10?
    Matteo: each version is supposed to be feature complete.
    Michael: a nuance might be transactions, which were not production ready until recently.
    Matteo: non-production-ready features would be a reason to upgrade.
    Dave: what about new users that might only be using JDK 8? Note that Oracle dropped commercial support recently.
    Matteo: thinks Java has taken good steps to ensure users can upgrade.
    Michael: we'll need to take care to make sure that cherry-picks are still easily applied. Refactoring could prove problematic.
    Matteo: we should make sure to update the contributor guide to document that we won't accept arbitrary refactoring to use JDK 17-specific language features. (An illustrative sketch of such a feature appears after these notes.)
    Dave: how can CI help us?
    Matteo: CI will fail if changes are cherry-picked to recent versions. When a committer merges to master, they should cherry-pick at that time. This way the release manager doesn't have as much work.
    Dave: how do we keep track of outstanding cherry-picks? It can definitely become unwieldy.
    Michael: local compilation could help to prevent pushing unsupported language features to maintenance branches.
    Matteo: yes, that's true.
    Lari: I usually run "mvn -Pcore-modules,-main -T 1C clean install -DskipTests -Dspotbugs.skip=true" since it's fairly quick to run. Also, flaky tests make it harder to trust the signals for a failed build on a maintenance branch.
    Michael: I think it's valuable that we're going to guide contributors to avoid code refactoring.
    Matteo: yes, but are we able to decide when we'll start accepting those refactorings? New features (net new code) will be fine for new language features because they won't get cherry-picked. Is there a time limit for when we can start using new language features?
    Dave: for now, don't do it, and we can review later.
    Lari: the same issue about refactoring applies to more than just JDK 17 features. I recently found PRs that make sense but cause issues because they have sprawling changes. High-level guidance would be very helpful.
    This raises the question of when we can do these kinds of refactorings. For example, the Managed Ledger class is a huge class and it'd be great to refactor, but we don't because it's more stable to leave it as is for now. In this example, how can we redesign it in a safe way? There are some thread safety issues that might be resolved with a larger change.
    Matteo: would it make sense to start fresh? There is an interface (possibly not well abstracted), but could we have two implementations?
    Lari: I've had that thought, but maybe it's a Pulsar 3.0-type change. Larger refactorings could be done at that time. Otherwise we're stuck with the same code base.
    Matteo: I'm saying we can use a different implementation because it is pluggable. We already do this for other classes. It'd probably be easier to start from scratch and mature a prototype. This allows users to opt into the new code. That gives contributors more freedom to move fast. (A sketch of this opt-in, second-implementation approach appears after these notes.)
    Lari: that's probably the approach we'd take. This approach won't work for something like the load manager, though, because the code doesn't have a good abstraction.
    Matteo: yes, that is something we've been looking into. It'd probably need a new abstraction. We are in the phase of getting people to understand what is working and why it is the way it currently is, so that they can come up with a plan to solve those issues.
  - Action items from the above discussion block:
    -- Define a policy to make 2.10 an LTS release
    -- Have guidelines for contributors and committers on expectations for new code, bug fixes, and code refactoring
  - Lari: let's discuss ways to improve the load manager.
    Matteo: we could have a small working group on the load manager.
    Lari: that would be very useful. We haven't done any concrete design yet, but one detail we've discussed is how to decrease downtime of topics so that the handover is smooth. Streams should be migrated without disruption.
    Matteo: there is always a disruption.
    Lari: it could be better, though.
    Matteo: is it user perceptible or not? If the disruption is a 10 to 20 ms bump it's okay, but if it is 3 seconds, it's a problem.
    Michael: there are protocol improvements that could be made, like letting the producer/consumer know the new topic host.
    Matteo: there are many problems with the current load manager. Unload is the only operational tool. Lookup is problematic and could be more efficient. Instead of looking up every partition individually, you could look up the mapping for every partition in one call. Essentially, we could batch operations. One historical reason for the design is to hide the bundling logic and keep it transparent to the client. We could probably stop hiding the bundling logic from the client. A namespace-wide lookup could use a single call to reduce the herd of requests to brokers when there is a failover operation. (A sketch of this batched lookup idea appears after these notes.) The other part affecting failover latency: topics are assigned bundle by bundle, not one by one; the goal of bundling is to avoid a bottleneck. However, that means we have to close all topics in a bundle before the bundle can be moved. ZooKeeper batching is very helpful for speeding up these operations. There are some metrics for shutting down a bundle. Finally, there is the client interaction, which can hit backoffs. These are all related to the load manager. Then there are the questions of how bundles are assigned, how we can have more transparency, and how an operator can have more control.
    Michael: I'm very excited by the idea of a load manager working group.
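
Illustrative sketch for the JDK 17 refactoring discussion above. This is a hypothetical snippet, not Pulsar code; it only shows the kind of JDK 17-era language features (records and pattern matching for instanceof, both standardized in JDK 16) that compile under JDK 17 but not under the JDK 8/11 toolchains of older release branches, which is why refactoring existing code to use them complicates cherry-picks.

    // Hypothetical example, not Pulsar code. Records (JDK 16+) and pattern
    // matching for instanceof (JDK 16+) compile on JDK 17, but the same change
    // fails to compile if cherry-picked to a branch built with JDK 8 or 11.
    public class Jdk17FeatureExample {

        // A record replaces the constructor/getter/equals/hashCode boilerplate
        // an older branch would need as a hand-written class.
        record TopicBacklog(String topic, long backlogSize) {}

        static String describe(Object stats) {
            // Pattern matching for instanceof binds "b" directly in the condition.
            if (stats instanceof TopicBacklog b && b.backlogSize() > 0) {
                return "backlog of " + b.backlogSize() + " on " + b.topic();
            }
            return "no backlog";
        }

        public static void main(String[] args) {
            System.out.println(describe(new TopicBacklog("persistent://public/default/orders", 42)));
        }
    }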
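
Illustrative sketch for the managed ledger discussion above. The names below are invented (this is not the actual ManagedLedger interface); it is only a minimal example of the approach Matteo described: keep the existing interface, develop a second implementation from scratch, and let users opt into it while the stable path stays untouched.

    // Hypothetical sketch, not the real Pulsar API: one interface, two
    // implementations, with a config-style switch so the new code is opt-in.
    public interface LedgerStore {
        void addEntry(byte[] data);
    }

    // Existing, battle-tested implementation, left as-is for stability.
    class StableLedgerStore implements LedgerStore {
        @Override
        public void addEntry(byte[] data) { /* current code path */ }
    }

    // Redesigned implementation developed from scratch; contributors can move
    // fast here without touching the default path.
    class ExperimentalLedgerStore implements LedgerStore {
        @Override
        public void addEntry(byte[] data) { /* new code path */ }
    }

    class LedgerStoreFactory {
        // A hypothetical configuration flag selects the implementation.
        static LedgerStore create(boolean useExperimentalStore) {
            return useExperimentalStore ? new ExperimentalLedgerStore()
                                        : new StableLedgerStore();
        }
    }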
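
Illustrative sketch for the lookup batching idea above. The interface and method names here are invented and are not the existing Pulsar client API; the sketch only contrasts one lookup call per partition with a single batched call that returns the broker mapping for all partitions, the kind of batching that could reduce the herd of requests during a failover.

    // Hypothetical sketch; "LookupService", "lookupBroker", and
    // "lookupAllPartitions" are invented names, not the actual client API.
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.CompletableFuture;

    interface LookupService {
        // Current style: one lookup RPC per partition.
        CompletableFuture<String> lookupBroker(String partitionName);

        // Batched style: one RPC returns the partition -> broker mapping for
        // the whole topic (or namespace), avoiding a herd of per-partition calls.
        CompletableFuture<Map<String, String>> lookupAllPartitions(String topicName);
    }

    class LookupExample {
        // N partitions -> N round trips.
        static void perPartition(LookupService svc, List<String> partitions) {
            for (String partition : partitions) {
                svc.lookupBroker(partition)
                   .thenAccept(broker -> System.out.println(partition + " -> " + broker));
            }
        }

        // N partitions -> 1 round trip.
        static void batched(LookupService svc, String topic) {
            svc.lookupAllPartitions(topic)
               .thenAccept(mapping -> mapping.forEach(
                       (partition, broker) -> System.out.println(partition + " -> " + broker)));
        }
    }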