Hi Pulsar Community,

Here are the meeting notes from yesterday's community meeting. There were many relevant comments, so the notes are particularly long. Thanks to all who participated!
Disclaimer: If something is misattributed or misrepresented, please send a correction to this list.

Source google doc: https://docs.google.com/document/d/19dXkVXeU2q_nHmkG8zURjKnYlvD96TbKf5KjYyASsOE

Thanks,
Michael

2022/05/12, (8:30 AM PST)

- Attendees:
  - Matteo Merli
  - Lari Hotari
  - Dave Fisher
  - Aaron Williams
  - Michael Marshall
  - Andrey Yegorov
  - Massimiliano Mirelli
  - Nicolò Boschi
  - Enrico Olivelli

- Discussions
  - Enrico: discuss PIP 117. Matteo: can we pass a CLI flag? Enrico: that won't work for the old version, but we can make it work with an env var. They both agreed on that solution.
  - Matteo: asks about the JDK discussion from the mailing list.
    Enrico: we are trying to understand the impact on customers if we have a strict requirement for JDK 17. Pretty sure it is not the right time to force users to upgrade to 17. Jumping from 8 to 17 is a really big jump. A middle ground is to require 11.
    Matteo: JDK 11 doesn't give us any big source changes. There are some runtime improvements, though. Switching to 11 is a hassle for users because they have to upgrade from 8 to 11. OTOH, there are many new features from 8 to 17 that we can start to use. There are also a lot of runtime improvements from 11 to 17. Yes, there are still people using 8. There are also people that will delay adoption of Pulsar 2.11. We can give these users a 2.10 LTS, which brings back the other discussion on LTS and how to make that work.
    Dave: one question to ask is whether 2.10 is a feature-complete version. Will 2.11 have features we wish were in 2.10?
    Matteo: each version is supposed to be feature complete.
    Michael: a nuance might be transactions, which were not production ready until recently.
    Matteo: non-production-ready features would be a reason to upgrade.
    Dave: what about new users that might only be using JDK 8? Note that Oracle dropped commercial support recently.
    Matteo: thinks Java has taken good steps to ensure users can upgrade.
    Michael: we'll need to take care to make sure that cherry-picks are still easily applied. Refactoring could prove problematic.
    Matteo: we should make sure to update the contributor guide to document that we won't accept arbitrary refactoring to use JDK 17-specific language features. (An illustrative sketch of such a feature appears after these notes.)
    Dave: how can CI help us?
    Matteo: CI will fail if changes are cherry-picked to recent versions. When a committer merges to master, they should cherry-pick at that time. This way the release manager doesn't have as much work.
    Dave: how do we keep track of outstanding cherry-picks? It can definitely become unwieldy.
    Michael: local compilation could help to prevent pushing unsupported language features to maintenance branches.
    Matteo: yes, that's true.
    Lari: I usually run "mvn -Pcore-modules,-main -T 1C clean install -DskipTests -Dspotbugs.skip=true" since it's fairly quick to run. Also, flaky tests make it harder to trust the signals for a failed build on a maintenance branch.
    Michael: I think it's valuable that we're going to guide contributors to avoid code refactoring.
    Matteo: yes, but are we able to decide when we'll start accepting those refactorings? New features (net new code) will be fine for new language features because they won't get cherry-picked. Is there a time limit for when we can start using new language features?
    Dave: for now, don't do it, and we can review later.
    Lari: the same issue about refactoring applies to more than just JDK 17 features. I recently found PRs that make sense but cause issues because they have sprawling changes. High-level guidance would be very helpful.
    This raises the question of when we can do these kinds of refactorings. For example, the Managed Ledger class is a huge class and it'd be great to refactor, but we don't because it's more stable to leave it as is for now. In this example, how can we redesign it in a safe way? There are some thread safety issues that might be resolved with a larger change.
    Matteo: would it make sense to start fresh? There is an interface (possibly not well abstracted), but could we have two implementations?
    Lari: I've had that thought, but maybe it's a Pulsar 3.0-type change. Larger refactorings could be done at that time. Otherwise we're stuck with the same code base.
    Matteo: I'm saying we can use a different implementation because it is pluggable. We already do this for other classes. It'd probably be easier to start from scratch and mature a prototype. This allows users to opt into the new code. That gives contributors more freedom to move fast. (A sketch of this opt-in, second-implementation approach appears after these notes.)
    Lari: that's probably the approach we'd take. This approach won't work for something like the load manager, though, because the code doesn't have a good abstraction.
    Matteo: yes, that is something we've been looking into. It'd probably need a new abstraction. We are in the phase of getting people to understand what is working and why it is the way it currently is, so that they can come up with a plan to solve those issues.
  - Action items from the above discussion block:
    -- Define a policy to make 2.10 an LTS release
    -- Have guidelines for contributors and committers on expectations for new code, bug fixes, and code refactoring
  - Lari: let's discuss ways to improve the load manager.
    Matteo: we could have a small working group on the load manager.
    Lari: that would be very useful. We haven't done any concrete design yet, but one detail we've discussed is how to decrease downtime of topics so that the handover is smooth. Streams should be migrated without disruption.
    Matteo: there is always a disruption.
    Lari: it could be better, though.
    Matteo: is it user perceptible or not? If the disruption is a 10 to 20 ms bump it's okay, but if it is 3 seconds, it's a problem.
    Michael: there are protocol improvements that could be made, like letting the producer/consumer know the new topic host.
    Matteo: there are many problems with the current load manager. Unload is the only operational tool. Lookup is problematic and could be more efficient. Instead of looking up every partition individually, you could look up the mapping for every partition in one call. Essentially, we could batch operations. One historical reason for the design is to hide the bundling logic and keep it transparent to the client. We could probably stop hiding the bundling logic from the client. A namespace-wide lookup could use a single call to reduce the herd of requests to brokers when there is a failover operation. (A sketch of this batched lookup idea appears after these notes.) The other part affecting failover latency: topics are assigned bundle by bundle, not one by one; the goal of bundling is to avoid a bottleneck. However, that means we have to close all topics in a bundle before the bundle can be moved. ZooKeeper batching is very helpful for speeding up these operations. There are some metrics for shutting down a bundle. Finally, there is the client interaction, which can hit backoffs. These are all related to the load manager. Then there are the questions of how bundles are assigned, how we can have more transparency, and how an operator can have more control.
    Michael: I'm very excited by the idea of a load manager working group.
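
Illustrative sketch for the JDK 17 refactoring discussion above. This is a hypothetical snippet, not Pulsar code; it only shows the kind of JDK 17-era language features (records and pattern matching for instanceof, both standardized in JDK 16) that compile under JDK 17 but not under the JDK 8/11 toolchains of older release branches, which is why refactoring existing code to use them complicates cherry-picks.

    // Hypothetical example, not Pulsar code. Records (JDK 16+) and pattern
    // matching for instanceof (JDK 16+) compile on JDK 17, but the same change
    // fails to compile if cherry-picked to a branch built with JDK 8 or 11.
    public class Jdk17FeatureExample {

        // A record replaces the constructor/getter/equals/hashCode boilerplate
        // an older branch would need as a hand-written class.
        record TopicBacklog(String topic, long backlogSize) {}

        static String describe(Object stats) {
            // Pattern matching for instanceof binds "b" directly in the condition.
            if (stats instanceof TopicBacklog b && b.backlogSize() > 0) {
                return "backlog of " + b.backlogSize() + " on " + b.topic();
            }
            return "no backlog";
        }

        public static void main(String[] args) {
            System.out.println(describe(new TopicBacklog("persistent://public/default/orders", 42)));
        }
    }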
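
Illustrative sketch for the managed ledger discussion above. The names below are invented (this is not the actual ManagedLedger interface); it is only a minimal example of the approach Matteo described: keep the existing interface, develop a second implementation from scratch, and let users opt into it while the stable path stays untouched.

    // Hypothetical sketch, not the real Pulsar API: one interface, two
    // implementations, with a config-style switch so the new code is opt-in.
    public interface LedgerStore {
        void addEntry(byte[] data);
    }

    // Existing, battle-tested implementation, left as-is for stability.
    class StableLedgerStore implements LedgerStore {
        @Override
        public void addEntry(byte[] data) { /* current code path */ }
    }

    // Redesigned implementation developed from scratch; contributors can move
    // fast here without touching the default path.
    class ExperimentalLedgerStore implements LedgerStore {
        @Override
        public void addEntry(byte[] data) { /* new code path */ }
    }

    class LedgerStoreFactory {
        // A hypothetical configuration flag selects the implementation.
        static LedgerStore create(boolean useExperimentalStore) {
            return useExperimentalStore ? new ExperimentalLedgerStore()
                                        : new StableLedgerStore();
        }
    }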
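
Illustrative sketch for the lookup batching idea above. The interface and method names here are invented and are not the existing Pulsar client API; the sketch only contrasts one lookup call per partition with a single batched call that returns the broker mapping for all partitions, the kind of batching that could reduce the herd of requests during a failover.

    // Hypothetical sketch; "LookupService", "lookupBroker", and
    // "lookupAllPartitions" are invented names, not the actual client API.
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.CompletableFuture;

    interface LookupService {
        // Current style: one lookup RPC per partition.
        CompletableFuture<String> lookupBroker(String partitionName);

        // Batched style: one RPC returns the partition -> broker mapping for
        // the whole topic (or namespace), avoiding a herd of per-partition calls.
        CompletableFuture<Map<String, String>> lookupAllPartitions(String topicName);
    }

    class LookupExample {
        // N partitions -> N round trips.
        static void perPartition(LookupService svc, List<String> partitions) {
            for (String partition : partitions) {
                svc.lookupBroker(partition)
                   .thenAccept(broker -> System.out.println(partition + " -> " + broker));
            }
        }

        // N partitions -> 1 round trip.
        static void batched(LookupService svc, String topic) {
            svc.lookupAllPartitions(topic)
               .thenAccept(mapping -> mapping.forEach(
                       (partition, broker) -> System.out.println(partition + " -> " + broker)));
        }
    }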