Following up on this, the PR to fix the Java Client [0] is still open and needs reviews. Please take a look, if you're able.

[0] - https://github.com/apache/pulsar/pull/12779

Thanks!
Michael

On Fri, Nov 12, 2021 at 5:35 PM Michael Marshall <mikemars...@gmail.com> wrote:
>
> Hi Pulsar Community,
>
> I discovered a race condition in Pulsar’s Java Client ProducerImpl
> that can lead to messages persisted out-of-order for a single producer
> sending to a non-partitioned topic. I can reproduce this issue, and I
> verified the order by adding sequence ids to the message payload
> before calling `producer.send`. I opened a PR to fix the race [0] and
> another to improve the broker’s behavior [1].
>
> At a high level, the ProducerImpl can get into a corrupt state if it
> switches connections too quickly. In this corrupt state, the producer
> can send messages before, during, and after the producer is registered
> to the broker. Because the broker ignores messages until a producer is
> created for the ServerCnx, some of the early messages are ignored and,
> once the producer is created, some later ones are persisted.
>
> In PR [1], I propose that when a broker gets an unexpected message
> (Send command), it should close the connection to protect against
> clients that are not following the protocol instead of simply ignoring
> unexpected messages. The protocol already states that clients are to
> register producers and then start sending messages [2]. It does not
> state what happens if a client does not follow this part of the
> protocol.
>
> One tradeoff for this implementation is that when the broker initiates
> closing a producer, there is a chance that the whole connection will
> get closed if the producer has messages in flight. I think this is a
> reasonable tradeoff to ensure that clients not following the protocol
> are not able to persist messages out-of-order.
>
> From my perspective, this is the simplest solution that will ensure
> message order is preserved. Alternatively, we could come up with logic
> to try to handle messages sent to "recently" closed producers, but
> that would greatly increase the complexity for this edge case. Note
> that it is not sufficient to reply to each message with a SendError
> because the producer may have already sent later messages and those
> could be persisted if the producer is concurrently being created. Note
> also that when the Java Client producer receives a generic SendError,
> it reacts by closing the connection in most cases.
>
> I include more detail in each of the PRs. I look forward to your feedback.
>
> Thanks,
> Michael
>
> [0] - https://github.com/apache/pulsar/pull/12779
> [1] - https://github.com/apache/pulsar/pull/12780
> [2] - https://pulsar.apache.org/docs/en/develop-binary-protocol/#producer
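
For anyone skimming the thread, here is a rough, self-contained toy model of the broker-side behavior discussed above. It is only a sketch to illustrate the idea, not the code in either PR: `BrokerConnectionSketch`, `handleProducer`, `handleSend`, and the `closeOnUnexpectedSend` flag are all hypothetical names, and the real ServerCnx is considerably more involved. The model contrasts ignoring a Send that arrives before the producer is registered with closing the connection on such a Send.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of a single broker-side connection. Hypothetical; not Pulsar's ServerCnx. */
class BrokerConnectionSketch {
    private final Set<Long> registeredProducers = new HashSet<>();
    private final List<Long> persistedSequenceIds = new ArrayList<>();
    private final boolean closeOnUnexpectedSend; // assumed switch between the two behaviors
    private boolean open = true;

    BrokerConnectionSketch(boolean closeOnUnexpectedSend) {
        this.closeOnUnexpectedSend = closeOnUnexpectedSend;
    }

    /** Models the producer registration step. */
    void handleProducer(long producerId) {
        if (open) {
            registeredProducers.add(producerId);
        }
    }

    /** Models a Send command carrying one message with the given sequence id. */
    void handleSend(long producerId, long sequenceId) {
        if (!open) {
            return; // nothing is processed on a closed connection
        }
        if (!registeredProducers.contains(producerId)) {
            if (closeOnUnexpectedSend) {
                // Proposed behavior: treat the Send as a protocol violation.
                open = false;
                registeredProducers.clear();
            }
            return; // the message is not persisted in either mode
        }
        persistedSequenceIds.add(sequenceId);
    }

    boolean isOpen() {
        return open;
    }

    List<Long> persisted() {
        return new ArrayList<>(persistedSequenceIds);
    }

    public static void main(String[] args) {
        for (boolean strict : new boolean[] {false, true}) {
            BrokerConnectionSketch cnx = new BrokerConnectionSketch(strict);
            cnx.handleSend(1L, 0L);  // sent before the producer is registered
            cnx.handleProducer(1L);  // registration completes
            cnx.handleSend(1L, 1L);  // sent after registration
            System.out.println("closeOnUnexpectedSend=" + strict
                    + " open=" + cnx.isOpen()
                    + " persisted=" + cnx.persisted());
        }
    }
}
```

In this model, the ignoring variant persists sequence id 1 without 0, which is the kind of gap that can surface as out-of-order messages once the client resends. The closing variant persists nothing on that connection, so the client has to reconnect and resend from the first unacknowledged message.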