Re: [DISCUSS] PIP-145: Improve performance of regex subscriptions

Michael Marshall Tue, 08 Mar 2022 21:36:05 -0800

My primary concern is that our event-driven messaging platform relies
on polling to distribute system events, such as newly created
topics. I would love to find a purely event-driven solution.

I have some additional questions before I reply to the above feedback.

How will this feature affect Pulsar's load balancing of topics? The
PIP leads me to think that all brokers will host the
TopicsListService, which means the same data will be stored across
many brokers. How will this affect clusters with many topics across
many namespaces?

> There is no message filtering in this proposal. Only topic list filtering.

Broker-side filtering of topic names based on arbitrary, user-supplied
regexes feels like message filtering, which is why I mentioned the
concept. I am pretty sure that arbitrary server-side regex evaluation
expands the attack surface of a broker by creating the possibility
of a ReDoS (regular expression denial of service). At the very least, a
computationally expensive expression could degrade tenant isolation
within a broker. Am I correct in thinking that we'll need to address
this risk?
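
To make the concern concrete, here is a minimal Java sketch of one
possible mitigation: putting a time budget on each match. The names
BoundedRegex and RegexTimeoutException, and the millisecond budget, are
hypothetical and are not part of the PIP or of Pulsar. The sketch
assumes that java.util.regex reads its input through
CharSequence.charAt(), so a wrapper around the input can abort a match
that runs too long.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Pulsar or PIP-145. */
final class BoundedRegex {

    /** Thrown when a single match exceeds its time budget. */
    static final class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException(String message) {
            super(message);
        }
    }

    /** CharSequence wrapper that fails once the deadline has passed. */
    private static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // Assumption: the regex engine reads input via charAt(), so a
            // backtracking-heavy match hits this check repeatedly.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException("regex evaluation exceeded its time budget");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Matches topicName against pattern, aborting after budgetMillis. */
    static boolean matches(Pattern pattern, String topicName, long budgetMillis) {
        long deadlineNanos = System.nanoTime() + budgetMillis * 1_000_000L;
        return pattern.matcher(new DeadlineCharSequence(topicName, deadlineNanos)).matches();
    }
}
```

A broker could catch the exception and reject or skip the offending
pattern. Whether that is the right policy is an open question.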

Also, the PIP does not mention a limit on the number of topic observers
per connection. We might need a configurable limit on observers in
order to prevent an arbitrary number of arbitrary regular expressions
from being run against each new topic name.
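
As a rough illustration of what I mean by a limit, here is a minimal
Java sketch of a per-connection registry that refuses new watchers once
a cap is reached. ConnectionWatchers and maxWatchers are hypothetical
names; the PIP does not define such a class or setting.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical per-connection watcher registry; not part of PIP-145. */
final class ConnectionWatchers {
    // Assumed to come from a (hypothetical) broker configuration value.
    private final int maxWatchers;
    private final Map<Long, Pattern> watchers = new HashMap<>();

    ConnectionWatchers(int maxWatchers) {
        this.maxWatchers = maxWatchers;
    }

    /**
     * Registers a watcher, or returns false if the connection is already
     * at its limit. Throws PatternSyntaxException for an invalid regex.
     */
    synchronized boolean register(long watcherId, String topicsPattern) {
        if (!watchers.containsKey(watcherId) && watchers.size() >= maxWatchers) {
            return false;
        }
        watchers.put(watcherId, Pattern.compile(topicsPattern));
        return true;
    }

    /** Removes a watcher, freeing one slot on this connection. */
    synchronized void unregister(long watcherId) {
        watchers.remove(watcherId);
    }

    /** Number of watchers currently registered on this connection. */
    synchronized int count() {
        return watchers.size();
    }
}
```

With a cap in place, the number of regex evaluations per new topic name
is bounded by the number of connections times the cap.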

> The proposal is to augment the polling behavior with "best-effort"
> notifications, that will improve the discoverability in the best case
> but that we can correct in all cases where notifications are lost.

Thanks for clarifying. I missed the line that indicates the client
will continue to poll.
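
For my own understanding, here is a minimal Java sketch of how I read
that model: notifications update the client's view immediately, and each
poll result is treated as authoritative and repairs anything a lost
notification left out. TopicListReconciler and its method names are
hypothetical; they are not taken from the PIP or from the client code.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical client-side view of matching topics; illustration only. */
final class TopicListReconciler {
    private final Set<String> known = new TreeSet<>();

    /** Best-effort notification: a matching topic was created. */
    synchronized void onTopicAdded(String topic) {
        known.add(topic);
    }

    /** Best-effort notification: a matching topic was deleted. */
    synchronized void onTopicRemoved(String topic) {
        known.remove(topic);
    }

    /**
     * Periodic poll result, treated as the source of truth. Returns the
     * topics that the notifications failed to deliver.
     */
    synchronized Set<String> onPollResult(Set<String> polled) {
        Set<String> missed = new TreeSet<>(polled);
        missed.removeAll(known);
        known.clear();
        known.addAll(polled);
        return missed;
    }

    /** Copy of the current view. */
    synchronized Set<String> snapshot() {
        return new TreeSet<>(known);
    }
}
```

In the best case onPollResult returns an empty set, and the poll only
confirms what the notifications already delivered.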

> The problem is that this notification mechanism will not be reliable
> and thus it will not be possible to trust the compacted view of the
> topic.
> The operations of "create a topic" and "publish the notification in a
> system topic" are not atomic, therefore the notification might be
> missing, invalidating the state.

What makes this notification mechanism unreliable? I wonder if there
is a way for us to address the lack of reliability while becoming more
event driven. I accept that the current design of metadata watches
might not have strong enough consistency guarantees for my proposed
alternative.

> We're talking of adding a core Pulsar feature here, so it doesn't look
> inappropriate to me for that to involve a (fully compatible) change in
> the protocol.

I am not opposed to expanding the protocol. I mentioned the protocol
expansion only to indicate that it requires more work for each client to
take advantage of the new feature.

> Adding a system topic is actually a bigger change compared to adding a
> watch mechanism in the protocol. The reason is that one needs to
> address all the lifecycle issues of that topic, how it works with
> topic auto-create policies, or any other policy that is set on the
> namespace or at the broker level.

Existing system topics have already forced us to solve most,
if not all, of your concerns regarding topic auto-creation and
namespace policies. For example, system topics are always allowed to
be auto-created. Regarding lifecycle, the topic could be
created/deleted based on a new namespace policy, and the deletion of
the namespace would delete the topic.

On the other hand, maybe I shouldn't have described my alternative
solution as a "system" topic. Maybe a normal topic would be better.

Thanks,
Michael

On Thu, Mar 3, 2022 at 4:58 AM Andras Beni
<andras.b...@streamnative.io.invalid> wrote:
>
> Hi Michael,
>
> Thanks for having had a look at the proposal.
>
> I share Matteo's concerns about the system topic.
>
> > One of my concerns is that this feature only solves the problem of
> > topic discovery for regex consumers. Topic discovery is a generic
> > problem in Pulsar, and any solution we implement for the regex
> > consumer should benefit user applications that also need to discover
> > topics.
>
> I agree and I believe the change I propose can be a basis for another
> improvement where PulsarAdmin makes use of these new messages.
> For this broader use case we could either
>  - use ".*" as the regex and skip filtering on broker side when we see this
> specific pattern because we know everything will match or
>  - make topics_pattern an optional field of CommandWatchTopicList. For
> compatibility reasons this should be done now, if this is the direction we
> want to go.
>
> Thanks,
> Andras
>
>
> On Thu, Mar 3, 2022 at 3:50 AM Matteo Merli <matteo.me...@gmail.com> wrote:
>
> > On Wed, Mar 2, 2022 at 2:15 PM Michael Marshall <mmarsh...@apache.org>
> > wrote:
> >
> > > > A new class, org.apache.pulsar.TopicsListService will keep track
> > > > of watchers and will listen to changes in the metadata.
> > >
> > > I think we should avoid creating a new service to distribute
> > > notifications to consumers. Instead, we should consider using a
> > > compacted topic to store and distribute topic name information. We
> > > could have a system topic in each namespace that contains all of the
> > > non-system topics in the namespace. This solution would not expand the
> > > Pulsar protocol and would rely on core Pulsar features that are
> > > already hardened. Note that the implementation for the producer to the
> > > compacted topic of topic names would be nearly identical to the
> > > `TopicsListService` class. The main difference would be how changes in
> > > metadata are distributed.
> >
> > The problem is that this notification mechanism will not be reliable
> > and thus it will not be possible to trust the compacted view of the
> > topic.
> > The operations of "create a topic" and "publish the notification in a
> > system topic" are not atomic, therefore the notification might be
> > missing, invalidating the state.
> >
> > The proposal is to augment the polling behavior with "best-effort"
> > notifications, that will improve the discoverability in the best case
> > but that we can correct in all cases where notifications are lost.
> >
> > > This solution would not expand the
> > > Pulsar protocol and would rely on core Pulsar features that are
> > > already hardened.
> >
> > We're talking of adding a core Pulsar feature here, so it doesn't look
> > inappropriate to me for that to involve a (fully compatible) change in
> > the protocol.
> >
> > Adding a system topic is actually a bigger change compared to adding a
> > watch mechanism in the protocol. The reason is that one needs to
> > address all the lifecycle issues of that topic, how it works with
> > topic auto-create policies, or any other policy that is set on the
> > namespace or at the broker level.
> >
> >
> > > I concede that my solution does not support broker side message
> > > filtering. Given that Pulsar (intentionally) does not support broker
> > > side message filtering at this time, I think it is acceptable to skip
> > > this optimization in favor of a more generic feature.
> >
> > There is no message filtering in this proposal. Only topic list filtering.
> >
