[DISCUSS] Transactions isolation design

Nicolò Boschi Mon, 05 Dec 2022 03:41:17 -0800

Hi folks,

I recently opened an issue about transactions [0]. The specific problem is
that the client must be able to look up the system topic
pulsar/system/transaction_coordinator_assign to discover all the transaction
coordinators (TCs) it needs to connect to.


Since multi-tenancy is a core feature in Pulsar, this requirement may lead
to authorization issues in multi-tenant clusters breaking the tenant
isolation principle.

In this thread I'd like to discuss, more generally, the approach that was
taken when designing transactions.
The main concern I have is that TCs are global.
A TC is backed by two system topics: transaction_coordinator_assign and
__transaction_log_x.
Both are used in the transaction's hot path with different workloads.
This leads to potentially critical issues:

1. *if somehow one of these topics is unloaded, ALL the tenants using
transactions will suffer micro-outages*. (I haven't looked at the error
handling, but I suppose the error would be surfaced directly to the client.)
In general, availability and performance are no longer guaranteed
per-tenant/namespace.

I believe the TC should be per-tenant (maybe per-namespace?).
*Is there any strong reason why this shouldn't be possible by design?* (I
mean regardless of the current implementation and client-server
compatibility; we can handle those somehow, but that's a detail at the moment.)


One thing that I believe is possible at the moment (but I'm not sure) is
cross-tenant transactions. This would no longer be possible with
per-tenant TCs.

2. *the clients need lookup permission to get all the TCs* (the
transaction_coordinator_assign partitions). This can be solved in
different ways, even while keeping the TC as a system entity.

At the moment the Java client, when starting, needs to get all the
available TCs to spread transactions over them. It does so by calling
getPartitionedTopicMetadata on the system topic.
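To make the startup behaviour concrete, here is a simplified, self-contained model of it (not the actual client code). It assumes one TC per partition of the assign topic and a plain round-robin spread. The class name TcSelector, the full persistent:// topic name and the "-partition-N" naming are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model: one TC per partition of the assign topic,
// new transactions are spread over the TCs in round-robin order.
final class TcSelector {
    // Assumed fully qualified name of the system topic.
    static final String ASSIGN_TOPIC =
            "persistent://pulsar/system/transaction_coordinator_assign";

    private final List<String> partitions = new ArrayList<>();
    private final AtomicLong counter = new AtomicLong();

    // partitionCount stands in for the value the client would obtain
    // from the partitioned-topic metadata lookup at startup.
    TcSelector(int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("no transaction coordinators available");
        }
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(ASSIGN_TOPIC + "-partition-" + i);
        }
    }

    // Returns the partition (i.e. the TC) that should serve the next transaction.
    String next() {
        long n = counter.getAndIncrement();
        int index = (int) Math.floorMod(n, (long) partitions.size());
        return partitions.get(index);
    }
}
```

The point of the sketch is that the very first step depends on a metadata lookup on a topic in the pulsar/system namespace, which is exactly where a tenant-scoped role can be denied.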
To fix this there are multiple ways:
a. Suggest that users extend PulsarAuthorizationProvider with their own
implementation that always allows lookups on that particular topic. (This is
quick, works with all the existing clients, and only requires broker/proxy
restarts without token invalidations.) However, it's not built in, so this is
not optimal. More details here: [1]
b. Add a new auth action, LOOKUP, so that cluster admins can grant this
permission to their clients without affecting their ability to produce or
consume. This would require only broker restarts, plus operational costs for
the admin.
c. Create a new dedicated endpoint (in the binary protocol) that gives the
TC client all the info it needs to initialize properly. This would be the
preferred solution because the permission would be granular to this
protocol call and it wouldn't require any permission changes for
existing applications. However, only new clients (and brokers) would be able
to use this solution.
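
As a rough illustration of option a., the idea could look like the sketch below. It is self-contained on purpose: LookupAuthorizer is a hypothetical stand-in for the broker's authorization provider, not the real Pulsar interface, and the fully qualified topic name and "-partition-" suffix are assumptions. In a real deployment the same check would presumably go into the lookup check of a class extending PulsarAuthorizationProvider.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for the broker-side lookup authorization check.
interface LookupAuthorizer {
    CompletableFuture<Boolean> canLookup(String topic, String role);
}

// Always allows lookups on the TC assign topic; every other topic is
// decided by the wrapped (existing) authorizer.
final class AllowTcAssignLookup implements LookupAuthorizer {
    // Assumed fully qualified name of the system topic.
    static final String TC_ASSIGN =
            "persistent://pulsar/system/transaction_coordinator_assign";

    private final LookupAuthorizer delegate;

    AllowTcAssignLookup(LookupAuthorizer delegate) {
        this.delegate = delegate;
    }

    @Override
    public CompletableFuture<Boolean> canLookup(String topic, String role) {
        if (isTcAssignTopic(topic)) {
            // Any authenticated role may discover the TCs.
            return CompletableFuture.completedFuture(true);
        }
        return delegate.canLookup(topic, role);
    }

    // Matches the partitioned topic itself and its individual partitions.
    static boolean isTcAssignTopic(String topic) {
        return topic.equals(TC_ASSIGN) || topic.startsWith(TC_ASSIGN + "-partition-");
    }
}
```

Note that this only opens the lookup on that one topic; produce and consume permissions are untouched, which is why it works without reissuing tokens.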

I believe option c. would be great for the mid-term.
Anyway, if a per-tenant TC design is feasible, then this issue would be
resolved as well.


[0] https://github.com/apache/pulsar/issues/18716
[1] https://github.com/apache/pulsar/pull/18718


BR,
Nicolò Boschi
