Hi folks,

I recently opened an issue about transactions [0]. The specific problem is that the client must be able to look up the system topic pulsar/system/transaction_coordinator_assign to discover all the transaction coordinators (TCs) it needs to connect to.
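To make the lookup dependency concrete, here is a minimal sketch of the topic names involved. It only models names and does not use the Pulsar client. It assumes the usual "-partition-N" naming for partitioned topics and the persistent:// domain for the system topic; the class, the isInNamespace helper, and the my-tenant/my-ns namespace are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: models the topic names involved, not the real client code.
class TcLookupSketch {
    // System topic named in this thread, written as a fully qualified topic name
    // (the persistent:// domain is an assumption of this sketch).
    static final String TC_ASSIGN_TOPIC =
            "persistent://pulsar/system/transaction_coordinator_assign";

    // One partition per transaction coordinator; partitions are assumed to be
    // named "<topic>-partition-<index>".
    static List<String> coordinatorPartitions(int partitionCount) {
        List<String> partitions = new ArrayList<>();
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(TC_ASSIGN_TOPIC + "-partition-" + i);
        }
        return partitions;
    }

    // Hypothetical helper: true if the topic lives in the given "tenant/namespace".
    static boolean isInNamespace(String topic, String namespace) {
        return topic.startsWith("persistent://" + namespace + "/");
    }

    public static void main(String[] args) {
        // Every TC partition sits outside the tenant's own namespace.
        for (String partition : coordinatorPartitions(3)) {
            System.out.println(partition + " in my-tenant/my-ns? "
                    + isInNamespace(partition, "my-tenant/my-ns"));
        }
    }
}
```

The point of the sketch is that every topic the client has to look up lives in pulsar/system, so permissions granted only on a tenant's own namespaces never cover it.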
Since multi-tenancy is a core feature of Pulsar, this requirement may lead to authorization issues in multi-tenant clusters and break the tenant isolation principle.

In this thread I'd like to discuss, more generally, the approach taken in designing transactions. My main concern is that TCs are global. A TC is backed by two system topics: transaction_coordinator_assign and __transaction_log_x. Both are used in the transaction hot path, with different workloads. This leads to two potentially critical issues:

1. *If one of these topics is somehow unloaded, ALL the tenants using transactions will suffer micro-outages.* (I haven't looked at the error handling, but I suppose the error would be surfaced directly to the client.) More generally, availability and performance can no longer be guaranteed per tenant/namespace.

   I believe the TC should be per-tenant (or maybe per-namespace?). *Is there any strong reason why this shouldn't be possible by design?* (I mean regardless of the current implementation and client-server compatibility; we can handle those somehow, but that's a detail at the moment.)

   One thing that I believe is possible at the moment (but I'm not sure) is cross-tenant transactions. These would no longer be possible with a per-tenant TC.

2. *The clients need lookup permission to get all the TCs* (the transaction_coordinator_assign partitions). This can be solved in different ways, even while keeping the TC as a system entity. At the moment the Java client, when starting, needs to get all the available TCs to spread transactions over them. The call it makes is getPartitionedTopicMetadata on the system topic. There are multiple ways to fix this:

   a. Suggest that users extend PulsarAuthorizationProvider with their own implementation that always allows lookup on that particular topic. (Quick, works with all existing clients, and only requires broker/proxy restarts, without token invalidation.) However, it's not built in, so this is not optimal. More details here: [1]

   b. Add a new auth action, LOOKUP, to let cluster admins give this permission to their clients without affecting their ability to produce or consume. This would require only broker restarts, plus operational costs for the admin.

   c. Create a new dedicated endpoint (in the binary protocol) that gives the TC client all the info it needs to initialize properly. This would be the preferred solution because the permission would be scoped to this protocol call and it wouldn't require any permission changes for current applications. However, only new clients (and brokers) would be able to use this solution.

I believe option c. would be great for the mid-term. Anyway, if a per-tenant TC is feasible by design, this issue would be resolved as well.

[0] https://github.com/apache/pulsar/issues/18716
[1] https://github.com/apache/pulsar/pull/18718

BR,
Nicolò Boschi