I think we agree on the general shape of the design, as follows:

- Transactions store message data somewhere where it is not
immediately materialized to the client.
- On commit, a single message is written to some log.
- Commit messages are then written to the topics involved in the
transaction; these point to the message data and materialize it to
the client?

My inline replies assume the above (sketched below), so if you have a
different view of the general shape, let me know.
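
To make sure we're arguing from the same picture, here is a minimal
sketch of that shape in Java. Everything in it (CommitFlowSketch,
TxnLog, writeCommitMarker, and so on) is an invented name for
discussion, not an existing or proposed API.

    import java.util.List;
    import java.util.Map;

    // Minimal sketch of the commit shape described above. Every name
    // here is a placeholder for discussion, not an existing Pulsar API.
    public class CommitFlowSketch {

        interface TxnLog { void append(long txnId, Map<String, List<Long>> staged); }
        interface Topic  { void writeCommitMarker(long txnId, List<Long> stagedIds); }
        interface Topics { Topic get(String name); }

        // 'staged' maps topic name -> ids of message data already stored
        // out of band, i.e. not yet visible to consumers.
        static void commit(long txnId, Map<String, List<Long>> staged,
                           TxnLog txnLog, Topics topics) {
            // 1. On commit, a single record goes to the transaction log.
            txnLog.append(txnId, staged);

            // 2. A commit marker is then written to each involved topic,
            //    pointing at the staged data; this is what materializes
            //    it to clients.
            staged.forEach((topicName, ids) ->
                    topics.get(topicName).writeCommitMarker(txnId, ids));
        }
    }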

> Database is very different from a streaming system. Database storage is
> optimized for randomized accesses, where data tends to be organized in
> key + version ordering, and where interleaving committed and uncommitted
> data is fine. However, in a streaming system, interleaving means you have
> to go back-and-forth between transactions,

Whether we store it interleaved or in a separate ledger, we are going
to have to do a random read either way. The only way to avoid that
random read is to rewrite all of a transaction's messages after the
transaction has committed, which means write amplification.
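
To spell out the trade-off: either the broker dereferences the marker
at read time (one random read per staged entry), or it rewrites the
payloads into the topic at commit time. A sketch, with invented names
(Store, Topic, and the method names are all illustrative):

    import java.util.List;

    // Sketch of the trade-off above; all names are invented.
    public class ReadVsRewriteSketch {

        interface Store { byte[] read(long stagedId); }   // one random read
        interface Topic { void append(byte[] payload); }

        // Option A: dereference at read time. Each staged entry costs one
        // random read, paid on the read path.
        static byte[] dereferenceOnRead(Store store, long stagedId) {
            return store.read(stagedId);
        }

        // Option B: rewrite at commit time. Reads stay sequential, but
        // every transactional payload is written twice (write
        // amplification).
        static void rewriteOnCommit(Store store, Topic topic, List<Long> stagedIds) {
            for (long id : stagedIds) {
                topic.append(store.read(id));
            }
        }
    }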

> I was thinking of one (or multiple) ledgers per partition per transaction.
> This is the simplest solution, where you can delete a ledger to abort a
> transaction. However, it will put pressure on the metadata store, but we
> can address that by using the table service to store that metadata, as
> well as transaction statuses.

This is a huge amount of pressure on metadata. With the current
metadata service, committing a transaction costs 4 ZK writes (the
transaction id, plus creating, closing, and anchoring the ledger
somewhere).

Then you need to delete the data when the topic gets truncated. You
need some way to correlate which data ledgers correspond to the
messages being truncated from the topic, and the metadata store then
has to absorb a massive number of delete operations in a short time.
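
For a rough sense of scale: at 4 ZK writes per commit, even 1,000
transactions/s means 4,000 ZK writes/s of transaction metadata alone,
before counting the deletes at truncation time.
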
I also wouldn't consider the table service mature enough to put ledger
metadata on yet. It's new, undocumented, and has very little
production exposure so far. Taking a hard dependency on the table
service would delay transactions by at least a year. Putting the
transaction components themselves on the table service is fine, as
transactions are also a new service, but ledger metadata is the very
core of the system.

> We have a few benefits: 1) make sure the entries of the same
> transaction in the same partition are kind of contiguously stored on
> bookies, because they will be ordered by ledgerId + entryId.

Clients will not care whether the entries from the same transaction
are contiguous on disk. Clients read from a topic, so unless there are
multiple entries from the same transaction in the topic, they will
only ever read one entry of that transaction at a time.

> 2) the metadata and transaction status parts are handled by the table
> service, where we are able to support a large number of outstanding
> transactions as it scales.

How do you clean up old transactions from the table service?

> We can share a ledger across transactions; however, it will be similar to
> interleaving transactions into the data ledgers. You will end up building
> indexing, caching and compaction logic, which has already been implemented
> on bookies and the table service.

I think having a ledger, or a shadow topic, per topic would work fine.
There's no need for indexing: we can already look up a message by
message id. That message id should be part of the transaction commit
message, and subsequently part of the commit marker written to each
topic involved in the transaction. Caching is no different than with
any other choice of storage for the message data. And compaction is
unneeded: whatever design we end up with, we will need to put a limit
on the lifetime of transactions, which gives an upper bound on how
much longer we need to retain the shadow topic relative to the main
topic.
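
Concretely, the read path I have in mind looks something like the
sketch below; MsgId, ShadowTopic and CommitMarker are placeholder
names, not proposed types.

    import java.util.List;
    import java.util.function.Consumer;

    // Sketch of the shadow-topic read path; all names are placeholders.
    public class ShadowTopicSketch {

        // Stand-in for a Pulsar MessageId (ledgerId + entryId).
        record MsgId(long ledgerId, long entryId) {}

        // The commit marker written to each involved topic carries the
        // ids of that transaction's entries in the topic's shadow topic.
        record CommitMarker(long txnId, List<MsgId> shadowIds) {}

        interface ShadowTopic { byte[] readByMessageId(MsgId id); }

        // On dispatch, the broker resolves the marker with the
        // message-id lookup we already have; no extra index is needed.
        static void dispatch(CommitMarker marker, ShadowTopic shadow,
                             Consumer<byte[]> deliver) {
            for (MsgId id : marker.shadowIds()) {
                deliver.accept(shadow.readByMessageId(id));
            }
        }
    }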

-Ivan
