Hi all,

Thanks for allowing me to bring up the topic of multi-table snapshot
isolation in the community sync.

I did some research and found the batch load API proposal from Steven Wu -
https://lists.apache.org/thread/wbtnjsm59ocdgtfdn0rrpfg8gj7d7qg9
The proposal doesn't address the transactional aspect of the API. I think one
of the key questions from the community sync is whether or not we should make
this API transactional and support snapshot isolation.

I think there are a few good reasons to make it transactional, or at least to
offer transactional behavior as an option:

1. Without a transactional batch get, we currently have no way to achieve
snapshot isolation (SI) for multi-table statements. Calling our current
`loadTable` API sequentially effectively gives us only the Read Committed
isolation level. This violates the spec definition for table properties, which
only allows `Snapshot` or `Serializable`. In our data system (AWS Redshift),
multi-table statements represent a large percentage of the total queries we see
in the fleet. With the current implementation, all these queries are
potentially running at a much weaker Read Committed level than they were
designed for.

2. We already have the multi-table commit API -
/v1/{prefix}/transactions/commit - which requires the commit to be done within
an atomic transaction. So the transactional requirement for the catalog store
is already there; it is not new. We should just leverage this property to give
us SI for batch load.
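
To make point 1 concrete, here is a minimal sketch of the anomaly. It uses a
toy in-memory catalog, not the real REST catalog: `ToyCatalog`, `load_tables`,
and the table names are all hypothetical and only stand in for `loadTable`, the
multi-table commit endpoint, and a possible transactional batch load.

```python
import threading


class ToyCatalog:
    """In-memory stand-in for a catalog store (illustration only)."""

    def __init__(self, pointers):
        self._pointers = dict(pointers)  # table name -> metadata version
        self._lock = threading.Lock()

    def load_table(self, name):
        # One table per call: each call sees the latest committed state.
        with self._lock:
            return self._pointers[name]

    def commit_transaction(self, updates):
        # Multi-table commit: all pointers move together atomically.
        with self._lock:
            self._pointers.update(updates)

    def load_tables(self, names):
        # Hypothetical transactional batch load: a single consistent read.
        with self._lock:
            return {name: self._pointers[name] for name in names}


catalog = ToyCatalog({"orders": 1, "line_items": 1})

# Sequential loads, with another writer's multi-table commit landing in between.
orders_version = catalog.load_table("orders")
catalog.commit_transaction({"orders": 2, "line_items": 2})
line_items_version = catalog.load_table("line_items")
torn_view = {"orders": orders_version, "line_items": line_items_version}

# A batch load reads both pointers inside one critical section.
consistent_view = catalog.load_tables(["orders", "line_items"])

print(torn_view)        # versions come from two different committed states
print(consistent_view)  # versions come from one committed state
```

The sequential reader ends up with `orders` from before the commit and
`line_items` from after it, which is the Read Committed behavior described
above. The batch read cannot observe that mix because it shares the same
atomicity boundary as the commit.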

Regarding the CSN (Catalog Sequence Number) alternative, I also replied to 
Maninder’s comments in the proposal doc - 
https://docs.google.com/document/d/1u11b4pzeFUKD0XX--nHPj-DoYcNeCgOe94WKCaX2XMI/edit?tab=t.0
 

My high-level takeaway is that, in order to implement CSN, the metadata JSON
file needs to be generated, or rewritten, by the catalog service at commit
time. This would most definitely require all commits to go through IRC, which
doesn’t seem to be something that will happen soon. Even if we plan for a
long-term CSN solution, transactional read/write support on the catalog store
is still required - for example, we would need a single SI transaction to
update the CSN for an atomic commit. So from that perspective, I don’t think
these two approaches conflict: the batch load API can return a snapshot view of
objects in their `as-of-current` state, and in the future, if the object state
contains a list of CSNs, the client can also choose to load a historical
snapshot by aligning the CSNs from multiple objects.
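
As a rough illustration of what "aligning the CSNs" could look like on the
client side, here is a small sketch. The data layout is purely an assumption
for this example (each table carries an ascending list of `(csn, metadata
version)` pairs); nothing here is taken from the proposal doc, and
`align_to_csn` is a hypothetical helper.

```python
from bisect import bisect_right


def align_to_csn(histories, target_csn):
    """Pick, per table, the newest version committed at or before target_csn.

    histories: table name -> list of (csn, metadata_version), ascending by csn.
    Returns None for a table that has no version at or before target_csn.
    """
    view = {}
    for name, history in histories.items():
        csns = [csn for csn, _ in history]
        idx = bisect_right(csns, target_csn)
        view[name] = history[idx - 1][1] if idx > 0 else None
    return view


# Assumed example data: CSN 17 is one multi-table commit touching both tables.
histories = {
    "orders": [(10, "v1"), (17, "v2"), (25, "v3")],
    "line_items": [(12, "v1"), (17, "v2")],
}

view_at_20 = align_to_csn(histories, 20)
print(view_at_20)
```

With this layout a client could request the current state through the batch
load API, or choose an older CSN and reconstruct a consistent historical view
across tables without another round trip.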

Let’s continue this discussion. If people are aligned on providing a
transactional batch load API, I can work with Steven on the details of the API
proposal.
