Hi Jack,

I’ve thought about streaming the data files and it will work as you described. If
this is a common practice, we can use it in any of our custom processing
engines, so I wanted to make sure it is a common/recommended practice. Are
we thinking about using the same approach in Python/Dask?

Thanks,
Mayur


From: Jack Ye <yezhao...@gmail.com>
Sent: Friday, December 3, 2021 4:26 PM
To: Iceberg Dev List <dev@iceberg.apache.org>
Subject: Re: Single multi-process commit

Hi Mayur,

I think what you describe, writing in parallel and committing through a
coordinator, is the strategy used by most engines today. The stream of
DataFiles (the statistics collected from the written data files) is passed to the
coordinator, which does a single commit. In Spark, it's passed as WriteCommitMessage
(see SparkWrite.commit). In Presto and Trino, it's passed as CommitTaskData.
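
In core Iceberg API terms, the coordinator side looks roughly like the sketch
below. This is a minimal sketch only: how the workers ship their DataFile
metadata back to the coordinator is engine-specific, and the class and method
names here are hypothetical.

    import java.util.List;

    import org.apache.iceberg.AppendFiles;
    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.Table;

    public class CoordinatorCommit {
      // Hypothetical coordinator step: workers have already written their
      // data files and sent back the DataFile metadata (path, stats, partition).
      public static void commitAll(Table table, List<DataFile> filesFromWorkers) {
        AppendFiles append = table.newAppend(); // builds one pending snapshot
        for (DataFile file : filesFromWorkers) {
          append.appendFile(file);              // register each worker's output
        }
        append.commit();                        // one atomic commit: all or nothing
      }
    }

Because everything goes into a single newAppend(), the table either gains all
of the new files in one snapshot or none of them.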

I am not sure why staging and cherry-picking are needed. Am I missing something?

Thanks,
Jack Ye

On Fri, Dec 3, 2021 at 12:59 PM Mayur Srivastava
<mayur.srivast...@twosigma.com> wrote:
Hi,

Let’s say there are N (e.g. 32) distributed processes writing to different
(non-overlapping) partitions in the same Iceberg table in parallel.
When all of them have finished writing, is there a way for a coordinator
process to do a single commit at the end so that either all of the writes are
committed or none are?

I see that there is snapshot staging support, but can we cherry-pick multiple
staged snapshots (as long as it is safe to do so)?
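
To be concrete, the pattern I have in mind is roughly the following. This is a
rough sketch against the core API; discovering the staged snapshot IDs is
omitted, and the helper names are mine.

    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.Table;

    class StagedCommitSketch {
      // Stage a snapshot: its metadata is written, but the table's
      // current snapshot pointer does not move yet.
      static void stage(Table table, DataFile dataFile) {
        table.newAppend()
            .appendFile(dataFile)
            .stageOnly()
            .commit();
      }

      // Later, promote a staged snapshot by ID; my question is whether
      // several of these can be applied together safely.
      static void promote(Table table, long stagedSnapshotId) {
        table.manageSnapshots()
            .cherrypick(stagedSnapshotId)
            .commit();
      }
    }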

Thanks,
Mayur
