Hi Jack,

I’ve thought about streaming the DataFiles and it will work as you said. If this is a common practice, we could use it in any of our custom processing engines, so I wanted to make sure it is a common/recommended practice. Are we thinking about using the same approach in Python/Dask?
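[A minimal sketch of the coordinator-side commit described in this thread, assuming the Iceberg Java API; how the DataFile objects get from the workers to the coordinator is engine-specific and only hinted at here.]

import java.util.List;
import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;

public class CoordinatorCommit {
  // Workers write their (non-overlapping) partitions and send back the
  // resulting DataFile objects (path, partition, metrics). Collecting
  // them at the coordinator is engine-specific and not shown here.
  public static void commitAll(Table table, List<DataFile> filesFromWorkers) {
    AppendFiles append = table.newAppend();
    for (DataFile file : filesFromWorkers) {
      append.appendFile(file);
    }
    // One atomic commit: either every worker's files become visible in
    // the new snapshot, or none of them do.
    append.commit();
  }
}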
Thanks,
Mayur

From: Jack Ye <yezhao...@gmail.com>
Sent: Friday, December 3, 2021 4:26 PM
To: Iceberg Dev List <dev@iceberg.apache.org>
Subject: Re: Single multi-process commit

Hi Mayur,

I think what you describe, writing in parallel and committing through a coordinator, is the strategy used by most of the engines today. The stream of DataFile objects (statistics collected from the written data files) is passed to the coordinator to do a single commit. In Spark, it is passed as WriteCommitMessage (see SparkWrite.commit). In Presto and Trino, it is passed as CommitTaskData. I am not sure why staging and cherry-picking are needed; am I missing anything?

Thanks,
Jack Ye

On Fri, Dec 3, 2021 at 12:59 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:

Hi,

Let's say there are N (e.g. 32) distributed processes writing to different (non-overlapping) partitions of the same Iceberg table in parallel. When all of them finish writing, is there a way to do a single commit (by a coordinator process) at the end so that either all or none of the data is committed? I see that there is snapshot staging support, but can we cherry-pick multiple staged snapshots (as long as it is safe)?

Thanks,
Mayur
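[For reference, the staging/cherry-pick route Mayur asks about looks roughly like this with the Iceberg Java API. This is a sketch only: locating the staged snapshot id afterwards, e.g. by scanning table.snapshots(), is left out.]

import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;

public class StagedCommit {
  // Stage an append: the snapshot is written to table metadata but is
  // not made the current table state.
  public static void stageAppend(Table table, Iterable<DataFile> files) {
    AppendFiles append = table.newAppend();
    for (DataFile file : files) {
      append.appendFile(file);
    }
    append.stageOnly().commit();
  }

  // Publish a previously staged snapshot. cherrypick takes a single
  // snapshot id, so publishing N staged snapshots means N separate
  // commits rather than one atomic commit across all of them.
  public static void publish(Table table, long stagedSnapshotId) {
    table.manageSnapshots().cherrypick(stagedSnapshotId).commit();
  }
}

[Because cherrypick publishes one snapshot per commit, it does not give the all-or-nothing guarantee Mayur is after; the coordinator's single newAppend().commit() in the earlier sketch does, which matches Jack's point that staging is not needed here.]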