Hi Jack, I’ve thought about streaming the data files and it will work as you said. If this is a common practice, we can use it in any of our custom processing engines, so I wanted to make sure that it is a common/recommended practice. Are we thinking about using the same approach in Python/Dask?
Thanks,
Mayur

From: Jack Ye <[email protected]>
Sent: Friday, December 3, 2021 4:26 PM
To: Iceberg Dev List <[email protected]>
Subject: Re: Single multi-process commit

Hi Mayur,

I think what you describe, writing in parallel and committing using a coordinator, is the strategy used by most of the engines today. The stream of DataFile objects (statistics collected from the written data files) is passed to the coordinator to do a single commit. In Spark, it's passed as WriteCommitMessage (see SparkWrite.commit). In Presto and Trino, it's passed as CommitTaskData. I am not sure why staging and cherry-pick are needed, am I missing anything?

Thanks,
Jack Ye

On Fri, Dec 3, 2021 at 12:59 PM Mayur Srivastava <[email protected]> wrote:

Hi,

Let’s say there are N (e.g. 32) distributed processes writing to different (non-overlapping) partitions in the same Iceberg table in parallel. When all of them finish writing, is there a way to do a single commit (by a coordinator process) at the end so that either all or none of it is committed? I see that there is snapshot staging support, but can we cherry-pick multiple staged snapshots (as long as it is safe)?

Thanks,
Mayur
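For concreteness, here is a minimal sketch of the coordinator-side commit described above, assuming the Iceberg Java API (Table.newAppend() / AppendFiles). It assumes each worker writes its files and ships the resulting DataFile metadata back to the coordinator; how those DataFile objects are transported (serialization, RPC, etc.) is left out, and the class and method names are illustrative only.

    import java.util.List;
    import org.apache.iceberg.AppendFiles;
    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.Table;

    public class CoordinatorCommit {
      // dataFiles: DataFile metadata (path, partition, record counts, column
      // stats) collected from the N writer processes; the workers write the
      // files but do not commit themselves.
      public static void commitAll(Table table, List<DataFile> dataFiles) {
        AppendFiles append = table.newAppend();   // one pending append operation
        for (DataFile file : dataFiles) {
          append.appendFile(file);                // stage every worker's files
        }
        append.commit();                          // single atomic commit: all or none
      }
    }

Because everything is folded into one AppendFiles operation, the table advances to the new snapshot in a single atomic step, so either all workers' files become visible or none do, without needing staged snapshots or cherry-picking.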
