Re: Iceberg with high frequency data!

2020-07-15 Thread Jingsong Li
Hi Ashish, here is my thinking: IIUC, the Spark writer (record writer) also buffers files as Iceberg DataFiles. For every micro-batch, Spark:
- closes the DataFiles at the end (at least one file per task, if the task has records), and
- collects them on the driver side and does a snapshot commit.
So, you can choose the tr…
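For context, a minimal sketch of the micro-batch pattern Jingsong describes, assuming Iceberg's Spark structured streaming sink; the rate source, table path, checkpoint location, and one-minute trigger are all hypothetical stand-ins. Each micro-batch ends in one Iceberg snapshot commit, so the trigger interval effectively sets the commit frequency:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.Trigger;

public class StreamToIceberg {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("stream-to-iceberg")
        .getOrCreate();

    // Stand-in source; in practice this would be the incoming small-file stream.
    Dataset<Row> events = spark.readStream().format("rate").load();

    events.writeStream()
        .format("iceberg")                           // Iceberg's structured streaming sink
        .outputMode("append")
        .trigger(Trigger.ProcessingTime("1 minute")) // one snapshot commit per micro-batch
        .option("path", "hdfs://nn/warehouse/db/events")              // hypothetical table location
        .option("checkpointLocation", "hdfs://nn/checkpoints/events") // hypothetical
        .start()
        .awaitTermination();
  }
}
```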

Re: Iceberg with high frequency data!

2020-07-15 Thread Ashish Mehta
Thanks Ryan! It sounds like exactly the problem we are hitting, but we are entirely in the Spark domain, i.e. each small file is validated using Spark and is currently buffered as a Parquet table; we are looking to buffer them as Iceberg DataFiles, so that the commit operation doesn't have to read those…
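A minimal sketch of what buffering as an Iceberg DataFile could look like, assuming the validating writer already knows the file's path, size, and record count (the helper name and parameters are hypothetical). The commit appends metadata only; the Parquet file itself is never re-read:

```java
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.Table;

public class BufferedFileCommit {
  /**
   * Registers an already-written Parquet file with an Iceberg table.
   * The path and stats would come from the validating writer; the
   * commit writes metadata only and never re-reads the data file.
   */
  static void commitBufferedFile(Table table, String path, long sizeInBytes, long recordCount) {
    DataFile dataFile = DataFiles.builder(table.spec())
        .withPath(path)
        .withFormat(FileFormat.PARQUET)
        .withFileSizeInBytes(sizeInBytes)
        .withRecordCount(recordCount)
        .build();

    table.newAppend()
        .appendFile(dataFile)
        .commit();
  }
}
```

For a partitioned table, the builder would also need the file's partition data, e.g. via withPartitionPath.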

Re: Iceberg with high frequency data!

2020-07-15 Thread Ryan Blue
Hi Ashish, You might try the approach that we took for the Flink writer. In Flink, we have multiple tasks writing data files. When a checkpoint completes, the data files are closed and a DataFile instance with all of the Iceberg metadata is sent to a committer task. Once all the DataFile instances…
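A minimal sketch of the committer side of that pattern, using Iceberg's AppendFiles API; the SnapshotCommitter class is hypothetical, standing in for Flink's actual committer task. All DataFile instances collected for one checkpoint go into a single atomic snapshot:

```java
import java.util.List;
import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;

/**
 * Hypothetical committer: receives DataFile metadata from the writer
 * tasks and commits it all in one snapshot when a checkpoint completes.
 */
class SnapshotCommitter {
  private final Table table;

  SnapshotCommitter(Table table) {
    this.table = table;
  }

  void commit(List<DataFile> completedFiles) {
    AppendFiles append = table.newAppend();
    for (DataFile file : completedFiles) {
      append.appendFile(file);  // metadata only; no data is rewritten
    }
    append.commit();            // single atomic snapshot for the whole checkpoint
  }
}
```

Because writers ship only DataFile metadata to the committer, the commit cost stays independent of how much data each checkpoint wrote.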

Iceberg with high frequency data!

2020-07-15 Thread Ashish Mehta
Hi Team, We have a use case of ingesting a high frequency of small files into Iceberg. I know from the documentation that Iceberg is aimed at slow-moving data, but given the kind of features Iceberg has around reader/writer isolation, etc., we really need it for our high-frequency small files (even th…