Hi Ashish,
Here is my thinking:
IIUC, the Spark writer (record writer) also buffers files as Iceberg DataFiles.
For every micro-batch, Spark:
- Closes the DataFiles at the end (at least one file per task, if the task has
records)
- Collects them on the driver side and does a snapshot commit (rough sketch below)
So, you can choose the tr...
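For the driver-side step, the shape would be roughly this (a minimal sketch
using the core Iceberg API, assuming the tasks have already closed their files
and shipped the DataFile metadata back to the driver):

    import java.util.List;
    import org.apache.iceberg.AppendFiles;
    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.Table;

    public class MicroBatchCommit {
      // Runs on the driver once per micro-batch, after every task has closed
      // its files and returned the DataFile metadata.
      static void commitMicroBatch(Table table, List<DataFile> filesFromTasks) {
        AppendFiles append = table.newAppend();       // start a new snapshot
        filesFromTasks.forEach(append::appendFile);   // add each task's file
        append.commit();                              // one atomic commit per batch
      }
    }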
Thanks Ryan! That sounds exactly like the problem we are hitting, but we are
entirely in the Spark domain, i.e. each small file is validated using Spark
and is currently buffered as a Parquet table. We are looking to buffer them
as Iceberg DataFiles instead, so that the commit operation doesn't have to
read thos...
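Something like the following is what we have in mind for that buffering step
(just a rough sketch with the core Iceberg API; the path, size, and record
count are placeholders for whatever our validation step already knows, and a
partitioned spec would also need the partition data set on the builder):

    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.DataFiles;
    import org.apache.iceberg.FileFormat;
    import org.apache.iceberg.PartitionSpec;

    public class BufferAsDataFile {
      // Wrap an already-written Parquet file in DataFile metadata, so a later
      // commit only appends metadata and never has to re-read the data itself.
      static DataFile toDataFile(PartitionSpec spec, String path,
                                 long sizeInBytes, long recordCount) {
        return DataFiles.builder(spec)
            .withPath(path)                      // location of the validated file
            .withFormat(FileFormat.PARQUET)
            .withFileSizeInBytes(sizeInBytes)
            .withRecordCount(recordCount)
            .build();
      }
    }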
Hi Ashish,
You might try the approach that we took for the Flink writer. In Flink, we
have multiple tasks writing data files. When a checkpoint completes, the
data files are closed and a DataFile instance with all of the Iceberg
metadata is sent to a committer task. Once all the DataFile instances...
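In sketch form, the committer side just buffers the incoming DataFile
instances and commits them together once the checkpoint completes (this is a
simplified illustration, not the actual Flink sink code; the class and method
names are made up):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.iceberg.AppendFiles;
    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.Table;

    // Simplified committer: writer tasks hand over DataFile metadata, and
    // everything received for a checkpoint is committed as one snapshot.
    class CheckpointCommitter {
      private final Table table;
      private final List<DataFile> pending = new ArrayList<>();

      CheckpointCommitter(Table table) {
        this.table = table;
      }

      void receive(DataFile file) {        // called once per writer task's file
        pending.add(file);
      }

      void onCheckpointComplete() {        // the checkpoint has finished
        AppendFiles append = table.newAppend();
        pending.forEach(append::appendFile);
        append.commit();                   // single commit for the whole checkpoint
        pending.clear();
      }
    }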
Hi Team,
We have a use case of ingesting small files into Iceberg at high frequency. I
know from the documentation that Iceberg is aimed at slow-moving data, but the
features Iceberg has around reader/writer isolation, etc. are exactly what we
need for our high-frequency small files (even th...