We've been building pipelines that write to Iceberg tables from Flink. Right now, we have applications deployed across 3 AWS regions, committing every 10 minutes. We also have an application that monitors the tables and moves files from remote regions into the region where we run our Hadoop clusters, and one that automatically merges small files in the background.
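For illustration, here is a minimal sketch (not Iceberg's actual API -- all names, thresholds, and sizes here are hypothetical) of the basic idea behind that background small-file merge job: scan the table's data file sizes, pick out the files below a "small" threshold, and bin-pack them into merge tasks that approach a target file size.

```python
# Toy model of planning a small-file compaction pass.
# Assumptions (not from the thread): a 128 MB target file size and a
# "small" threshold of half the target; real jobs would read file sizes
# from table metadata rather than take a plain list.

TARGET_SIZE = 128 * 1024 * 1024       # hypothetical target output size
SMALL_FILE_THRESHOLD = TARGET_SIZE // 2  # files below this are "small"


def plan_merge_tasks(file_sizes, target=TARGET_SIZE, threshold=SMALL_FILE_THRESHOLD):
    """Group small files into merge tasks whose combined size stays under target."""
    small = sorted(s for s in file_sizes if s < threshold)
    tasks, current, current_size = [], [], 0
    for size in small:
        # Close the current task when adding this file would overflow it.
        if current and current_size + size > target:
            tasks.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if len(current) > 1:  # merging a single file gains nothing
        tasks.append(current)
    return tasks


# Example: three small files and one already-large file (sizes in bytes).
# With target=50 and threshold=100, only [10, 20] fit in one task.
print(plan_merge_tasks([10, 20, 30, 200], target=50, threshold=100))  # [[10, 20]]
```

The bin-packing here is deliberately naive (sorted first-fit); it is just meant to show why a background process can keep commit counts manageable by turning many tiny appends into a few larger files.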
What we see is commits about every 2-3 minutes on average, with some periodic conflicts when writes across regions happen at the same time. The minimum gap between commits is one second, and it isn't uncommon to see commits less than 10 seconds apart.

My interpretation of this is that commit retries for appends are reasonably fast -- fast enough to support streaming writes a few minutes apart. I think these stats mean that we could definitely support structured streaming use cases. And we could also use a table's snapshot history to support reading from an Iceberg table as a streaming source.

On Mon, May 6, 2019 at 4:18 AM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:

> Hi,
>
> I would like to discuss the support for micro-batch streaming in Iceberg.
>
> First of all, do we think micro-batch use-cases are appropriate for
> Iceberg? What do we consider as "slow-moving data"? Do we want to support
> batch intervals of 30s? What about intervals of 1m/2m/5m? The latter seems
> doable as we already have FastAppend and other optimizations in place.
>
> In particular, I am interested in Spark structured streaming. I have a few
> things I want to discuss, but let's confirm it is appropriate for Iceberg.
>
> Thanks,
> Anton

--
Ryan Blue
Software Engineer
Netflix