Thanks for the quick response!

And yes, I also experimented with expireSnapshots() and it looked good. I
can imagine some alternative conditions for expiring snapshots (like
adjusting the "granularity" between snapshots instead of removing all
snapshots before a specific timestamp), but for now it's just an idea and I
don't have a real-world need to back it up.
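For reference, here is roughly what I ran - a minimal sketch assuming the
Iceberg Java Table API (the catalog setup, table name, and retention values
are mine, not anything prescribed):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;

public class ExpireSnapshotsSketch {
    public static void main(String[] args) {
        // Hypothetical warehouse location and table name.
        HadoopCatalog catalog =
                new HadoopCatalog(new Configuration(), "hdfs://warehouse/path");
        Table table = catalog.loadTable(TableIdentifier.of("db", "events"));

        // Expire snapshots older than 24 hours, but always retain the
        // 10 newest so recent time travel keeps working.
        long dayAgoMillis = System.currentTimeMillis() - 24L * 60 * 60 * 1000;
        table.expireSnapshots()
             .expireOlderThan(dayAgoMillis)
             .retainLast(10)
             .commit();
    }
}
```

The `retainLast(...)` call is the closest existing knob to the "granularity"
idea above, though it only bounds the snapshot count, not the spacing
between snapshots.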

I also went through RewriteDataFilesAction, and it looked good as well.
There's an existing GitHub issue about making the action more intelligent,
which is valid and good to add. One thing I noticed is that it's a fairly
time-consuming task (expected for sure, not a problem), and it seems to
fail when a high-rate streaming write query is running on the other side
(this is a concern). I'd guess the action only touches old snapshots, hence
no conflict should be expected against fast appends. It would be nice to
know whether this is expected behavior and it's recommended to stop all
writes before running the action, or whether it sounds like a bug.
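For context, this is roughly how I invoked it - a sketch assuming the
Spark Actions API (the target file size is my choice, and the result
accessors are from my reading of the code, so treat them as assumptions):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.Actions;
import org.apache.iceberg.actions.RewriteDataFilesActionResult;

public class CompactionSketch {
    // Compact the many small files produced by fast appends into
    // roughly 128 MB files; this was running while a streaming query
    // was writing to the same table on the other side.
    public static void runCompaction(Table table) {
        RewriteDataFilesActionResult result =
                Actions.forTable(table)   // uses the active SparkSession
                       .rewriteDataFiles()
                       .targetSizeInBytes(128L * 1024 * 1024)
                       .execute();

        System.out.println("Rewrote " + result.deletedDataFiles().size()
                + " files into " + result.addedDataFiles().size());
    }
}
```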

I haven't gone through RewriteManifestAction, though for now I'm only
curious about the use cases. I'm eager to experiment with the streaming
source, which is in review - I don't know enough about Iceberg's internals
to be qualified to review it, so I'd rather play with it when it's
available and use that as a chance to learn about Iceberg itself.

Btw, from the end user's point of view, none of the actions are documented
- even the structured streaming sink is not documented, and I had to read
the code. While I think it's obvious that the streaming sink should be
covered in the Spark docs (I wonder why it's missing), would we want to
document the actions as well? These actions look like they are still
evolving, so I'm wondering whether we are waiting for them to stabilize, or
the documentation was just missed.

Thanks,
Jungtaek Lim

On Tue, Jul 28, 2020 at 2:45 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Hi Jungtaek,
>
> That setting controls whether Iceberg cleans up old copies of the table
> metadata file. The metadata file holds references to all of the table's
> snapshots (that have not expired) and is self-contained. No operations need
> to access previous metadata files.
>
> Those aren't typically that large, but they can be when streaming,
> because you create a lot of versions. For streaming, I'd recommend turning
> it on and making sure you're running `expireSnapshots()` regularly to prune
> old table versions -- although expiring snapshots will remove them from
> table metadata and limit how far back you can time travel.
>
> On Mon, Jul 27, 2020 at 4:33 AM Jungtaek Lim <kabhwan.opensou...@gmail.com>
> wrote:
>
>> Hi devs,
>>
>> I'm experimenting with Apache Iceberg for Structured Streaming sink -
>> plan to experiment with source as well, but I see PR still in review.
>>
>> It seems that "fast append" helps quite a bit to keep commit latency
>> reasonable, though the metadata directory grows too fast. I found the
>> option 'write.metadata.delete-after-commit.enabled' (false by default),
>> enabled it, and the overall size looks fine afterwards.
>>
>> That said, given the option is false by default, I'm wondering what
>> would be impacted by turning this option on. My understanding is that it
>> doesn't affect time travel (as that refers to a snapshot), and restoring
>> is also done from a snapshot, so I'm not sure what to consider when
>> turning on the option.
>>
>> Thanks,
>> Jungtaek Lim
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
