> seems to fail when a high-rate streaming write query is running on the
> other side

This kind of situation is where you'd want to tune the number of retries
for a table. That's a likely source of the problem. We can also check to
make sure we're being smart about conflict detection. A rewrite needs to
scan any manifests that might have data files that conflict, which is why
retries can take a little while. The farther the rewrite is from active
data partitions, the better. And we can double-check to make sure we're
using the manifest file partition ranges to avoid doing unnecessary work.
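
For tuning retries, here's a minimal sketch using the table properties API
(the property names are Iceberg's commit retry settings; the values here are
illustrative, not recommendations):

    // Assumes `table` is an org.apache.iceberg.Table loaded from your catalog.
    // More retries and a longer backoff give the rewrite's commit a better
    // chance against a fast-appending streaming writer.
    table.updateProperties()
        .set("commit.retry.num-retries", "10")    // default is 4
        .set("commit.retry.min-wait-ms", "1000")  // default is 100
        .commit();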

> from the end user's point of view, none of the actions are documented

Yes, we need to add documentation for the actions. If you're interested,
feel free to open PRs! The actions are fairly new, so we don't yet have
docs for them.

Same with the streaming sink: we just need someone to write up the docs and
contribute them. We don't use the streaming sink ourselves, so I've
unfortunately overlooked it.

On Mon, Jul 27, 2020 at 3:25 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

> Thanks for the quick response!
>
> And yes, I also experimented with expireSnapshots() and it looked good. I
> can imagine some alternative conditions for expiring snapshots (like
> adjusting the "granularity" between snapshots instead of removing all
> snapshots before a specific timestamp), but for now it's just an idea and I
> don't have real-world needs backing it.
>
> I also went through RewriteDataFilesAction and it looked good as well.
> There's an existing GitHub issue to make the action more intelligent, which
> is valid and good to add. One thing I noticed is that it's a bit
> time-consuming (expected for sure, not a problem) and it seems to fail when
> a high-rate streaming write query is running on the other side (this is a
> concern). I guess the action only touches old snapshots, hence no conflict
> is expected against fast appends. It would be nice to know whether this is
> expected behavior and it's recommended to stop all writes before running
> the action, or whether it sounds like a bug.
>
> I haven't gone through RewriteManifestsAction yet; for now I'm only
> curious about the use cases. I'm eager to experiment with the streaming
> source which is in review - I don't know the details of Iceberg, so I'm not
> qualified to participate in the review. I'd rather play with it when it's
> available and use it as a chance to learn about Iceberg itself.
>
> Btw, from the end user's point of view, none of the actions are documented
> - even the Structured Streaming sink is not documented, and I had to go
> through the code. While I think it's obviously worth documenting the
> streaming sink in the Spark docs (I wonder why it was missed), would we
> want to document the actions as well? It looks like these actions are still
> evolving, so I'm wondering whether we're waiting for them to stabilize, or
> the documentation was just missed.
>
> Thanks,
> Jungtaek Lim
>
> On Tue, Jul 28, 2020 at 2:45 AM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Hi Jungtaek,
>>
>> That setting controls whether Iceberg cleans up old copies of the table
>> metadata file. The metadata file holds references to all of the table's
>> snapshots (that have not expired) and is self-contained. No operations need
>> to access previous metadata files.
>>
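>> A minimal sketch of turning that cleanup on via table properties (the
>> property names are Iceberg's; the cap on retained versions is
>> illustrative):
>>
>>     // Assumes `table` is an org.apache.iceberg.Table loaded from your catalog.
>>     table.updateProperties()
>>         .set("write.metadata.delete-after-commit.enabled", "true")
>>         .set("write.metadata.previous-versions-max", "50")  // how many old metadata files to keep
>>         .commit();
>>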
>> Those files aren't typically that large, but they can be when streaming
>> because you create a lot of versions. For streaming, I'd recommend turning
>> it on and making sure you're running `expireSnapshots()` regularly to prune
>> old table versions -- although expiring snapshots will remove them from
>> table metadata and limit how far back you can time travel.
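>>
>> A minimal sketch of regular expiration (the one-day cutoff here is
>> illustrative):
>>
>>     // Assumes `table` is an org.apache.iceberg.Table loaded from your catalog.
>>     long cutoff = System.currentTimeMillis() - (24L * 60 * 60 * 1000);  // one day ago
>>     table.expireSnapshots()
>>         .expireOlderThan(cutoff)  // remove snapshots older than the cutoff
>>         .commit();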
>>
>> On Mon, Jul 27, 2020 at 4:33 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Hi devs,
>>>
>>> I'm experimenting with Apache Iceberg as a Structured Streaming sink - I
>>> plan to experiment with the source as well, but I see the PR is still in
>>> review.
>>>
>>> It seems that "fast append" helps a lot to keep commit latency
>>> reasonable, though the metadata directory grows too fast. I found the
>>> option 'write.metadata.delete-after-commit.enabled' (false by default),
>>> enabled it, and the overall size looks fine afterwards.
>>>
>>> That said, given the option is false by default, I'm wondering what
>>> would be impacted by turning it on. My understanding is that it doesn't
>>> affect time travel (as that refers to a snapshot), and restoring is also
>>> done from a snapshot, so I'm not sure what to consider when turning on
>>> the option.
>>>
>>> Thanks,
>>> Jungtaek Lim
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix
