> If your commit is taking 13 seconds to produce a retry, then it must be that you have a lot of metadata. That makes sense if you're producing a new snapshot every 10 seconds. It could be that at that rate, you have a large number of manifests and a large number of snapshots, causing a large metadata file. To solve the problem, I recommend using expireSnapshots regularly to keep the metadata from growing too large. That should keep commits quick and allow this to complete.
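A minimal sketch of what running expireSnapshots regularly could look like against the core Table API (the class name and retention values below are illustrative only, not a recommendation):

    import java.util.concurrent.TimeUnit;
    import org.apache.iceberg.Table;

    public class ExpireSnapshotsRegularly {
      // Assumes `table` was loaded from a catalog (HadoopCatalog, HiveCatalog, ...).
      public static void expire(Table table) {
        long olderThan = System.currentTimeMillis() - TimeUnit.HOURS.toMillis(1);
        table.expireSnapshots()
            .expireOlderThan(olderThan) // drop snapshots older than one hour
            .retainLast(100)            // but always keep the 100 most recent
            .commit();
      }
    }

Running something like this on a schedule is what keeps the snapshot count, and therefore the metadata file size, bounded.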
I ran expireSnapshots (again) and also RewriteManifestsAction, so the table now contains only 874 snapshots (if we trigger every 1 minute we'd get 1,440 snapshots per day, so that should be an acceptable number in practice) and 563 manifests (checked by querying the table's inspection metadata tables). But even after this optimization, the commit for RewriteDataFilesAction seems to take a similar amount of time and eventually fails. The number of data files doesn't go down and keeps increasing (more than 2 million), but my understanding is that this shouldn't affect the commit phase much once the manifests are optimized. Do I understand correctly? On the other hand, the optimization itself works well - the commit phase of the running streaming query took under 100 ms.

> Every 10 seconds is a really rapid pace for commits in Iceberg, by the way. To support that kind of rate, I think you might need a better way to track the root metadata and snapshots than the current JSON file. Making this better and being able to plug in different metadata stores is why we built an interface for this, TableOperations <https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/TableOperations.java>. By implementing that, you could use a database for your metadata instead of a JSON file, which could be faster for this use case. That's also what you'd implement to inject a different FileIO library, or to change data file placement.

Ah yes, while I'm experimenting with HadoopCatalog, eventually we'd deal with HiveCatalog in many cases (probably by default, if that's feasible?), since supporting S3 is also one of the requirements. From what I've skimmed of the HiveCatalog code, it appears to take an exclusive lock, which looks OK for covering this case (at least the one above), though the latency of dealing with the HMS (locking and alter table) would add to the latency of the commit phase. (That wouldn't matter as long as the latency stays reasonable for streaming workloads as well.)

On Wed, Jul 29, 2020 at 4:21 AM Ryan Blue <rb...@netflix.com> wrote:

> I verified that the rewrite action will correctly avoid filtering manifests that can't contain the deleted files, so retries should be okay. (The relevant part of the code is in ManifestFilterManager <https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/ManifestFilterManager.java#L329-L332>)
>
> If your commit is taking 13 seconds to produce a retry, then it must be that you have a lot of metadata. That makes sense if you're producing a new snapshot every 10 seconds. It could be that at that rate, you have a large number of manifests and a large number of snapshots, causing a large metadata file. To solve the problem, I recommend using expireSnapshots regularly to keep the metadata from growing too large. That should keep commits quick and allow this to complete.
>
> Every 10 seconds is a really rapid pace for commits in Iceberg, by the way. To support that kind of rate, I think you might need a better way to track the root metadata and snapshots than the current JSON file. Making this better and being able to plug in different metadata stores is why we built an interface for this, TableOperations <https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/TableOperations.java>. By implementing that, you could use a database for your metadata instead of a JSON file, which could be faster for this use case. That's also what you'd implement to inject a different FileIO library, or to change data file placement.
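To make the TableOperations suggestion above concrete, here is a rough, hypothetical skeleton of a custom implementation. Only the overridden method signatures come from the linked interface; the class name and the AtomicReference standing in for an external metadata store (a database row, a key-value entry, ...) are made up for illustration:

    import java.util.concurrent.atomic.AtomicReference;
    import org.apache.iceberg.LocationProviders;
    import org.apache.iceberg.TableMetadata;
    import org.apache.iceberg.TableOperations;
    import org.apache.iceberg.exceptions.CommitFailedException;
    import org.apache.iceberg.io.FileIO;
    import org.apache.iceberg.io.LocationProvider;

    public class PointerStoreTableOperations implements TableOperations {
      private final FileIO io;
      private final String tableLocation;
      // Stand-in for the external store that would hold the current metadata pointer.
      private final AtomicReference<TableMetadata> pointer = new AtomicReference<>();

      public PointerStoreTableOperations(FileIO io, String tableLocation) {
        this.io = io;
        this.tableLocation = tableLocation;
      }

      @Override
      public TableMetadata current() {
        return pointer.get();
      }

      @Override
      public TableMetadata refresh() {
        // A real implementation would re-read the pointer from the store here.
        return pointer.get();
      }

      @Override
      public void commit(TableMetadata base, TableMetadata metadata) {
        // Optimistic concurrency: succeed only if nobody else moved the pointer.
        if (!pointer.compareAndSet(base, metadata)) {
          throw new CommitFailedException(
              String.format("Metadata was updated concurrently for table at %s", tableLocation));
        }
      }

      @Override
      public FileIO io() {
        return io;
      }

      @Override
      public String metadataFileLocation(String fileName) {
        return String.format("%s/metadata/%s", tableLocation, fileName);
      }

      @Override
      public LocationProvider locationProvider() {
        // Assumes the table already has committed metadata with properties.
        return LocationProviders.locationsFor(tableLocation, current().properties());
      }
    }

The important part is the compare-and-swap in commit(): any store that can perform that swap atomically can replace the JSON pointer file, which is the point about using a database for this kind of commit rate.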
> On Tue, Jul 28, 2020 at 12:56 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>
>> The case of repeatedly failing to commit the result of rewriting data files isn't a matter of the number of retries, but of the latency of constructing the new commit when a conflict happens.
>>
>> I haven't looked into the details of how the new commit is constructed to reflect the new metadata, but it took 13 seconds to construct and attempt the commit, while the trigger is set to 10 seconds and each batch takes less than 1 second. So it always loses the optimistic-concurrency race. This is still an experiment (single VM node and local FS) and I'm not sure end users would really run into this (especially in production), but it's probably good to know up front and address if possible (optimize the latency if we can, or run the data file rewrite periodically from the streaming writer, with reduced latency or in the background).
>>
>> On Tue, Jul 28, 2020 at 12:41 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>
>>> I'd love to contribute documentation about the actions - I just need some time to understand the needs for some of them (like RewriteManifestsAction).
>>>
>>> I just submitted a PR for the Structured Streaming sink [1]. I mentioned expireSnapshots() there with a link to the javadoc page, but it would be nice if there were also a code example on the API page, since it isn't bound to Spark's case (any fast-changing case, including Flink streaming, would need this as well).
>>>
>>> 1. https://github.com/apache/iceberg/pull/1261
>>>
>>> On Tue, Jul 28, 2020 at 10:39 AM Ryan Blue <rb...@netflix.com> wrote:
>>>
>>>> > seems to fail when a high-rate streaming write query is running on the other side
>>>>
>>>> This kind of situation is where you'd want to tune the number of retries for a table. That's a likely source of the problem. We can also check to make sure we're being smart about conflict detection. A rewrite needs to scan any manifests that might have data files that conflict, which is why retries can take a little while. The farther the rewrite is from active data partitions, the better. And we can double-check to make sure we're using the manifest file partition ranges to avoid doing unnecessary work.
>>>>
>>>> > from the end user's point of view, none of the actions are documented
>>>>
>>>> Yes, we need to add documentation for the actions. If you're interested, feel free to open PRs! The actions are fairly new, so we don't yet have docs for them.
>>>>
>>>> Same with the streaming sink; we just need someone to write up docs and contribute them. We don't use the streaming sink, so I've unfortunately overlooked it.
>>>>
>>>> On Mon, Jul 27, 2020 at 3:25 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> Thanks for the quick response!
>>>>>
>>>>> And yes, I also experimented with expireSnapshots() and it looked good. I can imagine some alternative conditions for expiring snapshots (like adjusting the "granularity" between snapshots instead of removing all snapshots before a specific timestamp), but for now that's just an idea and I can't point to real-world needs for it.
>>>>>
>>>>> I also went through RewriteDataFilesAction and it looked good as well. There's an existing GitHub issue to make the action more intelligent, which is valid and would be good to add.
>>>>> One thing I noticed is that it's a fairly time-consuming task (expected, of course, and not a problem) and it seems to fail when a high-rate streaming write query is running on the other side (this is a concern). I guess the action only touches old snapshots, hence no conflict should be expected against fast appends. It would be nice to know whether this is expected behavior and it's recommended to stop all writes before running the action, or whether it sounds like a bug.
>>>>>
>>>>> I haven't gone through RewriteManifestsAction yet; for now I'm only curious about when it's needed. I'm eager to experiment with the streaming source that is in review - I don't know the details of Iceberg well enough to be qualified to review it, so I'd rather play with it when it's available and use that as a chance to learn about Iceberg itself.
>>>>>
>>>>> Btw, from the end user's point of view, none of the actions are documented - even the Structured Streaming sink is not documented, and I had to go through the code. While I think it's obviously worth documenting a streaming sink in the Spark docs (wondering why that documentation was missed), would we want to document the actions as well? It looks like these actions are still evolving, so I'm wondering whether we're waiting for them to stabilize, or the documentation was just missed.
>>>>>
>>>>> Thanks,
>>>>> Jungtaek Lim
>>>>>
>>>>> On Tue, Jul 28, 2020 at 2:45 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>
>>>>>> Hi Jungtaek,
>>>>>>
>>>>>> That setting controls whether Iceberg cleans up old copies of the table metadata file. The metadata file holds references to all of the table's snapshots (that have not expired) and is self-contained. No operations need to access previous metadata files.
>>>>>>
>>>>>> Those aren't typically that large, but they can be when streaming data because you create a lot of versions. For streaming, I'd recommend turning it on and making sure you're running `expireSnapshots()` regularly to prune old table versions -- although expiring snapshots will remove them from table metadata and limit how far back you can time travel.
>>>>>>
>>>>>> On Mon, Jul 27, 2020 at 4:33 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi devs,
>>>>>>>
>>>>>>> I'm experimenting with Apache Iceberg as a Structured Streaming sink - I plan to experiment with the source as well, but I see the PR is still in review.
>>>>>>>
>>>>>>> It seems that "fast append" helps a lot to keep commit latency reasonable, though the metadata directory grows too fast. I found the option 'write.metadata.delete-after-commit.enabled' (false by default), enabled it, and the overall size looks fine afterwards.
>>>>>>>
>>>>>>> That said, given that the option is false by default, I'm wondering what would be impacted by turning it on. My understanding is that it doesn't affect time travel (as that refers to a snapshot), and restoring is also done from a snapshot, so I'm not sure what to consider when turning the option on.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jungtaek Lim
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
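For completeness, the two knobs that come up in the quoted thread - the per-table commit retry count Ryan mentions and the metadata-file cleanup option - are ordinary table properties. A sketch of setting them (values are illustrative; the retry and previous-versions property names are the ones defined in Iceberg's TableProperties, so double-check them against your version):

    import org.apache.iceberg.Table;

    public class TuneCommitProperties {
      public static void configure(Table table) {
        table.updateProperties()
            // retry the commit more times when a fast-append writer keeps winning the race
            .set("commit.retry.num-retries", "10")
            // delete old metadata.json versions after each commit ...
            .set("write.metadata.delete-after-commit.enabled", "true")
            // ... keeping only the most recent few around
            .set("write.metadata.previous-versions-max", "10")
            .commit();
      }
    }

As the thread notes, raising the retry count only helps if each retry is cheap; keeping snapshots and manifests pruned is what actually keeps retries cheap.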