Our compaction isn't very sophisticated right now. We have a service that gets notifications when tables are updated and, for the tables that have opted in, loads the metadata for all of the files in the changed partitions. Then we run bin packing (see BinPacking <https://github.com/apache/incubator-iceberg/blob/master/core/src/main/java/org/apache/iceberg/util/BinPacking.java>) and use a few rules to determine whether to merge the partitions: bin pack sorted by file path with lookback 1 to avoid changing data clustering, don't merge bins close to the target size, and don't merge unless there are at least N files to merge (to avoid high write volume). Those are basically the same rules we use for manifest maintenance.
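
To make the rules above concrete, here is a minimal Python sketch. It is an illustration only: it is not the service's code and it does not use the API of Iceberg's BinPacking class. The names DataFile, pack_in_path_order, and bins_to_merge and the close_ratio threshold are hypothetical. Reading "close to the target size" as "the bin already contains a file near the target size" is also an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file's metadata."""
    path: str
    size_bytes: int


def pack_in_path_order(files: List[DataFile], target_bytes: int) -> List[List[DataFile]]:
    """Greedy packing with lookback 1: only the most recent bin stays open.

    Files are visited in path order and a bin is never revisited once it is
    closed, so every bin holds a contiguous run of paths and the existing
    clustering of the data is preserved.
    """
    bins: List[List[DataFile]] = []
    current: List[DataFile] = []
    current_size = 0
    for f in sorted(files, key=lambda df: df.path):
        # Close the open bin if this file would push it past the target.
        if current and current_size + f.size_bytes > target_bytes:
            bins.append(current)
            current, current_size = [], 0
        current.append(f)
        current_size += f.size_bytes
    if current:
        bins.append(current)
    return bins


def bins_to_merge(files: List[DataFile], target_bytes: int, min_files: int,
                  close_ratio: float = 0.8) -> List[List[DataFile]]:
    """Return only the bins worth rewriting under the two skip rules."""
    selected = []
    for b in pack_in_path_order(files, target_bytes):
        # Rule: need at least N files, otherwise the rewrite is not worth it.
        if len(b) < min_files:
            continue
        # Rule (assumed reading): skip bins that already hold a file close
        # to the target size, since rewriting it gains little.
        if max(f.size_bytes for f in b) >= close_ratio * target_bytes:
            continue
        selected.append(b)
    return selected
```

Calling bins_to_merge(files, target_bytes, min_files) on the files of one changed partition returns the groups of files that would each be rewritten into a single file. An empty result means the partition is left alone.
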
Metadata is managed separately from data. Usually, we don't have to change this because partitions align with writes.

We expire snapshots after 7 days, but we want to get that down to 3 by tracking expired snapshots in the metadata.

We don't use SQL APIs for data compaction, though users could. We are building a service to compact, which will only compact a table if it makes sense to. So in the future, we want it to have information about whether data will be read, whether there are too many files in a table, etc. to make smart choices.

On Tue, Jun 4, 2019 at 4:05 PM Anton Okolnychyi <aokolnyc...@apple.com> wrote:

> Not directly related to this topic, but still pretty interesting as we
> mentioned the PR for rewriting manifests.
>
> Ryan, could you, also, share some insights on how you do compactions? Do
> you compact metadata separately from bin-packing files? How frequently do
> you expire snapshots? Do you expose SQL APIs for this or it is all
> happening automatically?
>
> Thanks,
> Anton
>
> On 3 Jun 2019, at 22:28, Erik Wright <erik.wri...@shopify.com.INVALID> wrote:
> >
> > Thanks for sharing those observations. They are very pertinent.
> >
> > On Mon, Jun 3, 2019 at 5:19 PM Ryan Blue <rb...@netflix.com> wrote:
> > Repeated conflicts is something that we keep an eye on in our
> > infrastructure. We have streaming tables that are written to every 10
> > minutes from multiple regions, commits to move the files back to a single
> > region, and compaction all happening at the same time. We don't really see
> > a significant problem with several writers. The manifest list files are
> > generally small enough that it's okay. Definitely better than keeping all
> > that information in the root metadata file.
> >
> > On Mon, Jun 3, 2019 at 2:13 PM Erik Wright <erik.wri...@shopify.com> wrote:
> > Thanks for the response, Ryan. I can certainly see what the benefits of
> > manifest files are.
> > I can see that with potentially long lists of valid snapshots, each
> > having long lists of manifest files, the mere process of committing a
> > new snapshot could, itself, become costly and increase the likelihood
> > of commit conflicts.
> >
> > I gather that the potential for repeated commit conflicts due to the
> > cost of rewriting the manifest list file after each failed attempt is
> > not something that has really materialized yet.
> >
> > On Mon, Jun 3, 2019 at 4:50 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
> > Hi Erik,
> >
> > Manifest lists serve two purposes:
> >
> >   • Reduce the amount of data tracked by the root metadata file
> >   • Provide a rough index over manifest files to cut down on planning time
> >
> > Manifests are reused to cut down on the amount of work required in a
> > commit, but by doing this we end up with a large number of manifests. That
> > list gets expensive if it is added to the root metadata, which includes all
> > valid snapshots. So moving that list to its own file allows Iceberg to
> > avoid reading the list unless it is used, and to avoid re-writing the list
> > for every valid snapshot.
> >
> > As long as the list is written to its own file, we may as well write
> > metadata about partitions in each manifest so that we can skip manifests
> > that don’t match a query. That’s where the rough index comes from, and it
> > really does speed up queries. In fact, we have a new PR out to rewrite
> > manifests to take advantage of this:
> > https://github.com/apache/incubator-iceberg/pull/200/files
> >
> > Does that answer your question?
> >
> > On Mon, Jun 3, 2019 at 1:38 PM Erik Wright <erik.wri...@shopify.com.invalid> wrote:
> > In the process of following up on the "Updates/Deletes/Upserts" thread,
> > I'm re-reading the table spec. I have a question about Manifest List files.
> >
> > If I understand correctly, the manifest list files are separate files
> > that are created prior to attempting to commit a new snapshot.
> > Each snapshot may have a single manifest list file. The manifest list
> > file references _all_ manifest files included in the snapshot.
> >
> > During a commit collision, two writers will produce new manifest list
> > files. Assuming the two writes are compatible (one is append, one is
> > replace, for example) the loser should be able to re-process their commit
> > without rewriting any data files but will, nonetheless, need to rewrite
> > their manifest list file in addition to rewriting their snapshot file.
> >
> > I was under the impression that it was a design objective to minimize
> > the amount of work required in order to retry a commit. The inability to
> > compose multiple manifest list files together seems like it adds mandatory
> > read and write steps to almost every commit collision.
> >
> > Can someone clarify what the philosophy is with regards to minimizing
> > the cost of commit retries?
> >
> > Thanks!
> >
> > -Erik
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix

--
Ryan Blue
Software Engineer
Netflix