Thanks for bringing this up and for the clear description of the problem, Marton.
We do try to avoid list operations like the one you're suggesting. I think listing tends to be used when we're trying to work around something like the situation you described here. This isn't _that_ bad now that S3 provides listing consistency, but I would still prefer to work another way if possible.

Can we not put extra metadata about the commit in the config like we do for other things? It seems a bit odd that the output committer isn't given enough information to commit correctly, and I think it would make more sense to solve that problem directly.

On Mon, Feb 22, 2021 at 4:16 AM Marton Bod <[email protected]> wrote:

> Hi Team,
>
> We are starting to implement insert overwrites for Iceberg tables in Hive.
> The current situation is that we are committing our inserts on the
> TezAM/Application master (MR) side, where we have no information whether
> the insert query was an insert overwrite or not. To make insert overwrites
> work, we have to migrate our job commit logic to the HS2 side into the
> HiveMetaHook, which does provide this overwrite flag that we need.
>
> However, in the HiveMetaHook, we lack some crucial information for the
> commit that we previously relied on, such as the JobID, the Tez VertexId,
> and the number of map/reduce tasks that have produced data files. What we
> do have access to is the table location and the query id. So our solution
> would be to collect all the information under the tableLocation/queryId
> folder during Tez/MR execution, and then in the HiveMetaHook, use file
> listing to get the contents of that folder - which would provide us with
> all the info we need for the commit to work reliably.
>
> This would mean a single listing operation per query, so while there's a
> performance overhead, it shouldn't be significant. Also, now that S3
> listing is consistent, it would be safe to rely on the results. However,
> given how the project has previously tried to minimize listing operations,
> I wanted to get your opinions on this, whether you have any objections or
> see any risks.
>
> Thanks a lot,
> Marton

-- 
Ryan Blue
Software Engineer
Netflix
