Thanks for bringing this up and for the clear description of the problem,
Marton.

We do try to avoid list operations like the one you're suggesting. I think
listing tends to be used when we're trying to work around something like the
situation you described here. This isn't _that_ bad now that S3 provides
listing consistency, but I would still prefer another approach if possible.
Can we not put the extra metadata about the commit in the config, like we do
for other things? It seems a bit odd that the output committer isn't given
enough information to commit correctly, and I think it would make more sense
to solve that problem directly.
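
As a rough illustration only (this is not existing Iceberg or Hive code), the
idea of carrying commit metadata in the config could look something like the
sketch below. A plain Map stands in for the job configuration, and the key
prefix, class, and method names are all hypothetical. It also assumes that the
side writing the values and the side doing the commit can see the same config,
which would need to be confirmed for the HiveMetaHook case.

```java
import java.util.HashMap;
import java.util.Map;

public class CommitInfoInConfig {
    // Hypothetical key prefix, made up for this sketch.
    static final String PREFIX = "example.commit.";

    // Simple holder for the values the committer needs.
    static final class CommitInfo {
        final String jobId;
        final int taskCount;
        final boolean overwrite;

        CommitInfo(String jobId, int taskCount, boolean overwrite) {
            this.jobId = jobId;
            this.taskCount = taskCount;
            this.overwrite = overwrite;
        }
    }

    // Store the commit metadata under keys scoped by query id.
    static void writeCommitInfo(Map<String, String> conf, String queryId,
                                String jobId, int taskCount, boolean overwrite) {
        String base = PREFIX + queryId + ".";
        conf.put(base + "job.id", jobId);
        conf.put(base + "task.count", Integer.toString(taskCount));
        conf.put(base + "overwrite", Boolean.toString(overwrite));
    }

    // Read the metadata back at commit time; fail loudly if it is missing
    // instead of falling back to a directory listing.
    static CommitInfo readCommitInfo(Map<String, String> conf, String queryId) {
        String base = PREFIX + queryId + ".";
        String jobId = conf.get(base + "job.id");
        String taskCount = conf.get(base + "task.count");
        String overwrite = conf.get(base + "overwrite");
        if (jobId == null || taskCount == null || overwrite == null) {
            throw new IllegalStateException(
                "Missing commit metadata for query " + queryId);
        }
        return new CommitInfo(jobId, Integer.parseInt(taskCount),
                              Boolean.parseBoolean(overwrite));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        writeCommitInfo(conf, "query-1", "job_0001", 4, true);
        CommitInfo info = readCommitInfo(conf, "query-1");
        System.out.println(info.jobId + " " + info.taskCount + " " + info.overwrite);
    }
}
```

The point of the sketch is that the committer looks up exactly the values it
needs by key, so no listing of the table location is involved.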

On Mon, Feb 22, 2021 at 4:16 AM Marton Bod wrote:

> Hi Team,
>
> We are starting to implement insert overwrites for Iceberg tables in Hive.
> The current situation is that we are committing our inserts on the
> TezAM/Application master (MR) side, where we have no information whether
> the insert query was an insert overwrite or not. To make insert overwrites
> work, we have to migrate our job commit logic to the HS2 side into the
> HiveMetaHook, which does provide this overwrite flag that we need.
>
> However, in the HiveMetaHook, we lack some crucial information for the
> commit that we previously relied on, such as the JobID, the Tez VertexId,
> and the number of map/reduce tasks that have produced data files. What we
> do have access to is the table location and the query id. So our solution
> would be to collect all the information under the tableLocation/queryId
> folder during Tez/MR execution, and then in the HiveMetaHook, use file
> listing to get the contents of that folder - which would provide us with
> all the info we need for the commit to work reliably.
>
> This would mean a single listing operation per query, so while there's a
> performance overhead, it shouldn't be significant. Also, now that S3
> listing is consistent, it would be safe to rely on the results. However,
> given how the project has previously tried to minimize listing operations,
> I wanted to get your opinions on this, whether you have any objections or
> see any risks.
>
> Thanks a lot,
> Marton
>
>

-- 
Ryan Blue
Software Engineer
Netflix
