Hi Filip,

I think I understand what you're describing, but let me know if this
response doesn't make sense.

In a lot of Hadoop deployments, users see and interact with file paths. For
example, the existence of a directory may trigger downstream processing by
signaling that a new partition is available in some table. Files and
partitions are also used directly in other places. We have a Hive table
pattern where we write new partitions to a staging table, audit the
partitions, and then publish them to a final table without moving the
actual data files. I think you're talking about similar use cases.
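
For reference, here is a minimal sketch of that Hive-era pattern driven
from Spark SQL. The table names, partition value, location, and the audit
check are all hypothetical placeholders:

    // Hypothetical sketch of the Hive staging/audit/publish pattern.
    // Table names, paths, and the audit check are placeholders.
    import org.apache.spark.sql.SparkSession;

    public class HiveStagingPublish {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().getOrCreate();

        // 1. Write the new partition into a staging table.
        spark.sql("INSERT OVERWRITE TABLE db.events_staging PARTITION (ds='2019-06-07') "
            + "SELECT id, ts, payload FROM db.events_raw WHERE ds = '2019-06-07'");

        // 2. Audit the staged partition (stand-in for real data-quality checks).
        long rows = spark.sql(
            "SELECT * FROM db.events_staging WHERE ds = '2019-06-07'").count();

        if (rows > 0) {
          // 3. Publish: point the final table's partition at the staged files.
          //    This is a metastore-only change; no data files are moved.
          spark.sql("ALTER TABLE db.events ADD IF NOT EXISTS PARTITION (ds='2019-06-07') "
              + "LOCATION 'hdfs://nn/warehouse/db/events_staging/ds=2019-06-07'");
        }
      }
    }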

You're right that in Iceberg, users should not modify or directly interact
with files underneath tables. I think of this like files underneath a
PostgreSQL instance: it is rare to make changes without going through the
database. But in this case, how do we migrate the patterns that rely on the
Hive table layout?

I think that those patterns need to be rebuilt to use the table
abstraction. We re-implemented the audit pattern that I described so that
the write creates a new snapshot in the output table without making it the
current table state. The audit then runs on that snapshot, and when the
audits pass, the table metadata is updated to make it the current table
state. This doesn't actually need to be partition-based; partitions were
just the tool we had available in Hive tables.
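
As a rough sketch of what that looks like against the table API -- this
assumes Iceberg's snapshot-management calls and the "wap.id" snapshot
summary property used to tag staged snapshots, and the audit itself is a
placeholder:

    // Hypothetical sketch of write-audit-publish on an Iceberg table.
    // Assumes the writer staged a snapshot tagged with a "wap.id" summary
    // property instead of making it the current table state.
    import org.apache.iceberg.Snapshot;
    import org.apache.iceberg.Table;

    public class AuditAndPublish {
      static void auditAndPublish(Table table, String wapId) {
        for (Snapshot snapshot : table.snapshots()) {
          if (wapId.equals(snapshot.summary().get("wap.id"))) {
            if (auditPasses(table, snapshot.snapshotId())) {
              // Metadata-only publish: cherry-pick the staged snapshot so
              // it becomes the current table state. No data files move.
              table.manageSnapshots().cherrypick(snapshot.snapshotId()).commit();
            }
            return;
          }
        }
      }

      static boolean auditPasses(Table table, long snapshotId) {
        return true; // stand-in for real data-quality checks on the snapshot
      }
    }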

It is also better to stop reaching into the underlying table structure. By
writing these patterns against the table abstraction, we separate the
logical use of the data from its physical layout. That's the first step to
being able to evolve that physical layout. If your audits depend on daily
partitions, then you have to completely rewrite the pipeline to move to
hourly partitioning; with Iceberg tables we avoid this by not exposing the
partitions directly.
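
To make that concrete, here's a small sketch (with hypothetical column
names): the partition spec is table metadata, and readers filter on the ts
column rather than on partition directories, so swapping day() for hour()
doesn't touch the pipeline:

    // Hypothetical sketch: pipelines filter on ts, not on directories, so
    // changing the spec from day() to hour() requires no pipeline rewrite.
    import org.apache.iceberg.PartitionSpec;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.types.Types;

    public class LayoutEvolution {
      public static void main(String[] args) {
        Schema schema = new Schema(
            Types.NestedField.required(1, "id", Types.LongType.get()),
            Types.NestedField.required(2, "ts", Types.TimestampType.withZone()));

        // Today's layout: daily partitions derived from ts.
        PartitionSpec daily = PartitionSpec.builderFor(schema).day("ts").build();

        // Tomorrow's layout: hourly partitions from the same column. Queries
        // and audits keep filtering on ts; only the spec changes.
        PartitionSpec hourly = PartitionSpec.builderFor(schema).hour("ts").build();

        System.out.println(daily);
        System.out.println(hourly);
      }
    }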

I hope that helps,

rb

On Fri, Jun 7, 2019 at 2:00 AM Filip <filip....@gmail.com> wrote:

> Hi devs,
>
> I need help figuring out if and how adopting Iceberg fits into the
> picture of a generic data-lake architecture. The use case is very broad,
> so please excuse the abrupt and naive attempt to summarize it in a short
> email. I'll start with a rundown of the general use case and try to
> narrow it down by the time I ask specific questions about Iceberg
> support...
>
> A generic data-lake architecture generally involves (at least) two
> stages for landing data before making it accessible for querying
> (terminology varies a lot, ranging from zones, to raw vs. processed
> stores, to ingestion and insights tiers, etc.).
> Data usually undergoes a particular set of transformations across these
> stages, either successfully advancing to the next stage or forfeiting
> the promotion process; in either case, a metadata operation records the
> status.
> When such a transformation succeeds, data is generally promoted to the
> next stage via a data move or a metadata operation, depending on the
> underlying file system implementation. Either way, it's a file path
> change.
>
> Adopting Iceberg as a data writer in any of the earlier stages would
> imply promoting Iceberg table changes along with the actual data files,
> so that consumers can eventually rely on the Iceberg format.
>
> Taking the naive approach of having corresponding Iceberg tables across
> the various stages, I was wondering if there is any support for
> promoting commits across two Iceberg tables by "tweaking" the data file
> paths as a metadata operation alone. Is that achievable with Iceberg
> today? This relates to my earlier question about extending the API
> based on Ryan's PR [1]; for this particular use case, supporting only
> append-files commits would suffice.
>
> As a side note (but one with probably considerable implications for the
> topic at hand): after reading up on "Updates/Deletes/Upserts in
> Iceberg" and trying to reason about the implications of implementing
> it, I got the feeling that file paths become entirely an Iceberg
> concern, totally opaque to consumers on both the write and read paths.
> I also believe that data compaction could no longer be an external
> process; it would have to understand Iceberg data file semantics. These
> two implications would have a considerable impact on the adoption of
> Iceberg in a generic data-lake architecture. Are these impressions/
> assumptions false?
>
> *Question*: Should Iceberg concern itself with supporting such use
> cases, to accommodate embedding it in a generic data-lake architecture
> in the first place, thinking solely from an adoption point of view?
>
> If anyone else has given this some thought and has either figured some
> of it out or wants to share ideas on the topic, please do.
>
> [1] https://github.com/apache/incubator-iceberg/pull/201
>
> --
> /Filip
>


-- 
Ryan Blue
Software Engineer
Netflix
