> That's my point, truly independent writers (two Spark jobs, or a Spark job
> and Dremio job) mean a distributed transaction. It would need yet another
> external transaction coordinator on top of both Spark and Dremio; Iceberg
> by itself cannot solve this.
>

I'm not ready to accept this. Iceberg already defines semantics for multiple
writers committing simultaneously and for how conflicts are resolved. The
same can be done here.
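
For concreteness, here's a rough sketch of the optimistic-commit idea I have
in mind. This is not Iceberg's actual commit path; the Snapshot record and the
in-memory AtomicReference standing in for the table metadata pointer are
purely illustrative:

    import java.util.List;
    import java.util.concurrent.atomic.AtomicReference;

    // Sketch of optimistic concurrency between two independent writers.
    // "Snapshot" is just an immutable list of file names, not Iceberg's class.
    class OptimisticCommit {
        record Snapshot(List<String> files) {}

        // Stand-in for the table's current-metadata pointer.
        static final AtomicReference<Snapshot> tablePointer =
            new AtomicReference<>(new Snapshot(List.of()));

        // Called independently by writer A (Spark) and writer B (Dremio).
        static void commit(List<String> newFiles) {
            while (true) {
                Snapshot base = tablePointer.get();
                // Conflict-resolution hook: re-validate newFiles against
                // whatever the other writer committed since we read base.
                List<String> merged = new java.util.ArrayList<>(base.files());
                merged.addAll(newFiles);
                Snapshot candidate = new Snapshot(List.copyOf(merged));
                if (tablePointer.compareAndSet(base, candidate)) {
                    return; // atomic swap succeeded, snapshot is committed
                }
                // another writer won the race; retry against the new base
            }
        }
    }

Two independent jobs can both call commit(); whichever one loses the
compare-and-swap re-validates and retries, which is the kind of
conflict-resolution hook I'm suggesting could be extended here.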


> By single writer, I don't mean single process, I mean multiple coordinated
> processes like Spark executors coordinated by Spark driver. The coordinator
> ensures that the data is pre-partitioned on
> each executor, and the coordinator commits the snapshot.
>
> Note however that single writer job/multiple concurrent reader jobs is
> perfectly feasible, i.e. it shouldn't be a problem to write from a Spark
> job and read from multiple Dremio queries concurrently (for example)
>

:D This is still "single process" from my perspective. That process may be
coordinating other processes to do distributed work but ultimately it is a
single process.
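
To make the distinction concrete, here is a minimal sketch (made-up names,
not real Spark or Iceberg APIs) of that coordinated-but-still-single-writer
model: workers produce files in parallel, but only the coordinator commits:

    import java.util.List;
    import java.util.concurrent.*;

    // Sketch of a "single writer job": many executors write data files in
    // parallel, but one coordinator gathers the results and commits a single
    // snapshot. Paths and names are illustrative only.
    class CoordinatedWrite {
        static String writePartition(int partitionId) {
            // each worker writes its pre-partitioned slice and returns the path
            return "s3://bucket/table/data/part-" + partitionId + ".parquet";
        }

        public static void main(String[] args) throws Exception {
            ExecutorService workers = Executors.newFixedThreadPool(4);
            List<Future<String>> pending = new java.util.ArrayList<>();
            for (int p = 0; p < 8; p++) {
                int partition = p;
                pending.add(workers.submit(() -> writePartition(partition)));
            }
            List<String> dataFiles = new java.util.ArrayList<>();
            for (Future<String> f : pending) {
                dataFiles.add(f.get()); // coordinator collects file handles
            }
            workers.shutdown();
            // Single commit point: from the table's perspective this whole
            // job is one writer, however many processes produced the files.
            System.out.println("committing snapshot with " + dataFiles.size() + " files");
        }
    }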


> I'm not sure what you mean exactly. If we can't enforce uniqueness we
> shouldn't assume it.
>

I disagree. We can specify uniqueness as a requirement and state that you'll
get unintended consequences if you provide your own keys and don't maintain
it.


> We do expect that most of the time the natural key is unique, but the
> eager and lazy with natural key designs can handle duplicates
> consistently. Basically it's not a problem to have duplicate natural keys,
> everything works fine.
>

That heavily depends on how things are implemented. For example, we may
write a bunch of code that generates internal data structures based on this
expectation. If we have to support duplicate matches, all of a sudden we can
no longer size various data structures to improve performance and may be
unable to preallocate the memory needed to guarantee completion.
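
As a purely hypothetical illustration of the sizing point: with
guaranteed-unique keys, a lookup structure can be presized from the row count
and hold fixed-width values; once duplicates are allowed, every entry becomes
a growable list and the preallocation guarantee is gone:

    import java.util.*;

    // Illustration of the sizing argument, not code from any implementation.
    class KeyIndexSizing {
        // Unique natural keys: exactly one entry per row, so the map can be
        // presized up front and each value is a single fixed-width offset.
        static Map<String, Long> buildUniqueIndex(List<String> keys) {
            Map<String, Long> index = new HashMap<>(keys.size() * 4 / 3 + 1);
            for (long row = 0; row < keys.size(); row++) {
                index.put(keys.get((int) row), row);
            }
            return index;
        }

        // Duplicate keys allowed: every value becomes a growable list, so we
        // can no longer bound memory or preallocate for guaranteed completion.
        static Map<String, List<Long>> buildMultiIndex(List<String> keys) {
            Map<String, List<Long>> index = new HashMap<>();
            for (long row = 0; row < keys.size(); row++) {
                index.computeIfAbsent(keys.get((int) row), k -> new ArrayList<>())
                     .add(row);
            }
            return index;
        }
    }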

> Let me try and clarify each point:
>
> - lookup for query or update on a non-(partition/bucket/sort) key
> predicate implies scanning large amounts of data - because these are the
> only data structures that can narrow down the lookup, right ? One could
> argue that the min/max index (file skipping) can be applied to any column,
> but in reality if that column is not sorted the min/max intervals can have
> huge overlaps so it may be next to useless.
> - remote storage - this is a critical architecture decision -
> implementations on local storage imply a vastly different design for the
> entire system, storage and compute.
> - deleting single records per snapshot is unfeasible in eager but also
> particularly in the lazy design: each deletion creates a very small
> snapshot. Deleting 1 million records one at a time would create 1 million
> small files, and 1 million RPC calls.
>

Why is this unfeasible? If I have a dataset of 100mm files including 1mm
small files, is that a major problem? It seems like your use case isn't one
where you want to support single-record deletes, but it is definitely
something important to many people.
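
As a hedged sketch of why that doesn't have to be fatal (illustrative only,
with a made-up threshold): each single-record delete adds one tiny delete
file, and a coalescing pass keeps the number of files a reader has to
consider bounded:

    import java.util.*;

    // Back-of-the-envelope sketch of single-record deletes with periodic
    // coalescing. Thresholds and structures are made up for illustration.
    class DeleteFileCoalescing {
        static final int COALESCE_THRESHOLD = 10_000;
        static final List<Set<String>> deleteFiles = new ArrayList<>();

        // Each single-record delete commits one tiny delete file.
        static void deleteRecord(String key) {
            deleteFiles.add(Set.of(key));
            if (deleteFiles.size() >= COALESCE_THRESHOLD) {
                coalesce();
            }
        }

        // Background/triggered pass: merge many small delete files into one,
        // so the count of live delete files stays bounded over time.
        static void coalesce() {
            Set<String> merged = new HashSet<>();
            for (Set<String> f : deleteFiles) {
                merged.addAll(f);
            }
            deleteFiles.clear();
            deleteFiles.add(merged);
        }
    }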


> Eager is conceptually just lazy + compaction done, well, eagerly. The
> logic for both is exactly the same, the trade-off is just that with eager
> you implicitly compact every time so that you don't do any work on read,
> while with lazy
> you want to amortize the cost of compaction over multiple snapshots.
>
> Basically there should be no difference between the two conceptually, or
> with regard to keys, etc. The only difference is some mechanics in
> implementation.
>

I think you have deconstructed the problem too much to say these are the same
(or at least that is what I'm starting to think given this thread). It seems
like real-world implementation decisions (per our discussion here) are in
conflict. For example, you just argued against having 1mm arbitrary mutations,
but I think that is because you aren't thinking about things over time with a
delta implementation. Having 10,000 mutations a day, with delta compaction
once a week and local file mappings (key-to-offset sparse bitmaps), seems like
it could result in very good performance in a case where we're mutating small
amounts of data. In this scenario, you may never do major compaction unless
you get to a high enough percentage of deleted records in the original
dataset. That drives a very different set of implementation decisions from a
situation where you're trying to restate an entire partition at once.
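
To sketch what I mean by key-to-offset sparse bitmaps (again illustrative: a
plain java.util.BitSet stands in for a real compressed sparse bitmap, and the
30% threshold is arbitrary): deletes mark row offsets in a per-file bitmap,
readers skip marked rows, and major compaction only triggers once enough of
the file is dead:

    import java.util.BitSet;

    // Sketch of a per-file delete bitmap applied at read time. A real
    // implementation would likely use a compressed sparse bitmap rather
    // than java.util.BitSet; the compaction threshold is a made-up example.
    class DeleteBitmapSketch {
        final int rowCount;
        final BitSet deletedOffsets;

        DeleteBitmapSketch(int rowCount) {
            this.rowCount = rowCount;
            this.deletedOffsets = new BitSet(rowCount);
        }

        // Delta commit: map a deleted key to its row offset and set the bit.
        void markDeleted(int rowOffset) {
            deletedOffsets.set(rowOffset);
        }

        // Read path: skip rows whose offset is marked as deleted.
        boolean isLive(int rowOffset) {
            return !deletedOffsets.get(rowOffset);
        }

        // Major compaction is only worth doing once enough of the file is dead.
        boolean needsMajorCompaction() {
            return deletedOffsets.cardinality() > 0.3 * rowCount;
        }
    }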
