That looks like a good plan to me. Starting with stored procedures and
adding custom syntax where possible sounds like a good approach.

For Spark, I agree that we can start exploring a plugin that extends
Spark's syntax. Having that in place will make development faster and, I
think, make it easier to get this upstream.

On Mon, Jul 27, 2020 at 11:14 PM Anton Okolnychyi
<aokolnyc...@apple.com.invalid> wrote:

> Thanks everybody for taking a look at the doc. FYI, I’ve updated it.
>
> I would like to share some intermediate thoughts.
>
> 1. It seems beneficial to follow the stored procedures approach to
> invoke small actions like rolling back or expiring snapshots. Presto
> already allows connectors to define stored procedures, and it will be
> much easier to add such syntax to other query engines since it is
> standard SQL. If we go that route, optional and named arguments can keep
> the syntax concise and readable for straightforward operations.
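>
> For illustration, a call with named, optional arguments might look like
> this (the procedure and argument names here are hypothetical, only to
> sketch the shape of the syntax):
>
>   CALL iceberg.system.expire_snapshots(
>     table => 'db.events',
>     older_than => TIMESTAMP '2020-07-01 00:00:00',
>     retain_last => 5);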
>
> 2. There are still some cases where separate commands *may* make sense.
> For example, it may be more natural to have SNAPSHOT or MIGRATE as separate
> commands. That way, we can use well-known clauses like TBLPROPERTIES.
> Later, we may build a VACUUM command with different modes to combine 3-4
> actions. We have SNAPSHOT and MIGRATE internally and they are frequently
> used (especially SNAPSHOT).
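>
> As a sketch, a standalone SNAPSHOT command could reuse familiar clauses
> like this (the syntax and property shown are only illustrative):
>
>   SNAPSHOT parquet_catalog.db.src AS iceberg.db.src_snapshot
>   TBLPROPERTIES ('read.split.target-size' = '268435456');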
>
> 3. If we decide to build SNAPSHOT and MIGRATE as separate commands, it is
> unlikely we can get them into query engines even though the commands are
> generic. So, we may need to maintain them in Iceberg in the form of SQL
> extensions (e.g., an extended parser via SQL extensions in Spark). That
> approach may not be possible in every query engine.
>
> 4. We need to align the syntax, including argument names, across query
> engines. Otherwise, even cosmetic differences between engines will make
> things a mess.
>
> 5. Spark does not have a plugin for stored procedures. There is a
> proposal from Ryan to add a function catalog API, but I think that is a
> bit different from a stored procedure catalog: functions are used in
> SELECT, while procedures are used in CALL. We can explore how to add such
> support to Spark, but we most likely need to start with SQL extensions in
> Iceberg. Otherwise, we will be blocked for a long time.
>
> 6. Wherever possible, SQL calls must return output that summarizes what
> was done. For example, if we expire snapshots, return the number of
> expired snapshots, the number of removed data and metadata files, the
> number of scanned manifests, etc. If we import a table, output the number
> of imported files, etc.
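>
> For illustration (the column names and figures here are invented), the
> expire snapshots call sketched above could return a one-row summary:
>
>   expired_snapshots | removed_data_files | removed_manifests
>   ------------------+--------------------+------------------
>                  12 |                340 |                 8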
>
> 7. SQL calls must be smart. For example, we should not simply rewrite all
> metadata or data. Commands should analyze what needs to be rewritten. I’ve
> tried to outline that for metadata and will submit a doc for data
> compaction.
>
> - Anton
>
>
> On 23 Jul 2020, at 12:40, Anton Okolnychyi <aokolnyc...@apple.com.INVALID>
> wrote:
>
> Hi devs,
>
> I want to start a discussion on whether we want to have some SQL
> extensions in Iceberg that help data engineers invoke Iceberg-specific
> functionality through SQL. I know companies have this internally, but I
> would like to unify it, starting with Spark 3, and share the same syntax
> across query engines for consistent behavior.
>
> I’ve put together a short doc:
>
>
> https://docs.google.com/document/d/1Nf8c16R2hj4lSc-4sQg4oiUUV_F4XqZKth1woEo6TN8
>
> I’d appreciate everyone’s feedback. Please, feel free to comment and add
> alternatives.
>
> Thanks,
> Anton
>
>
>

-- 
Ryan Blue
Software Engineer
Netflix
