Thanks everybody for taking a look at the doc. FYI, I’ve updated it.

I would like to share some intermediate thoughts.

1. It seems beneficial to follow the stored procedure approach for small 
actions like rolling back or expiring snapshots. Presto already allows 
connectors to define stored procedures, and it will be much easier to add such 
syntax to other query engines since CALL is standard SQL. If we go that route, 
optional and named arguments can keep the syntax very reasonable for 
straightforward operations.
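As a sketch, a call with named and optional arguments might look like this 
(procedure names, argument names, and values are illustrative only, not a 
settled proposal):

```sql
-- Roll a table back to a previous snapshot; only the required
-- arguments need to be spelled out.
CALL system.rollback_to_snapshot(
  table => 'db.tbl',
  snapshot_id => 5781947118336215154
);

-- Expire old snapshots; named optional arguments let callers
-- override defaults selectively.
CALL system.expire_snapshots(
  table => 'db.tbl',
  older_than => TIMESTAMP '2020-07-01 00:00:00'
);
```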

2. There are still some cases where separate commands *may* make sense. For 
example, it may be more natural to have SNAPSHOT or MIGRATE as separate 
commands so that we can reuse well-known clauses like TBLPROPERTIES. Later, we 
may build a VACUUM command with different modes that combines 3-4 actions. We 
have SNAPSHOT and MIGRATE internally and they are frequently used (especially 
SNAPSHOT).
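For example, a standalone command could lean on familiar clauses (the syntax 
and property keys below are illustrative, not what we run internally):

```sql
-- Create an Iceberg table that shadows an existing source table,
-- reusing the well-known TBLPROPERTIES clause for options.
SNAPSHOT TABLE db.src AS db.src_snapshot
TBLPROPERTIES ('write.format.default' = 'parquet');

-- Convert a table in place to the Iceberg format.
MIGRATE TABLE db.src
TBLPROPERTIES ('migrate.keep-backup' = 'true');
```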

3. If we decide to build SNAPSHOT and MIGRATE as separate commands, it is 
unlikely we can get them into query engines even though the commands are 
generic. So, we may need to maintain them in Iceberg in the form of SQL 
extensions (e.g. an extended parser via SQL extensions in Spark). That may not 
always be possible in all query engines.

4. We need to align the syntax, including argument names, across query 
engines. Otherwise, even cosmetic differences between engines will make things 
a mess.

5. Spark does not have a plugin for stored procedures. There is a proposal 
from Ryan to add a function catalog API, but I think it is a bit different 
from a stored procedure catalog, as functions are used in SELECT while 
procedures are invoked via CALL. While we can explore how to add such support 
to Spark, we most likely need to start with SQL extensions in Iceberg. 
Otherwise, we will be blocked for a long time.

6. Wherever possible, SQL calls should return output summarizing what was 
done. For example, expiring snapshots should report the number of expired 
snapshots, the number of removed data and metadata files, the number of 
scanned manifests, etc. Importing a table should report the number of imported 
files, and so on.
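For instance, an expire call could produce a one-row summary result set 
(procedure name, arguments, and column names below are illustrative only):

```sql
CALL system.expire_snapshots(
  table => 'db.tbl',
  older_than => TIMESTAMP '2020-07-01 00:00:00'
);

-- A possible result set:
-- +-------------------+--------------------+------------------------+-------------------+
-- | expired_snapshots | removed_data_files | removed_metadata_files | scanned_manifests |
-- +-------------------+--------------------+------------------------+-------------------+
-- |                 3 |                120 |                      7 |                42 |
-- +-------------------+--------------------+------------------------+-------------------+
```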

7. SQL calls must be smart. For example, we should not blindly rewrite all 
metadata or data; commands should analyze what actually needs to be rewritten. 
I’ve tried to outline that for metadata and will submit a doc for data 
compaction.

- Anton


> On 23 Jul 2020, at 12:40, Anton Okolnychyi <aokolnyc...@apple.com.INVALID> 
> wrote:
> 
> Hi devs,
> 
> I want to start a discussion on whether we want to have some SQL extensions 
> in Iceberg that should help data engineers to invoke Iceberg-specific 
> functionality through SQL. I know companies have this internally but I would 
> like to unify this starting from Spark 3 and share the same syntax across 
> query engines to have a consistent behavior.
> 
> I’ve put together a short doc: 
> 
> https://docs.google.com/document/d/1Nf8c16R2hj4lSc-4sQg4oiUUV_F4XqZKth1woEo6TN8
> 
> I’d appreciate everyone’s feedback. Please, feel free to comment and add 
> alternatives.
> 
> Thanks,
> Anton 
