That looks like a good plan to me. Starting with stored procedures and adding custom syntax where possible sounds like the right way to begin.
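For illustration only, a procedure call with positional and named arguments could look roughly like this (the procedure and argument names below are placeholders for discussion, not a proposal):

  CALL iceberg.system.rollback_to_snapshot('db.sample', 123456789)

  CALL iceberg.system.expire_snapshots(
    table      => 'db.sample',
    older_than => TIMESTAMP '2020-07-01 00:00:00'
  )

Named arguments keep optional parameters readable, and since CALL is standard SQL, statements like these should be reasonably portable across engines that support it.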
For Spark, I agree that we can start exploring a plugin that can extend Spark's syntax. Having that done will make development faster and make it easier to get this upstream, I think. (A rough sketch of the kind of statements such a parser extension could add is at the bottom of this mail, below the quoted thread.)

On Mon, Jul 27, 2020 at 11:14 PM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:

> Thanks everybody for taking a look at the doc. FYI, I’ve updated it.
>
> I would like to share some intermediate thoughts.
>
> 1. It seems beneficial to follow the stored procedures approach to call small actions like rollback or expire snapshots. Presto already allows connectors to define stored procedures, and it will be much easier to add such syntax to other query engines as it is standard SQL. If we go that route, optional arguments and name-based arguments can make the syntax very reasonable for straightforward operations.
>
> 2. There are still some cases where separate commands *may* make sense. For example, it may be more natural to have SNAPSHOT or MIGRATE as separate commands. That way, we can use well-known clauses like TBLPROPERTIES. Later, we may build a VACUUM command with different modes to combine 3-4 actions. We have SNAPSHOT and MIGRATE internally and they are frequently used (especially SNAPSHOT).
>
> 3. If we decide to build SNAPSHOT and MIGRATE as separate commands, it is unlikely we can get them into query engines even though the commands are generic. So, we may need to maintain them in Iceberg in the form of SQL extensions (e.g. an extended parser via SQL extensions in Spark). That may not always be possible in all query engines.
>
> 4. We need to align the syntax, including argument names, across query engines. Otherwise, it will be a mess if there is a cosmetic difference in each query engine.
>
> 5. Spark does not have a plugin for stored procedures. There is a proposal from Ryan to add a function catalog API. I think it is a bit different from a stored procedure catalog, as functions are used in SELECT and procedures are used in CALL. While we can explore how to add such support to Spark, we most likely need to start with SQL extensions in Iceberg. Otherwise, we will be blocked for a long time.
>
> 6. Wherever possible, SQL calls must return some output that summarizes what was done. For example, if we expire snapshots, return the number of expired snapshots, the number of removed data and metadata files, the number of scanned manifests, etc. If we import a table, output the number of imported files, etc.
>
> 7. SQL calls must be smart. For example, we should not simply rewrite all metadata or data. Commands should analyze what actually needs to be rewritten. I’ve tried to outline that for metadata and will submit a doc for data compaction.
>
> - Anton
>
>
> On 23 Jul 2020, at 12:40, Anton Okolnychyi <aokolnyc...@apple.com.INVALID> wrote:
>
> Hi devs,
>
> I want to start a discussion on whether we want to have some SQL extensions in Iceberg that would help data engineers invoke Iceberg-specific functionality through SQL. I know companies have this internally, but I would like to unify this starting from Spark 3 and share the same syntax across query engines to get consistent behavior.
>
> I’ve put together a short doc:
>
> https://docs.google.com/document/d/1Nf8c16R2hj4lSc-4sQg4oiUUV_F4XqZKth1woEo6TN8
>
> I’d appreciate everyone’s feedback. Please, feel free to comment and add alternatives.
>
> Thanks,
> Anton


--
Ryan Blue
Software Engineer
Netflix
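To make the comparison concrete, here is one possible shape for stand-alone SNAPSHOT and MIGRATE statements of the kind a parser extension could add (the keywords and property names are illustrative only, not a concrete proposal):

  -- create an Iceberg table from an existing source table,
  -- leaving the source table untouched
  SNAPSHOT TABLE db.src AS db.src_snapshot
  TBLPROPERTIES ('write.format.default' = 'parquet')

  -- convert an existing table to the Iceberg format in place
  MIGRATE TABLE db.src
  TBLPROPERTIES ('migrate.keep-backup' = 'true')

Statements like these read like regular DDL and can reuse familiar clauses such as TBLPROPERTIES, which is the argument for separate commands; the trade-off, as Anton notes, is that they would have to live in Iceberg's SQL extensions (for Spark, presumably enabled through the spark.sql.extensions configuration) rather than in the engines themselves.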