I think the moment we start touching Catalyst we should be using Scala. If in the future there is a stored procedure API in Spark, we can always go back to Java.
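A minimal sketch of how such a Scala extensions module could be enabled per session, assuming a hypothetical class name (the spark.sql.extensions config is standard Spark 3; the module name and its contents are placeholders, not a settled design):

    import org.apache.spark.sql.SparkSession

    // The spark.sql.extensions config is standard Spark 3; the class name
    // below is only a placeholder for whatever module the community ships.
    val spark = SparkSession.builder()
      .appName("iceberg-sql-extensions")
      .config("spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
      .getOrCreate()

Users who do not set the config get stock Spark behavior, which keeps the extensions opt-in as described below.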
On Tue, Aug 25, 2020, 4:59 PM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:

> One more point we should clarify before implementing: where will the SQL
> extensions live? In the case of Presto, the extensions will be exposed as
> proper stored procedures and can be part of the Presto repo. In the case of
> Spark, we could either keep them in a new module in Iceberg or in a
> completely different repo. Personally, I think the first option is better, as
> those extensions will be essential for everyone who uses Iceberg. Also,
> they will be optional: it is up to the users to enable them.
>
> W.r.t. the Spark extensions, we also need to clarify the language to use.
> In general, it will be easier to use Scala, but we recently got rid of it to
> avoid reasoning about different Scala versions. Writing extensions in Java
> may be a bit painful. That's why I am inclined to have a Scala module for
> Spark 3 SQL extensions in Iceberg that users can optionally enable via
> SessionExtensions. What does everybody else think?
>
> Thanks,
> Anton
>
> On 4 Aug 2020, at 15:12, Anton Okolnychyi <aokolnyc...@apple.com.INVALID> wrote:
>
> During the last sync we discussed a blocker for this work raised by Carl.
> It was unclear how role-based access control will work in the proposed
> approach. Specifically, how to ensure that user `X` not only has access to
> a stored procedure but is also allowed to execute it on table `T`, where
> table name `T` is provided as an argument.
>
> The Presto community can expose an API for performing security checks
> within stored procedures to address this. It is not ideal, as it is up to
> the stored procedure to do all checks correctly, but it solves the problem.
>
> - Anton
>
> On 29 Jul 2020, at 13:46, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>
> That looks like a good plan to me. Initially using stored procedures and
> adding custom syntax where possible sounds like a good way to start.
>
> For Spark, I agree that we can start exploring a plugin that can extend
> Spark's syntax. Having that done will make development faster and make it
> easier to get this upstream, I think.
>
> On Mon, Jul 27, 2020 at 11:14 PM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:
>
>> Thanks everybody for taking a look at the doc. FYI, I've updated it.
>>
>> I would like to share some intermediate thoughts.
>>
>> 1. It seems beneficial to follow the stored procedures approach to call
>> small actions like rollback or expire snapshots. Presto already allows
>> connectors to define stored procedures, and it will be much easier to add
>> such syntax to other query engines as it is standard SQL. If we go that
>> route, optional arguments and name-based arguments can make the syntax
>> very reasonable for straightforward operations.
>>
>> 2. There are still some cases where separate commands *may* make sense.
>> For example, it may be more natural to have SNAPSHOT or MIGRATE as
>> separate commands. That way, we can use well-known clauses like
>> TBLPROPERTIES. Later, we may build a VACUUM command with different modes
>> to combine 3-4 actions. We have SNAPSHOT and MIGRATE internally and they
>> are frequently used (especially SNAPSHOT).
>>
>> 3. If we decide to build SNAPSHOT and MIGRATE as separate commands, it is
>> unlikely we can get them into query engines even though the commands are
>> generic. So, we may need to maintain them in Iceberg in the form of SQL
>> extensions (e.g. an extended parser via SQL extensions in Spark). That may
>> not always be possible in all query engines.
>>
>> 4. We need to align the syntax, including argument names, across query
>> engines. Otherwise, it will be a mess if there is a cosmetic difference in
>> each query engine.
>>
>> 5. Spark does not have a plugin for stored procedures. There is a
>> proposal from Ryan to add a function catalog API. I think it is a bit
>> different from the stored procedure catalog, as functions are used in
>> SELECT and procedures are used in CALL. While we can explore how to add
>> such support to Spark, we most likely need to start with SQL extensions in
>> Iceberg. Otherwise, we will be blocked for a long time.
>>
>> 6. Wherever possible, SQL calls must return some output that should be a
>> summary of what was done. For example, if we expire snapshots, return the
>> number of expired snapshots, the number of removed data and metadata
>> files, the number of scanned manifests, etc. If we import a table, output
>> the number of imported files, etc.
>>
>> 7. SQL calls must be smart. For example, we should not simply rewrite all
>> metadata or data. Commands should analyze what needs to be rewritten. I've
>> tried to outline that for metadata and will submit a doc for data
>> compaction.
>>
>> - Anton
>>
>> On 23 Jul 2020, at 12:40, Anton Okolnychyi <aokolnyc...@apple.com.INVALID> wrote:
>>
>> Hi devs,
>>
>> I want to start a discussion on whether we want to have some SQL
>> extensions in Iceberg that should help data engineers invoke
>> Iceberg-specific functionality through SQL. I know companies have this
>> internally, but I would like to unify this starting from Spark 3 and share
>> the same syntax across query engines to have consistent behavior.
>>
>> I've put together a short doc:
>>
>> https://docs.google.com/document/d/1Nf8c16R2hj4lSc-4sQg4oiUUV_F4XqZKth1woEo6TN8
>>
>> I'd appreciate everyone's feedback. Please feel free to comment and add
>> alternatives.
>>
>> Thanks,
>> Anton
>
> --
> Ryan Blue
> Software Engineer
> Netflix
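To make the named-argument CALL syntax and the summary output from points 1 and 6 above concrete, here is a rough sketch; the procedure name, argument names, and result columns are illustrative only, not settled syntax, and it assumes the extensions-enabled session from the earlier sketch:

    // Rough sketch of a named-argument stored procedure call; the
    // procedure and argument names here are illustrative, not final.
    val result = spark.sql(
      """CALL catalog.system.expire_snapshots(
        |  table => 'db.events',
        |  older_than => TIMESTAMP '2020-08-01 00:00:00'
        |)""".stripMargin)

    // Per point 6, the call would return a summary of what was done, e.g.
    // counts of expired snapshots and removed data/metadata files.
    result.show()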