I think the moment we start touching Catalyst we should be using Scala. If
a stored procedure API is added to Spark in the future, we can always go
back to Java.

On Tue, Aug 25, 2020, 4:59 PM Anton Okolnychyi
<aokolnyc...@apple.com.invalid> wrote:

> One more point we should clarify before implementing: where will the SQL
> extensions live? In the case of Presto, the extensions will be exposed as
> proper stored procedures and can be part of the Presto repo. In the case of
> Spark, we could either keep them in a new module in Iceberg or in a
> completely separate repo. Personally, I think the first option is better,
> as those extensions will be essential for everyone who uses Iceberg. They
> will also be optional; it is up to users to enable them.
>
> W.r.t. the Spark extensions, we also need to settle on the language to use.
> In general, Scala would be easier, but we recently got rid of it to avoid
> reasoning about different Scala versions. Writing extensions in Java may be
> a bit painful. That’s why I am inclined to have a Scala module for Spark 3
> SQL extensions in Iceberg that users can optionally enable via
> SessionExtensions. What does everybody else think?
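>
> To make this concrete, here is a rough sketch of what the entry point could
> look like (the module, class, parser, and package names below are purely
> hypothetical, just to illustrate the idea):
>
>   import org.apache.spark.sql.SparkSessionExtensions
>
>   // Entry point that users would reference via the spark.sql.extensions config.
>   class IcebergSparkSessionExtensions extends (SparkSessionExtensions => Unit) {
>     override def apply(extensions: SparkSessionExtensions): Unit = {
>       // Layer the extra Iceberg syntax on top of Spark's own parser.
>       extensions.injectParser { case (_, delegate) =>
>         new IcebergSqlExtensionsParser(delegate) // hypothetical delegating parser
>       }
>     }
>   }
>
>   // spark-defaults.conf (or --conf), class name hypothetical:
>   // spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions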
>
> Thanks,
> Anton
>
> On 4 Aug 2020, at 15:12, Anton Okolnychyi <aokolnyc...@apple.com.INVALID>
> wrote:
>
> During the last sync we discussed a blocker for this work raised by Carl.
> It was unclear how role-based access control would work in the proposed
> approach. Specifically, how do we ensure that user `X` not only has access
> to a stored procedure but is also allowed to execute it on table `T`, where
> the table name `T` is provided as an argument?
>
> To address this, the Presto community can expose an API for performing
> security checks within stored procedures. It is not ideal, as it is up to
> the stored procedure to perform all checks correctly, but it solves the problem.
>
> - Anton
>
> On 29 Jul 2020, at 13:46, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>
> That looks like a good plan to me. Starting with stored procedures and
> adding custom syntax where possible sounds like a good approach.
>
> For Spark, I agree that we can start exploring a plugin that can extend
> Spark's syntax. Having that done will make development faster and make it
> easier to get this upstream, I think.
>
> On Mon, Jul 27, 2020 at 11:14 PM Anton Okolnychyi <
> aokolnyc...@apple.com.invalid> wrote:
>
>> Thanks everybody for taking a look at the doc. FYI, I’ve updated it.
>>
>> I would like to share some intermediate thoughts.
>>
>> 1. It seems beneficial to follow the stored procedures approach to call
>> small actions like rollback or expire snapshots. Presto already allows
>> connectors to define stored procedures and it will be much easier to add
>> such syntax to other query engines as it is standard SQL. If we go that
>> route, optional arguments and name-based arguments can make the syntax very
>> reasonable for straightforward operations.
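>>
>> As a purely illustrative sketch (the procedure namespace, procedure names,
>> and argument names below are hypothetical, not a settled syntax), such a
>> call from Spark might look like:
>>
>>   import org.apache.spark.sql.SparkSession
>>
>>   // local master only so the sketch is self-contained
>>   val spark = SparkSession.builder().master("local[*]").appName("call-sketch").getOrCreate()
>>
>>   // name-based arguments, in the style of Presto stored procedures
>>   spark.sql("CALL iceberg.system.rollback_to_snapshot(table => 'db.tbl', snapshot_id => 123)")
>>
>>   // optional arguments are simply omitted, e.g. expire snapshots by age only
>>   spark.sql("CALL iceberg.system.expire_snapshots(table => 'db.tbl', older_than => TIMESTAMP '2020-07-01 00:00:00')")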
>>
>> 2. There are still some cases where separate commands *may* make sense.
>> For example, it may be more natural to have SNAPSHOT or MIGRATE as separate
>> commands. That way, we can use well-known clauses like TBLPROPERTIES.
>> Later, we may build a VACUUM command with different modes to combine 3-4
>> actions. We have SNAPSHOT and MIGRATE internally and they are frequently
>> used (especially SNAPSHOT).
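>>
>> To show where TBLPROPERTIES would fit, the syntax (again purely
>> hypothetical) could look roughly like:
>>
>>   // reusing the SparkSession from the sketch above; both statements are hypothetical syntax
>>   spark.sql("""
>>     SNAPSHOT TABLE db.src_table AS db.iceberg_snapshot
>>     TBLPROPERTIES ('write.format.default' = 'parquet')
>>   """)
>>
>>   // MIGRATE would convert the source table in place
>>   spark.sql("MIGRATE TABLE db.src_table TBLPROPERTIES ('write.format.default' = 'parquet')")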
>>
>> 3. If we decide to build SNAPSHOT and MIGRATE as separate commands, it is
>> unlikely we can get them into query engines even though the commands are
>> generic. So, we may need to maintain them in Iceberg in the form of SQL
>> extensions (e.g. an extended parser via SQL extensions in Spark). That may
>> not always be possible in all query engines.
>>
>> 4. We need to align the syntax, including argument names, across query
>> engines. Otherwise, it will be a mess if each query engine has its own
>> cosmetic differences.
>>
>> 5. Spark does not have a plugin for stored procedures. There is a
>> proposal from Ryan to add a function catalog API. I think it is a bit
>> different from a stored procedure catalog, as functions are used in SELECT
>> while procedures are used in CALL. While we can explore how to add such
>> support to Spark, we most likely need to start with SQL extensions in
>> Iceberg. Otherwise, we will be blocked for a long time.
>>
>> 6. Wherever possible, SQL calls should return output summarizing what was
>> done. For example, if we expire snapshots, return the number of expired
>> snapshots, the number of removed data and metadata files, the number of
>> scanned manifests, etc. If we import a table, output the number of
>> imported files, etc.
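>>
>> For instance (the column names here are hypothetical, only to illustrate
>> the kind of summary), expiring snapshots could return a single-row result:
>>
>>   // hypothetical result of an expire call; column names are illustrative only
>>   val summary = spark.sql(
>>     "CALL iceberg.system.expire_snapshots(table => 'db.tbl', older_than => TIMESTAMP '2020-07-01 00:00:00')")
>>   // columns could be: expired_snapshots_count, deleted_data_files_count,
>>   // deleted_manifest_files_count, scanned_manifests_count
>>   summary.show()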
>>
>> 7. SQL calls must be smart. For example, we should not simply rewrite all
>> metadata or data. Commands should analyze what needs to be rewritten. I’ve
>> tried to outline that for metadata and will submit a doc for data
>> compaction.
>>
>> - Anton
>>
>>
>> On 23 Jul 2020, at 12:40, Anton Okolnychyi <aokolnyc...@apple.com.INVALID>
>> wrote:
>>
>> Hi devs,
>>
>> I want to start a discussion on whether we want to have some SQL
>> extensions in Iceberg that would help data engineers invoke
>> Iceberg-specific functionality through SQL. I know companies have this
>> internally, but I would like to unify it, starting with Spark 3, and share
>> the same syntax across query engines for consistent behavior.
>>
>> I’ve put together a short doc:
>>
>>
>> https://docs.google.com/document/d/1Nf8c16R2hj4lSc-4sQg4oiUUV_F4XqZKth1woEo6TN8
>>
>> I’d appreciate everyone’s feedback. Please feel free to comment and add
>> alternatives.
>>
>> Thanks,
>> Anton
>>
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>
>
