OK, on this let us dig a bit deeper, focusing on time travel queries (TTQ).

The goal is to return data as it appeared at a specific time. The
discussion is now on how to enable this.

We can specify this by using a clause such as 'AS OF SYSTEM TIME'
after the table name in the FROM clause:

SELECT * FROM t AS OF SYSTEM TIME '2021-11-18 18:45:00'

This means: return all the rows of table 't' as they appeared at that time.

This syntax is fairly standard and in line with the existing databases
that support TTQ.
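As an illustrative sketch only (the snapshot log, the ids and the
snapshot_as_of helper below are hypothetical, not Spark's or any
engine's actual implementation), resolving an AS OF SYSTEM TIME
timestamp essentially means picking the latest snapshot of the table
committed at or before the requested time:

```python
import bisect
from datetime import datetime

# Hypothetical snapshot log for table 't': (commit_time, snapshot_id),
# kept sorted by commit time. The ids are made up for illustration.
snapshots = [
    (datetime(2021, 11, 18, 18, 0), 10963874102871),
    (datetime(2021, 11, 18, 18, 30), 10963874102872),
    (datetime(2021, 11, 18, 19, 0), 10963874102873),
]

def snapshot_as_of(ts):
    """Return the id of the latest snapshot committed at or before ts."""
    commit_times = [t for t, _ in snapshots]
    i = bisect.bisect_right(commit_times, ts)
    if i == 0:
        raise ValueError("no snapshot exists at or before %s" % ts)
    return snapshots[i - 1][1]

# 'AS OF SYSTEM TIME 2021-11-18 18:45:00' would resolve to the
# 18:30 snapshot, id 10963874102872, in this made-up log.
print(snapshot_as_of(datetime(2021, 11, 18, 18, 45)))
```

A VERSION AS OF style query would skip the timestamp lookup and address
a snapshot id directly.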

HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 17 Nov 2021 at 06:08, Wenchen Fan <cloud0...@gmail.com> wrote:

> It's useful to have a SQL API to specify table options, similar to the
> DataFrameReader API. However, I share the same concern as @Hyukjin Kwon
> <gurwls...@gmail.com> and am not very comfortable with using hints to do
> it.
>
> In the PR, someone mentioned table-valued functions (TVFs). I think they
> are better than hints, but still have problems. For example, shall we
> support `FROM read(t AS VERSION OF 1, options...)`?
>
> We probably should investigate if there are similar SQL syntaxes in other
> databases first.
>
> On Wed, Nov 17, 2021 at 2:39 AM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> This concept is explained here
>> <https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html>.
>> If this is true, why can we not just use
>>
>> SELECT * FROM <TABLE_NAME> VERSION AS OF <VERSION_NO>
>>
>>
>>
>>
>>
>>
>> On Tue, 16 Nov 2021 at 17:49, Ryan Blue <b...@tabular.io> wrote:
>>
>>> Mich, time travel will use the newly added VERSION AS OF or TIMESTAMP AS
>>> OF syntax.
>>>
>>> On Tue, Nov 16, 2021 at 12:40 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> As I stated before, hints are designed to direct the optimizer to
>>>> choose a certain query execution plan based on specific criteria.
>>>>
>>>>
>>>> -- time travel
>>>> SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */
>>>>
>>>>
>>>> The alternative would be to specify time travel by creating a snapshot
>>>> based on a CURRENT_DATE() range, which encapsulates time travel for
>>>> 'snapshot-id'='10963874102873L'
>>>>
>>>>
>>>> CREATE SNAPSHOT t_snap
>>>>
>>>>   START WITH CURRENT_DATE() - 30
>>>>
>>>>   NEXT CURRENT_DATE()
>>>>
>>>>   AS SELECT * FROM t
>>>>
>>>>
>>>> SELECT * FROM t_snap
>>>>
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, 16 Nov 2021 at 04:26, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>
>>>>> My biggest concern with the syntax in hints is that Spark SQL's
>>>>> options can change results (e.g., CSV's header options) whereas hints are
>>>>> generally not designed to affect the external results if I am not 
>>>>> mistaken.
>>>>> This is counterintuitive.
>>>>> I left a comment in the PR, but what's the real benefit over
>>>>> leveraging SET conf and RESET conf? We can extract options from runtime
>>>>> session configurations, e.g., via SessionConfigSupport.
>>>>>
>>>>> On Tue, 16 Nov 2021 at 04:30, Nicholas Chammas <
>>>>> nicholas.cham...@gmail.com> wrote:
>>>>>
>>>>>> Side note about time travel: There is a PR
>>>>>> <https://github.com/apache/spark/pull/34497> to add
>>>>>> VERSION/TIMESTAMP AS OF syntax to Spark SQL.
>>>>>>
>>>>>> On Mon, Nov 15, 2021 at 2:23 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>>> I want to note that I wouldn't recommend time traveling this way by
>>>>>>> using the hint for `snapshot-id`. Instead, we want to add the standard 
>>>>>>> SQL
>>>>>>> syntax for that in a separate PR. This is useful for other options that
>>>>>>> help a table scan perform better, like specifying the target split size.
>>>>>>>
>>>>>>> You're right that this isn't a typical optimizer hint, but I'm not
>>>>>>> sure what other syntax is possible for this use case. How else would we
>>>>>>> send custom properties through to the scan?
>>>>>>>
>>>>>>> On Mon, Nov 15, 2021 at 9:25 AM Mich Talebzadeh <
>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I am looking at the hint and it appears to me (I stand to be
>>>>>>>> corrected) that it is a single table hint, as below:
>>>>>>>>
>>>>>>>> -- time travel
>>>>>>>> SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */
>>>>>>>>
>>>>>>>> My assumption is that any view on this table will also benefit from
>>>>>>>> this hint. This is not a hint to the optimizer in the classical
>>>>>>>> sense, only a snapshot hint. Normally, a hint is an instruction to the
>>>>>>>> optimizer. When writing SQL, one may know information about the data
>>>>>>>> unknown to the optimizer. Hints enable one to make decisions normally 
>>>>>>>> made
>>>>>>>> by the optimizer, sometimes causing the optimizer to select a plan 
>>>>>>>> that it
>>>>>>>> sees as higher cost.
>>>>>>>>
>>>>>>>>
>>>>>>>> So far as this case is concerned, it looks OK and I concur it
>>>>>>>> should be extended to write as well.
>>>>>>>>
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, 15 Nov 2021 at 17:02, Russell Spitzer <
>>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I think since we probably will end up using this same syntax on
>>>>>>>>> write, this makes a lot of sense. Unless there is another good way to
>>>>>>>>> express a similar concept during a write operation I think going 
>>>>>>>>> forward
>>>>>>>>> with this would be ok.
>>>>>>>>>
>>>>>>>>> On Mon, Nov 15, 2021 at 10:44 AM Ryan Blue <b...@tabular.io>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> The proposed feature is to be able to pass options through SQL
>>>>>>>>>> like you would when using the DataFrameReader API, so it would
>>>>>>>>>> work for all sources that support read options. Read options are 
>>>>>>>>>> part of
>>>>>>>>>> the DSv2 API, there just isn’t a way to pass options when using SQL. 
>>>>>>>>>> The PR
>>>>>>>>>> also has a non-Iceberg example, which is being able to customize 
>>>>>>>>>> some JDBC
>>>>>>>>>> source behaviors per query (e.g., fetchSize), rather than globally 
>>>>>>>>>> in the
>>>>>>>>>> table’s options.
>>>>>>>>>>
>>>>>>>>>> The proposed syntax is odd, but I think that's an artifact of
>>>>>>>>>> Spark introducing read options that aren't a normal part of SQL. 
>>>>>>>>>> Seems
>>>>>>>>>> reasonable to me to pass them through a hint.
>>>>>>>>>>
>>>>>>>>>> On Mon, Nov 15, 2021 at 2:18 AM Mich Talebzadeh <
>>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Interesting.
>>>>>>>>>>>
>>>>>>>>>>> What is this going to add on top of support for Apache Iceberg
>>>>>>>>>>> <https://www.dremio.com/data-lake/apache-iceberg/>? Will it be
>>>>>>>>>>> in line with support for Hive ACID tables or Delta Lake?
>>>>>>>>>>>
>>>>>>>>>>> HTH
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, 15 Nov 2021 at 01:56, Zhun Wang <wangzhun6...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi dev,
>>>>>>>>>>>>
>>>>>>>>>>>> We are discussing Support Dynamic Table Options for Spark SQL (
>>>>>>>>>>>> https://github.com/apache/spark/pull/34072). We are currently
>>>>>>>>>>>> not sure if the syntax makes sense and would like to know if
>>>>>>>>>>>> there is other feedback or opinion on this.
>>>>>>>>>>>>
>>>>>>>>>>>> I would appreciate any feedback on this.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Tabular
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
