Re: [DISCUSS] PostgreSQL dialect

Yuanjian Li Wed, 04 Dec 2019 16:57:48 -0800

Thank you all for joining the discussion.
The PR is at https://github.com/apache/spark/pull/26763; all the
PostgreSQL dialect related PRs are linked in its description.
I hope the authors can help review it.


Best,
Yuanjian

On Sun, Dec 1, 2019 at 7:24 PM Driesprong, Fokko <[email protected]> wrote:

> +1 (non-binding)
>
> Cheers, Fokko
>
> On Thu, Nov 28, 2019 at 03:47 Dongjoon Hyun <[email protected]> wrote:
>
>> +1
>>
>> Bests,
>> Dongjoon.
>>
>> On Tue, Nov 26, 2019 at 3:52 PM Takeshi Yamamuro <[email protected]>
>> wrote:
>>
>>> Yea, +1, that looks pretty reasonable to me.
>>> > Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
>>> > from the codebase before it's too late. Currently we only have 3 features
>>> > under PostgreSQL dialect:
>>> I personally think we could at least stop work on the dialect until
>>> 3.0 is released.
>>>
>>>
>>> On Wed, Nov 27, 2019 at 2:41 AM Gengliang Wang <
>>> [email protected]> wrote:
>>>
>>>> +1 to the practical proposal.
>>>> To me, the major concern is that the code base becomes complicated
>>>> while the PostgreSQL dialect has very limited features. I tried introducing
>>>> one big flag `spark.sql.dialect` and isolating related code in #25697
>>>> <https://github.com/apache/spark/pull/25697>, but it seems hard to keep
>>>> clean.
>>>> Furthermore, the PostgreSQL dialect configuration overlaps with the
>>>> ANSI mode, which can be confusing.
>>>>
>>>> Gengliang
>>>>
>>>> On Tue, Nov 26, 2019 at 8:57 AM Xiao Li <[email protected]> wrote:
>>>>
>>>>> +1
>>>>>
>>>>>
>>>>>> One particular negative effect has been that new postgresql tests add
>>>>>> well over an hour to tests,
>>>>>
>>>>>
>>>>> Adding PostgreSQL tests improves the test coverage of Spark
>>>>> SQL. We should continue to do this by importing more test cases. The
>>>>> quality of Spark depends heavily on the test coverage. We can further
>>>>> parallelize the test execution to reduce the test time.
>>>>>
>>>>>> Migrating PostgreSQL workloads to Spark SQL
>>>>>
>>>>>
>>>>> This should not be our current focus. In the near future, it is
>>>>> impossible to be fully compatible with PostgreSQL. We should focus on
>>>>> adding features that are useful to the Spark community. PostgreSQL is a good
>>>>> reference, but we do not need to follow it blindly. We have already closed
>>>>> multiple related JIRAs that tried to add PostgreSQL features that are
>>>>> not commonly used.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Xiao
>>>>>
>>>>>
>>>>> On Tue, Nov 26, 2019 at 8:30 AM Maciej Szymkiewicz <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> I think it is important to distinguish between two different concepts:
>>>>>>
>>>>>>    - Adherence to standards and their well established
>>>>>>    implementations.
>>>>>>    - Enabling migrations from some product X to Spark.
>>>>>>
>>>>>> While these two problems are related, they are independent, and one
>>>>>> can be achieved without the other.
>>>>>>
>>>>>>    - The former approach doesn't imply that all features of the SQL
>>>>>>    standard (or its specific implementation) are provided. It is
>>>>>>    sufficient that the commonly used features that are implemented
>>>>>>    are standard-compliant. Therefore, if an end user applies some
>>>>>>    well-known pattern, things will work as expected.
>>>>>>
>>>>>>    In my personal opinion, that's something that is worth the
>>>>>>    required development resources, and in general should happen
>>>>>>    within the project.
>>>>>>
>>>>>>
>>>>>>    - The latter one is more complicated. First of all, the premise
>>>>>>    that one can "migrate PostgreSQL workloads to Spark" seems to be
>>>>>>    flawed. While both Spark and PostgreSQL evolve, and probably have
>>>>>>    more in common today than a few years ago, they're not even close
>>>>>>    enough to pretend that one can be a replacement for the other. In
>>>>>>    contrast, existing compatibility layers between major vendors
>>>>>>    make sense, because feature disparity (at least when it comes to
>>>>>>    core functionality) is usually minimal. And that doesn't even
>>>>>>    touch the problem that PostgreSQL provides extensively used
>>>>>>    extension points that enable a broad and evolving ecosystem (what
>>>>>>    should we do about continuous queries? Should Structured
>>>>>>    Streaming provide some compatibility layer as well?).
>>>>>>
>>>>>>    More realistically, Spark could provide a compatibility layer
>>>>>>    with some analytical tools that themselves provide some PostgreSQL
>>>>>>    compatibility, but these are not always fully compatible with
>>>>>>    upstream PostgreSQL, nor do they necessarily follow the latest
>>>>>>    PostgreSQL development.
>>>>>>
>>>>>>    Furthermore, a compatibility layer can be, within certain limits
>>>>>>    (i.e. availability of required primitives), maintained as a
>>>>>>    separate project, without putting more strain on existing
>>>>>>    resources. Effectively, what we care about here is whether we can
>>>>>>    translate a certain SQL string into a logical or physical plan.
>>>>>>
>>>>>>
>>>>>> On 11/26/19 3:26 PM, Wenchen Fan wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Recently we started an effort to achieve feature parity between Spark
>>>>>> and PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764
>>>>>>
>>>>>> This is going very well. We've added many missing features (parser rules,
>>>>>> built-in functions, etc.) to Spark, and also corrected several
>>>>>> inappropriate behaviors of Spark to follow the SQL standard and PostgreSQL.
>>>>>> Many thanks to all the people who contributed to it!
>>>>>>
>>>>>> There are several cases when adding a PostgreSQL feature:
>>>>>> 1. Spark doesn't have this feature: just add it.
>>>>>> 2. Spark has this feature, but the behavior is different:
>>>>>>     2.1 Spark's behavior doesn't make sense: change it to follow the SQL
>>>>>> standard and PostgreSQL, with a legacy config to restore the old behavior.
>>>>>>     2.2 Spark's behavior makes sense but violates the SQL standard:
>>>>>> change the behavior to follow the SQL standard and PostgreSQL when ANSI
>>>>>> mode is enabled (default false).
>>>>>>     2.3 Spark's behavior makes sense and doesn't violate the SQL
>>>>>> standard: add the PostgreSQL behavior under the PostgreSQL dialect
>>>>>> (the default is the Spark native dialect).
>>>>>>
>>>>>> The PostgreSQL dialect itself is a good idea. It can help users
>>>>>> migrate PostgreSQL workloads to Spark. Other databases have this strategy
>>>>>> too. For example, DB2 provides an Oracle dialect
>>>>>> <https://www.ibm.com/developerworks/data/library/techarticle/dm-0907oracleappsondb2/index.html>.
>>>>>>
>>>>>> However, there are so many differences between Spark and PostgreSQL,
>>>>>> including SQL parsing, type coercion, function/operator behavior, data
>>>>>> types, etc. I'm afraid that we may spend a lot of effort on it, and make
>>>>>> the Spark codebase pretty complicated, but still not able to provide a
>>>>>> usable PostgreSQL dialect.
>>>>>>
>>>>>> Furthermore, it's not clear to me how many users need to migrate
>>>>>> PostgreSQL workloads. I think it's much more important to make
>>>>>> Spark ANSI-compliant first, which doesn't need that much work.
>>>>>>
>>>>>> Recently I've seen multiple PRs adding PostgreSQL cast functions,
>>>>>> while our own cast function is not ANSI-compliant yet. This makes me
>>>>>> think that we should do something to properly prioritize ANSI mode
>>>>>> over other dialects.
>>>>>>
>>>>>> Here I'm proposing to hold off on the PostgreSQL dialect. Let's remove
>>>>>> it from the codebase before it's too late. Currently we only have 3
>>>>>> features under the PostgreSQL dialect:
>>>>>> 1. when casting string to boolean, `t`, `tr`, `tru`, `yes`, .. are
>>>>>> also accepted as true strings.
>>>>>> 2. `date - date` returns interval in Spark (SQL standard behavior),
>>>>>> but returns int in PostgreSQL.
>>>>>> 3. `int / int` returns double in Spark, but returns int in
>>>>>> PostgreSQL (there is no standard).
>>>>>>
>>>>>> We should still add PostgreSQL features that Spark doesn't have, or
>>>>>> where Spark's behavior violates the SQL standard. But for others, let's
>>>>>> just update the answer files of the PostgreSQL tests.
>>>>>>
>>>>>> Any comments are welcome!
>>>>>>
>>>>>> Thanks,
>>>>>> Wenchen
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>> Maciej
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
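
For readers skimming the thread, the three behaviors that the proposal lists
under the PostgreSQL dialect (string-to-boolean casting, `date - date`, and
`int / int`) can be illustrated with a small plain-Python sketch. This is not
Spark or PostgreSQL code: the function names and the `dialect` argument are
hypothetical, and the accepted boolean words beyond the `t`, `tr`, `tru`, `yes`
mentioned above (`on`/`off`, `1`/`0`, and the false variants) are an assumption
based on PostgreSQL's usual boolean input rules.

```python
from datetime import date, timedelta

# Hypothetical helpers for illustration only; not Spark or PostgreSQL APIs.
_TRUE_WORDS = ("true", "yes", "on")
_FALSE_WORDS = ("false", "no", "off")


def pg_style_cast_to_boolean(text):
    """Accept prefixes of the boolean words, e.g. 't', 'tr', 'tru', 'yes'."""
    s = text.strip().lower()
    if not s:
        raise ValueError("empty string is not a boolean")
    if s in ("1", "0"):
        return s == "1"
    is_true = any(word.startswith(s) for word in _TRUE_WORDS)
    is_false = any(word.startswith(s) for word in _FALSE_WORDS)
    if is_true == is_false:  # no match at all, or an ambiguous prefix like 'o'
        raise ValueError(f"invalid boolean literal: {text!r}")
    return is_true


def date_minus_date(left, right, dialect="spark"):
    """Spark-like: an interval (timedelta here); PostgreSQL-like: int days."""
    delta = left - right
    return delta.days if dialect == "postgresql" else delta


def int_div_int(a, b, dialect="spark"):
    """Spark-like: fractional result; PostgreSQL-like: truncate toward zero."""
    if dialect == "postgresql":
        quotient = abs(a) // abs(b)
        return quotient if (a >= 0) == (b >= 0) else -quotient
    return a / b


if __name__ == "__main__":
    print(pg_style_cast_to_boolean("tru"))
    print(date_minus_date(date(2019, 12, 4), date(2019, 11, 26), "postgresql"))
    print(int_div_int(7, 2), int_div_int(7, 2, "postgresql"))
```

The sketch only shows why the same SQL text can yield differently typed
results under the two dialects, which is the source of the code-path
duplication discussed in this thread.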
