Thanks to all of you for joining the discussion. The PR is at https://github.com/apache/spark/pull/26763; all the PostgreSQL dialect related PRs are linked in its description. I hope the authors can help with the review.
Best,
Yuanjian

Driesprong, Fokko <fo...@driesprong.frl> wrote on Sun, Dec 1, 2019 at 7:24 PM:

> +1 (non-binding)
>
> Cheers, Fokko
>
> On Thu, Nov 28, 2019 at 03:47, Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> +1
>>
>> Bests,
>> Dongjoon.
>>
>> On Tue, Nov 26, 2019 at 3:52 PM Takeshi Yamamuro <linguin....@gmail.com>
>> wrote:
>>
>>> Yea, +1, that looks pretty reasonable to me.
>>>
>>> > Here I'm proposing to hold off on the PostgreSQL dialect. Let's remove it
>>> > from the codebase before it's too late. Currently we only have 3 features
>>> > under the PostgreSQL dialect:
>>>
>>> I personally think we could at least stop work on the dialect until
>>> 3.0 is released.
>>>
>>>
>>> On Wed, Nov 27, 2019 at 2:41 AM Gengliang Wang <
>>> gengliang.w...@databricks.com> wrote:
>>>
>>>> +1 with the practical proposal.
>>>> To me, the major concern is that the code base becomes complicated,
>>>> while the PostgreSQL dialect has very limited features. I tried introducing
>>>> one big flag `spark.sql.dialect` and isolating related code in #25697
>>>> <https://github.com/apache/spark/pull/25697>, but it seems hard to keep
>>>> clean.
>>>> Furthermore, the PostgreSQL dialect configuration overlaps with the
>>>> ANSI mode, which can be confusing sometimes.
>>>>
>>>> Gengliang
>>>>
>>>> On Tue, Nov 26, 2019 at 8:57 AM Xiao Li <lix...@databricks.com> wrote:
>>>>
>>>>> +1
>>>>>
>>>>>> One particular negative effect has been that new postgresql tests add
>>>>>> well over an hour to tests,
>>>>>
>>>>> Adding postgresql tests is for improving the test coverage of Spark
>>>>> SQL. We should continue to do this by importing more test cases. The
>>>>> quality of Spark highly depends on the test coverage. We can further
>>>>> parallelize the test execution to reduce the test time.
>>>>>
>>>>> Migrating PostgreSQL workloads to Spark SQL
>>>>>
>>>>> This should not be our current focus. In the near future, it is
>>>>> impossible to be fully compatible with PostgreSQL.
>>>>> We should focus on adding features that are useful to the Spark
>>>>> community. PostgreSQL is a good reference, but we do not need to
>>>>> blindly follow it. We already closed multiple related JIRAs that try to
>>>>> add some PostgreSQL features that are not commonly used.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Xiao
>>>>>
>>>>>
>>>>> On Tue, Nov 26, 2019 at 8:30 AM Maciej Szymkiewicz <
>>>>> mszymkiew...@gmail.com> wrote:
>>>>>
>>>>>> I think it is important to distinguish between two different concepts:
>>>>>>
>>>>>> - Adherence to standards and their well established
>>>>>>   implementations.
>>>>>> - Enabling migrations from some product X to Spark.
>>>>>>
>>>>>> While these two problems are related, they are independent and one
>>>>>> can be achieved without the other.
>>>>>>
>>>>>> - The former approach doesn't imply that all features of the SQL
>>>>>>   standard (or its specific implementation) are provided. It is
>>>>>>   sufficient that the commonly used features that are implemented
>>>>>>   are standard compliant. Therefore, if an end user applies some
>>>>>>   well known pattern, things will work as expected.
>>>>>>
>>>>>>   In my personal opinion that's something that is worth the
>>>>>>   required development resources, and in general should happen
>>>>>>   within the project.
>>>>>>
>>>>>> - The latter one is more complicated. First of all, the premise
>>>>>>   that one can "migrate PostgreSQL workloads to Spark" seems to be
>>>>>>   flawed. While both Spark and PostgreSQL evolve, and probably have
>>>>>>   more in common today than a few years ago, they're not even close
>>>>>>   enough to pretend that one can be a replacement for the other. In
>>>>>>   contrast, existing compatibility layers between major vendors make
>>>>>>   sense, because feature disparity (at least when it comes to core
>>>>>>   functionality) is usually minimal.
>>>>>>   And that doesn't even touch the problem that PostgreSQL provides
>>>>>>   extensively used extension points that enable a broad and evolving
>>>>>>   ecosystem (what should we do about continuous queries? Should
>>>>>>   Structured Streaming provide some compatibility layer as well?).
>>>>>>
>>>>>>   More realistically, Spark could provide a compatibility layer with
>>>>>>   some analytical tools that themselves provide some PostgreSQL
>>>>>>   compatibility, but these are not always fully compatible with
>>>>>>   upstream PostgreSQL, nor do they necessarily follow the latest
>>>>>>   PostgreSQL development.
>>>>>>
>>>>>>   Furthermore, a compatibility layer can be, within certain limits
>>>>>>   (i.e. availability of required primitives), maintained as a separate
>>>>>>   project, without putting more strain on existing resources.
>>>>>>   Effectively, what we care about here is whether we can translate a
>>>>>>   given SQL string into a logical or physical plan.
>>>>>>
>>>>>>
>>>>>> On 11/26/19 3:26 PM, Wenchen Fan wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Recently we started an effort to achieve feature parity between Spark
>>>>>> and PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764
>>>>>>
>>>>>> This is going very well. We've added many missing features (parser
>>>>>> rules, built-in functions, etc.) to Spark, and also corrected several
>>>>>> inappropriate behaviors of Spark to follow the SQL standard and
>>>>>> PostgreSQL. Many thanks to all the people who contribute to it!
>>>>>>
>>>>>> There are several cases when adding a PostgreSQL feature:
>>>>>> 1. Spark doesn't have this feature: just add it.
>>>>>> 2. Spark has this feature, but the behavior is different:
>>>>>>    2.1 Spark's behavior doesn't make sense: change it to follow the SQL
>>>>>>    standard and PostgreSQL, with a legacy config to restore the behavior.
>>>>>>    2.2 Spark's behavior makes sense but violates the SQL standard:
>>>>>>    change the behavior to follow the SQL standard and PostgreSQL when
>>>>>>    the ANSI mode is enabled (default false).
>>>>>>    2.3 Spark's behavior makes sense and doesn't violate the SQL
>>>>>>    standard: add the PostgreSQL behavior under the PostgreSQL dialect
>>>>>>    (default is the Spark native dialect).
>>>>>>
>>>>>> The PostgreSQL dialect itself is a good idea. It can help users to
>>>>>> migrate PostgreSQL workloads to Spark. Other databases have this strategy
>>>>>> too. For example, DB2 provides an Oracle dialect
>>>>>> <https://www.ibm.com/developerworks/data/library/techarticle/dm-0907oracleappsondb2/index.html>.
>>>>>>
>>>>>> However, there are so many differences between Spark and PostgreSQL,
>>>>>> including SQL parsing, type coercion, function/operator behavior, data
>>>>>> types, etc. I'm afraid that we may spend a lot of effort on it, and make
>>>>>> the Spark codebase pretty complicated, but still not be able to provide a
>>>>>> usable PostgreSQL dialect.
>>>>>>
>>>>>> Furthermore, it's not clear to me how many users need to migrate
>>>>>> PostgreSQL workloads. I think it's much more important to make Spark
>>>>>> ANSI-compliant first, which doesn't need that much work.
>>>>>>
>>>>>> Recently I've seen multiple PRs adding PostgreSQL cast functions,
>>>>>> while our own cast function is not ANSI-compliant yet. This makes me
>>>>>> think that we should do something to properly prioritize ANSI mode over
>>>>>> other dialects.
>>>>>>
>>>>>> Here I'm proposing to hold off on the PostgreSQL dialect. Let's remove
>>>>>> it from the codebase before it's too late. Currently we only have 3
>>>>>> features under the PostgreSQL dialect:
>>>>>> 1. when casting string to boolean, `t`, `tr`, `tru`, `yes`, .. are
>>>>>>    also allowed as true strings.
>>>>>> 2. `date - date` returns interval in Spark (SQL standard behavior),
>>>>>>    but returns int in PostgreSQL
>>>>>> 3.
>>>>>>    `int / int` returns double in Spark, but returns int in
>>>>>>    PostgreSQL. (there is no standard)
>>>>>>
>>>>>> We should still add PostgreSQL features that Spark doesn't have, or
>>>>>> where Spark's behavior violates the SQL standard. But for others, let's
>>>>>> just update the answer files of the PostgreSQL tests.
>>>>>>
>>>>>> Any comments are welcome!
>>>>>>
>>>>>> Thanks,
>>>>>> Wenchen
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>> Maciej
>>>>>>
>>>>>
>>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
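
For readers who want the three dialect-specific behaviors from the proposal in concrete form, here is a minimal plain-Python sketch. It is not Spark or PostgreSQL code and does not reflect how either project implements these operations. All helper names (`pg_style_is_true`, `interval_between`, `days_between`, `spark_style_div`, `pg_int_div`) are hypothetical. The true-string check only covers prefixes of `true` and `yes`, the values named in the thread, and truncation toward zero for the integer division is an assumption about the PostgreSQL side.

```python
from datetime import date, timedelta


def pg_style_is_true(text: str) -> bool:
    """Treat any non-empty prefix of 'true' or 'yes' as a true string.

    Assumption: only the values named in the thread are modelled here.
    """
    value = text.strip().lower()
    return bool(value) and ("true".startswith(value) or "yes".startswith(value))


def interval_between(end: date, start: date) -> timedelta:
    """Interval-style result, analogous to the Spark behavior in the thread."""
    return end - start


def days_between(end: date, start: date) -> int:
    """Integer day count, analogous to the PostgreSQL behavior in the thread."""
    return (end - start).days


def spark_style_div(a: int, b: int) -> float:
    """`int / int` producing a floating-point result."""
    return a / b


def pg_int_div(a: int, b: int) -> int:
    """`int / int` producing an int, truncating toward zero (assumed)."""
    quotient = abs(a) // abs(b)
    return quotient if (a >= 0) == (b >= 0) else -quotient


if __name__ == "__main__":
    print(pg_style_is_true("tru"), pg_style_is_true("maybe"))
    print(interval_between(date(2019, 12, 1), date(2019, 11, 26)))
    print(days_between(date(2019, 12, 1), date(2019, 11, 26)))
    print(spark_style_div(7, 2), pg_int_div(7, 2), pg_int_div(-7, 2))
```

Each pair of functions takes the same inputs and differs only in result type or accepted values, which is why the proposal treats them as dialect choices rather than missing features.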