Yea, +1, that looks pretty reasonable to me.

> Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it from the codebase before it's too late. Currently we only have 3 features under the PostgreSQL dialect:

I personally think we could at least stop work on the dialect until 3.0 is released.
On Wed, Nov 27, 2019 at 2:41 AM Gengliang Wang <gengliang.w...@databricks.com> wrote:

> +1 with the practical proposal.
>
> To me, the major concern is that the code base becomes complicated, while the PostgreSQL dialect has very limited features. I tried introducing one big flag `spark.sql.dialect` and isolating the related code in #25697 <https://github.com/apache/spark/pull/25697>, but it seems hard to keep clean.
>
> Furthermore, the PostgreSQL dialect configuration overlaps with the ANSI mode, which can be confusing sometimes.
>
> Gengliang
>
> On Tue, Nov 26, 2019 at 8:57 AM Xiao Li <lix...@databricks.com> wrote:
>
>> +1
>>
>>> One particular negative effect has been that new postgresql tests add well over an hour to tests,
>>
>> Adding the postgresql tests is meant to improve the test coverage of Spark SQL. We should continue to do this by importing more test cases; the quality of Spark highly depends on test coverage. We can further parallelize the test execution to reduce the test time.
>>
>>> Migrating PostgreSQL workloads to Spark SQL
>>
>> This should not be our current focus. In the near future, it is impossible to be fully compatible with PostgreSQL. We should focus on adding features that are useful to the Spark community. PostgreSQL is a good reference, but we do not need to blindly follow it. We have already closed multiple related JIRAs that tried to add PostgreSQL features that are not commonly used.
>>
>> Cheers,
>>
>> Xiao
>>
>> On Tue, Nov 26, 2019 at 8:30 AM Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote:
>>
>>> I think it is important to distinguish between two different concepts:
>>>
>>> - Adherence to standards and their well-established implementations.
>>> - Enabling migrations from some product X to Spark.
>>>
>>> While these two problems are related, they are independent and one can be achieved without the other.
>>>
>>> - The former approach doesn't imply that all features of the SQL standard (or of a specific implementation) are provided. It is sufficient that the commonly used features that are implemented are standard compliant. Therefore, if an end user applies some well-known pattern, things will work as expected.
>>>
>>> In my personal opinion that's something that is worth the required development resources, and in general should happen within the project.
>>>
>>> - The latter one is more complicated. First of all, the premise that one can "migrate PostgreSQL workloads to Spark" seems to be flawed. While both Spark and PostgreSQL evolve, and probably have more in common today than a few years ago, they're nowhere near close enough to pretend that one can be a replacement for the other. In contrast, existing compatibility layers between major vendors make sense, because feature disparity (at least when it comes to core functionality) is usually minimal. And that doesn't even touch the problem that PostgreSQL provides extensively used extension points that enable a broad and evolving ecosystem (what should we do about continuous queries? Should Structured Streaming provide some compatibility layer as well?).
>>>
>>> More realistically, Spark could provide a compatibility layer with some analytical tools that themselves provide some PostgreSQL compatibility, but these are not always fully compatible with upstream PostgreSQL, nor do they necessarily follow the latest PostgreSQL development.
>>>
>>> Furthermore, a compatibility layer can be, within certain limits (i.e.
>>> availability of required primitives), maintained as a separate project, without putting more strain on existing resources. Effectively, what we care about here is whether we can translate a certain SQL string into a logical or physical plan.
>>>
>>> On 11/26/19 3:26 PM, Wenchen Fan wrote:
>>>
>>> Hi all,
>>>
>>> Recently we started an effort to achieve feature parity between Spark and PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764
>>>
>>> This has gone very well. We've added many missing features (parser rules, built-in functions, etc.) to Spark, and also corrected several inappropriate behaviors of Spark to follow the SQL standard and PostgreSQL. Many thanks to all the people that contributed to it!
>>>
>>> There are several cases when adding a PostgreSQL feature:
>>> 1. Spark doesn't have this feature: just add it.
>>> 2. Spark has this feature, but the behavior is different:
>>>   2.1 Spark's behavior doesn't make sense: change it to follow the SQL standard and PostgreSQL, with a legacy config to restore the old behavior.
>>>   2.2 Spark's behavior makes sense but violates the SQL standard: change the behavior to follow the SQL standard and PostgreSQL when ANSI mode is enabled (default false).
>>>   2.3 Spark's behavior makes sense and doesn't violate the SQL standard: add the PostgreSQL behavior under the PostgreSQL dialect (the default is the Spark native dialect).
>>>
>>> The PostgreSQL dialect itself is a good idea. It can help users migrate PostgreSQL workloads to Spark. Other databases have this strategy too. For example, DB2 provides an Oracle dialect <https://www.ibm.com/developerworks/data/library/techarticle/dm-0907oracleappsondb2/index.html>.
>>>
>>> However, there are so many differences between Spark and PostgreSQL, including SQL parsing, type coercion, function/operator behavior, data types, etc. I'm afraid that we may spend a lot of effort on it, and make the Spark codebase pretty complicated, but still not be able to provide a usable PostgreSQL dialect.
>>>
>>> Furthermore, it's not clear to me how many users have the requirement of migrating PostgreSQL workloads. I think it's much more important to make Spark ANSI-compliant first, which doesn't need that much work.
>>>
>>> Recently I've seen multiple PRs adding PostgreSQL cast functions, while our own cast function is not ANSI-compliant yet. This makes me think that we should do something to properly prioritize ANSI mode over other dialects.
>>>
>>> Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it from the codebase before it's too late. Currently we only have 3 features under the PostgreSQL dialect:
>>> 1. When casting string to boolean, `t`, `tr`, `tru`, `yes`, etc. are also allowed as true strings.
>>> 2. `date - date` returns an interval in Spark (SQL standard behavior), but returns an int in PostgreSQL.
>>> 3. `int / int` returns a double in Spark, but returns an int in PostgreSQL (there is no standard here).
>>>
>>> We should still add PostgreSQL features that Spark doesn't have, or where Spark's behavior violates the SQL standard. But for the others, let's just update the answer files of the PostgreSQL tests.
>>>
>>> Any comments are welcome!
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> --
>>> Best regards,
>>> Maciej
>>>
>>
>
--
---
Takeshi Yamamuro
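
For concreteness, below is a minimal sketch (in Spark/Scala, not part of the thread) of the three dialect-only behaviors listed in the proposal and of the dialect/ANSI overlap Gengliang mentions. The config names `spark.sql.dialect` and `spark.sql.ansi.enabled` are taken from the discussion and the 3.0-preview builds and should be treated as assumptions; since the dialect config was removed as a result of this proposal, the example is illustrative rather than something that runs as-is on later releases.

```scala
// Hypothetical sketch of the three PostgreSQL-dialect behaviors discussed in the thread,
// and of how the dialect config overlapped with the separate ANSI flag.
import org.apache.spark.sql.SparkSession

object DialectVsAnsiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dialect-vs-ansi-sketch")
      .master("local[*]")
      .getOrCreate()

    // Spark native dialect, ANSI mode off (the defaults described in the thread).
    spark.sql("SELECT 1 / 2").show()                                // 0.5: int / int => double
    spark.sql("SELECT DATE'2019-12-01' - DATE'2019-11-01'").show()  // an interval (SQL standard behavior)
    spark.sql("SELECT CAST('tru' AS BOOLEAN)").show()               // null: 'tru' is not an accepted boolean literal

    // PostgreSQL dialect (case 2.3 in the thread): the same queries would return
    // 0 (integer division), 30 (an int), and true ('tru' is a valid PostgreSQL prefix of 'true').
    spark.conf.set("spark.sql.dialect", "PostgreSQL")               // assumed preview-era config name
    spark.sql("SELECT 1 / 2").show()
    spark.conf.set("spark.sql.dialect", "Spark")

    // ANSI mode (case 2.2 in the thread) is a separate flag, which is why the two
    // configurations overlap and can be confusing: under ANSI semantics a malformed
    // cast is expected to fail instead of silently returning null.
    spark.conf.set("spark.sql.ansi.enabled", "true")
    spark.sql("SELECT CAST('tru' AS BOOLEAN)").show()

    spark.stop()
  }
}
```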