I think it is important to distinguish between two different concepts:

  * Adherence to standards and their well established implementations.
  * Enabling migrations from some product X to Spark.

While these two problems are related, they are independent and one can
be achieved without the other.

  * The former approach doesn't imply that all features of the SQL
    standard (or a specific implementation of it) are provided. It is
    sufficient that the commonly used features that are implemented are
    standard compliant. Therefore, if an end user applies some well
    known pattern, things will work as expected.

    In my personal opinion, that is worth the required development
    resources and, in general, should happen within the project.

  * The latter one is more complicated. First of all, the premise that
    one can "migrate PostgreSQL workloads to Spark" seems to be flawed.
    While both Spark and PostgreSQL evolve, and probably have more in
    common today than a few years ago, they're not close enough to
    pretend that one can be a replacement for the other. In contrast,
    existing compatibility layers between major vendors make sense,
    because feature disparity (at least when it comes to core
    functionality) is usually minimal. And that doesn't even touch the
    problem that PostgreSQL provides extensively used extension points
    that enable a broad and evolving ecosystem (what should we do about
    continuous queries? Should Structured Streaming provide some
    compatibility layer as well?).

    More realistically, Spark could provide a compatibility layer with
    some analytical tools that themselves provide some PostgreSQL
    compatibility, but these are not always fully compatible with
    upstream PostgreSQL, nor do they necessarily follow the latest
    PostgreSQL development.

    Furthermore, a compatibility layer can be, within certain limits
    (i.e. availability of required primitives), maintained as a separate
    project, without putting more strain on existing resources.
    Effectively, what we care about here is whether we can translate a
    given SQL string into a logical or physical plan (see the sketch
    below).
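
    To make that concrete, here is a minimal sketch (in Scala) of what I
    mean by a layer that lives outside the Spark codebase. The names
    (PgCompatLayer, translate) and the regex-based rewrite are purely
    illustrative; a real layer would use a proper parser, but the point
    is that Spark itself only ever sees plain Spark SQL:

        import org.apache.spark.sql.{DataFrame, SparkSession}

        // Toy "dialect front end": rewrites one PostgreSQL-ism (the
        // expr::type cast shorthand) into Spark SQL before handing the
        // string to Spark's own parser.
        object PgCompatLayer {

          // Rewrite  expr::type  into  CAST(expr AS type).
          private val castShorthand = raw"('[^']*'|\w+)::(\w+)".r

          def translate(pgSql: String): String =
            castShorthand.replaceAllIn(
              pgSql, m => s"CAST(${m.group(1)} AS ${m.group(2)})")

          def sql(spark: SparkSession, pgSql: String): DataFrame =
            spark.sql(translate(pgSql))
        }

        // Usage:
        //   PgCompatLayer.sql(spark, "SELECT '42'::int AS answer").show()

    The same idea extends to plugging a full ParserInterface in through
    spark.sql.extensions, as long as the required primitives stay
    available.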


On 11/26/19 3:26 PM, Wenchen Fan wrote:
> Hi all,
>
> Recently we started an effort to achieve feature parity between Spark
> and PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764
>
> This goes very well. We've added many missing features (parser rules,
> built-in functions, etc.) to Spark, and also corrected several
> inappropriate behaviors of Spark to follow SQL standard and
> PostgreSQL. Many thanks to all the people that contribute to it!
>
> There are several cases when adding a PostgreSQL feature:
> 1. Spark doesn't have this feature: just add it.
> 2. Spark has this feature, but the behavior is different:
>     2.1 Spark's behavior doesn't make sense: change it to follow SQL
> standard and PostgreSQL, with a legacy config to restore the behavior.
>     2.2 Spark's behavior makes sense but violates SQL standard: change
> the behavior to follow SQL standard and PostgreSQL, when the ansi mode
> is enabled (default false).
>     2.3 Spark's behavior makes sense and doesn't violate SQL standard:
> adds the PostgreSQL behavior under the PostgreSQL dialect (default is
> Spark native dialect).
>
> The PostgreSQL dialect itself is a good idea. It can help users to
> migrate PostgreSQL workloads to Spark. Other databases have this
> strategy too. For example, DB2 provides an oracle dialect
> <https://www.ibm.com/developerworks/data/library/techarticle/dm-0907oracleappsondb2/index.html>.
>
> However, there are so many differences between Spark and PostgreSQL,
> including SQL parsing, type coercion, function/operator behavior, data
> types, etc. I'm afraid that we may spend a lot of effort on it, and
> make the Spark codebase pretty complicated, but still not able to
> provide a usable PostgreSQL dialect.
>
> Furthermore, it's not clear to me how many users have the requirement
> of migrating PostgreSQL workloads. I think it's much more important to
> make Spark ANSI-compliant first, which doesn't need that much work.
>
> Recently I've seen multiple PRs adding PostgreSQL cast functions,
> while our own cast function is not ANSI-compliant yet. This makes me
> think that, we should do something to properly prioritize ANSI mode
> over other dialects.
>
> Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
> from the codebase before it's too late. Currently we only have 3
> features under PostgreSQL dialect:
> 1. when casting string to boolean, `t`, `tr`, `tru`, `yes`, .. are
> also allowed as true string.
> 2. `date - date` returns interval in Spark (SQL standard behavior),
> but returns int in PostgreSQL
> 3. `int / int` returns double in Spark, but returns int in PostgreSQL.
> (there is no standard)
>
> We should still add PostgreSQL features that Spark doesn't have, or
> Spark's behavior violates SQL standard. But for others, let's just
> update the answer files of PostgreSQL tests.
>
> Any comments are welcome!
>
> Thanks,
> Wenchen

-- 
Best regards,
Maciej
