Hi everyone,

As a follow-up to earlier discussion on logical plans and the v2 data
sources, I’ve submitted a new SPIP, SPARK-23521
<https://issues.apache.org/jira/browse/SPARK-23521>.

This SPIP grew out of that DataSourceV2 implementation discussion and
formally proposes its recommendations.

The proposal is to standardize the logical plans used for write operations
to make the planner more maintainable and to make Spark’s write behavior
predictable and reliable. It proposes the following principles:

   - Use well-defined logical plan nodes for all high-level operations:
   insert, create, CTAS, overwrite table, etc.
   - Use planner rules that match on these high-level nodes, so that it
   isn’t necessary to create rules to match each eventual code path
   individually.
   - Clearly define Spark’s behavior for these logical plan nodes. Physical
   nodes should implement that behavior so that all code paths eventually make
   the same guarantees.
   - Specialize implementations when creating physical plans, not logical
   plans. This avoids behavior drift and ensures that planner code is shared
   across physical implementations.

The SPIP doc presents a small but complete set of those high-level logical
operations, most of which are already defined in SQL or implemented by some
write path in Spark.

rb
-- 
Ryan Blue
Software Engineer
Netflix
