academy-codex opened a new pull request, #54478:
URL: https://github.com/apache/spark/pull/54478

   ### What changes were proposed in this pull request?
   This PR adds support for path-like targets in `DataFrame.mergeInto`.
   
   Today the `table` argument of `mergeInto(table, condition)` is parsed 
strictly as a multipart identifier. This prevents direct use of path-like 
targets such as `/path/to/table` or `abfss://...` unless the caller manually 
wraps them in SQL-on-file syntax.
   
   The patch updates `MergeIntoWriter` to:
   - keep the existing code path for valid multipart identifiers unchanged,
   - and, when parsing fails with a `ParseException`, detect path-like targets 
and resolve them as:
     - `Seq(defaultDataSourceName, path)`
   
   This aligns `mergeInto` with SQL-on-file behavior while preserving existing 
identifier semantics.
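   The fallback described above can be sketched in Python. This is an 
illustrative toy, not the actual Scala implementation in `MergeIntoWriter`: 
the parser, the `looks_like_path` heuristic, and the `"parquet"` default are 
all stand-ins for illustration.

   ```python
   import re

   DEFAULT_DATA_SOURCE = "parquet"  # stand-in for defaultDataSourceName

   class ParseError(ValueError):
       pass

   # Toy identifier rule: dot-separated plain identifiers only.
   IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

   def parse_multipart_identifier(target):
       parts = target.split(".")
       if all(IDENT.match(p) for p in parts):
           return parts
       raise ParseError(target)

   def looks_like_path(target):
       # Absolute/relative filesystem paths and scheme://... URIs.
       return target.startswith(("/", "./")) or "://" in target

   def resolve_target(target):
       # Valid multipart identifiers keep their existing resolution;
       # on a parse failure, path-like targets fall back to
       # [defaultDataSourceName, path], mirroring Seq(defaultDataSourceName, path).
       try:
           return parse_multipart_identifier(target)
       except ParseError:
           if looks_like_path(target):
               return [DEFAULT_DATA_SOURCE, target]
           raise
   ```

   For example, `resolve_target("db.tbl")` keeps identifier semantics, while 
`resolve_target("/tmp/target")` falls back to the SQL-on-file style pair.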
   
   Additionally:
   - Scala API docs for `Dataset.mergeInto` were updated to document 
SQL-on-file and path-like targets.
   - PySpark docs for `DataFrame.mergeInto` were updated similarly.
   - New tests were added in a dedicated Scala suite and in PySpark tests.
   
   ### Why are the changes needed?
   The issue requests support for path-based usage in PySpark merge flows 
(SPARK-54418). In modern lakehouse workflows, users often operate on 
path-addressed data directly (`abfss://`, `s3://`, local paths) without catalog 
registration for every target.
   
   Without this change, `mergeInto` is inconsistent with other Spark path-based 
APIs and with SQL-on-file style table references.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes.
   
   `DataFrame.mergeInto(table, condition)` now accepts path-like `table` 
strings directly (for example `/tmp/target`, `abfss://container@account/...`) 
and interprets them as SQL-on-file targets using the default data source.
   
   Existing behavior for catalog identifiers and explicit SQL-on-file targets 
(for example ``delta.`path` ``) remains unchanged.
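   The resolved `Seq(defaultDataSourceName, path)` pair corresponds to the 
explicit SQL-on-file reference a user could have written by hand. A minimal 
sketch of that rendering, assuming the default source is `parquet` (the helper 
name is hypothetical, not part of the Spark API):

   ```python
   def sql_on_file_reference(source, path):
       # Render the resolved two-part target in SQL-on-file syntax:
       # source.`path` (the path is backquoted, as in SQL).
       return f"{source}.`{path}`"
   ```

   So a raw `/tmp/target` argument is treated like the explicit 
``parquet.`/tmp/target` `` form when `parquet` is the default data source.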
   
   ### How was this patch tested?
   Added tests:
   - `sql/core/src/test/scala/org/apache/spark/sql/classic/MergeIntoWriterSuite.scala`
     - supports raw path target
     - supports URI path target
     - keeps explicit SQL-on-file target unchanged
     - still fails invalid non-path target parse
   - `python/pyspark/sql/tests/test_dataframe.py`
     - constructor-level merge writer creation for raw path and URI path targets
   
   Executed:
   - `build/sbt -Dsbt.log.noformat=true "sql/testOnly 
org.apache.spark.sql.classic.MergeIntoWriterSuite"`
     - Passed: 4 tests, 0 failed.
   
   Additional validation:
   - `python3 -m py_compile python/pyspark/sql/tests/test_dataframe.py 
python/pyspark/sql/dataframe.py`
   
   Note: running the targeted PySpark tests via `python/run-tests` was blocked 
in this environment because the Spark assembly artifacts are not built locally.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   Generated-by: Codex (GPT-5)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

