All the existing DML APIs we support today have a source query, so they all
start with the source DataFrame, e.g. (a quick end-to-end sketch follows the
examples below):
sourceDf.write.insertInto...
sourceDf.write.saveAsTable...
sourceDf.mergeInto...
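
For reference, this is roughly what the source-first style looks like end to
end. The table names and the join condition are made up, and
DataFrame.mergeInto only exists in Spark 4.x:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().getOrCreate()
val sourceDf = spark.table("source")

// INSERT INTO an existing table, or CREATE TABLE AS SELECT:
sourceDf.write.insertInto("target")
sourceDf.write.saveAsTable("target_copy")

// MERGE INTO (Spark 4.x DataFrame API; needs a DSV2 target table):
sourceDf.mergeInto("target", col("source.id") === col("target.id"))
  .whenMatched()
  .updateAll()
  .whenNotMatched()
  .insertAll()
  .merge()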

However, this is not the case for UPDATE and DELETE, as there is no source
query. We need a different style of API for them, one that starts with the
target table. I'm in favor of option 3 due to its compile-time safety and
clear intent. We can probably support all DML APIs in this style as well,
e.g. (a rough sketch follows the list):
spark.catalog.getTable(...).update(...)
spark.catalog.getTable(...).delete(...)
spark.catalog.getTable(...).insertFrom(...)
spark.catalog.getTable(...).mergeFrom(...)
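
To make option 3 concrete, here is a rough sketch of what such a table
handle could expose. This is purely illustrative: none of these methods
exist today, and spark.catalog.getTable currently returns catalog metadata
rather than a handle with DML methods:

import org.apache.spark.sql.{Column, DataFrame}

// Hypothetical handle returned by something like spark.catalog.getTable(...)
trait DmlTable {
  // UPDATE <target> SET <assignments> WHERE <condition>
  def update(assignments: Map[String, Column], condition: Column): Unit
  // DELETE FROM <target> WHERE <condition>
  def delete(condition: Column): Unit
  // INSERT INTO <target> SELECT * FROM <source>
  def insertFrom(source: DataFrame): Unit
  // MERGE INTO <target> USING <source> ON <condition> ...
  def mergeFrom(source: DataFrame, condition: Column): MergeBuilder
}

// Hypothetical merge builder; the terminal merge() runs the statement.
trait MergeBuilder {
  def whenMatchedUpdateAll(): MergeBuilder
  def whenNotMatchedInsertAll(): MergeBuilder
  def merge(): Unit
}

The compile-time safety comes from update/delete only being reachable from a
table handle, so they can't be called on an arbitrary DataFrame in the
middle of a chain.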

Or we can make it more SQL-like (hypothetical signatures sketched below):
spark.catalog.updateTable(tableName, ...)
spark.catalog.deleteFrom(tableName, ...)
spark.catalog.mergeInto(tableName, sourceDataFrame): MergeBuilder
spark.catalog.writeInto(tableName, sourceDataFrame): DataFrameWriterV2
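
Again as a hypothetical sketch, the catalog-level style could look like the
signatures below. None of these exist on Catalog today; DataFrameWriterV2 is
the existing v2 write builder, and MergeBuilder is the same made-up builder
as in the previous sketch:

import org.apache.spark.sql.{Column, DataFrame, DataFrameWriterV2, Row}

trait MergeBuilder

// Hypothetical additions to spark.catalog:
trait CatalogDml {
  // UPDATE <tableName> SET <assignments> WHERE <condition>
  def updateTable(tableName: String, assignments: Map[String, Column], condition: Column): Unit
  // DELETE FROM <tableName> WHERE <condition>
  def deleteFrom(tableName: String, condition: Column): Unit
  // MERGE INTO <tableName> USING <source> ON ... (completed by the builder)
  def mergeInto(tableName: String, source: DataFrame): MergeBuilder
  // INSERT / OVERWRITE through the existing DataFrameWriterV2 builder
  def writeInto(tableName: String, source: DataFrame): DataFrameWriterV2[Row]
}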

On Tue, Sep 24, 2024 at 8:54 AM Szehon Ho <szehon.apa...@gmail.com> wrote:

> Hi all,
>
> In https://github.com/apache/spark/pull/47233, we are looking to add a
> Spark DataFrame API that is functionally equivalent to Spark SQL's UPDATE
> statement.
>
> There are open discussions on the PR about the location/format of the API,
> and we wanted to ask on the dev list to get more opinions.
>
> One consideration is that UPDATE SQL is an isolated, terminal operation on
> DSV2 tables only, and cannot be chained to other operations.
>
> I made a quick write up about the background and discussed options in
> https://docs.google.com/document/d/1AjkxOU06pFEzFmSbepfxdHoUGtvNAk6X1WY3zHGTW_o/edit.
> It is my first one, so please let me know if I missed something.
>
> Looking forward to hearing thoughts from more Spark devs, either in the
> PR, in the document, or in a reply to this email.
>
> Thank you,
> Szehon
>
>