All the existing DML APIs we support today have a source query, so they all start with the source DataFrame, e.g.

sourceDf.write.insertInto(...)
sourceDf.write.saveAsTable(...)
sourceDf.mergeInto(...)
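For example, a minimal sketch of that source-first style in Scala (the merge builder method names and the join-condition column references below are from memory and may differ slightly; the table names are made up):

    import org.apache.spark.sql.functions.col

    // example source; the table name is only illustrative
    val sourceDf = spark.table("db.staging_events")

    // append the source DataFrame into an existing table
    sourceDf.write.insertInto("db.events")

    // create (or overwrite) a table from the source DataFrame
    sourceDf.write.mode("overwrite").saveAsTable("db.events_copy")

    // merge the source DataFrame into a target table (DataFrame merge API added in Spark 4.0)
    sourceDf.mergeInto("db.events", col("db.events.id") === sourceDf("id"))
      .whenMatched().updateAll()
      .whenNotMatched().insertAll()
      .merge()

In every case the operation is anchored on the source DataFrame, not on the target table.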
However, this is not the case for UPDATE and DELETE, as there is no source query. We need a different style of API for them, one that starts with the target table. I'm in favor of option 3 due to its compile-time safety and clear intention. We could probably support all the DML APIs in this style as well, e.g.

spark.catalog.getTable(...).update(...)
spark.catalog.getTable(...).delete(...)
spark.catalog.getTable(...).insertFrom(...)
spark.catalog.getTable(...).mergeFrom(...)

Or we can make it more like SQL:

spark.catalog.updateTable(tableName, ...)
spark.catalog.deleteFrom(tableName, ...)
spark.catalog.mergeInto(tableName, sourceDataFrame): MergeBuilder
spark.catalog.writeInto(tableName, sourceDataFrame): DataFrameWriterV2

A rough sketch of how the table-first style could look is appended after the quoted message below.

On Tue, Sep 24, 2024 at 8:54 AM Szehon Ho <szehon.apa...@gmail.com> wrote:
> Hi all,
>
> In https://github.com/apache/spark/pull/47233, we are looking to add a
> Spark DataFrame API for functional equivalence to Spark SQL's UPDATE
> statement.
>
> There are open discussions on the PR about the location/format of the API,
> and we wanted to ask on the dev list to get more opinions.
>
> One consideration is that the UPDATE SQL statement is an isolated, terminal
> operation only on DSv2 tables that cannot be chained to other operations.
>
> I made a quick write-up about the background and discussed the options in
> https://docs.google.com/document/d/1AjkxOU06pFEzFmSbepfxdHoUGtvNAk6X1WY3zHGTW_o/edit.
> It is my first one, so please let me know if I missed something.
>
> I look forward to hearing thoughts from more Spark devs, either in the PR,
> in the document, or in reply to this email.
>
> Thank you,
> Szehon
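For illustration, here is a rough, purely hypothetical sketch of the table-first style in Scala. None of these methods exist today; the method names, their signatures, the table and column names, and the assumption that spark.catalog.getTable(...) would return a handle exposing DML methods are all made up for the example:

    import org.apache.spark.sql.functions.{col, lit}

    // hypothetical: assumes getTable(...) returns a writable table handle
    val people = spark.catalog.getTable("db.people")

    // equivalent of: UPDATE db.people SET status = 'inactive' WHERE last_login < '2020-01-01'
    people.update(
      condition = col("last_login") < lit("2020-01-01"),
      assignments = Map("status" -> lit("inactive")))

    // equivalent of: DELETE FROM db.people WHERE status = 'inactive'
    people.delete(col("status") === lit("inactive"))

    // hypothetical merge, mirroring the existing source-first mergeInto builder
    val updatesDf = spark.table("db.people_updates")
    people.mergeFrom(updatesDf, col("db.people.id") === updatesDf("id"))
      .whenMatched().updateAll()
      .whenNotMatched().insertAll()
      .merge()

This is only meant to make the shape of option 3 concrete; the actual names and return types would need to be settled in the design doc.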