I think those are fair concerns. I was mostly just updating tests for RC2 and adding "append" everywhere.
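For example, the updated writes end up looking roughly like this (a sketch of the shape of the change, not the exact test diff):

    spark.sql(s"SELECT a, b from $ks.test1")
      .write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "test_insert1", "keyspace" -> ks))
      .mode("append") // explicit, since the ErrorIfExists default now fails on DSv2
      .save()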
Code like

    spark.sql(s"SELECT a, b from $ks.test1")
      .write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "test_insert1", "keyspace" -> ks))
      .save()

now fails at runtime, while it would have succeeded before. This is probably
not a huge issue, since the majority of actual usages aren't writing to empty
tables. My main concern is that a lot of our old demos and tutorials went:

* Make the table outside of Spark
* Write to the table with Spark

Obviously these can now be done in a single operation in Spark, so that's
probably the best path forward. The old pathway is pretty awkward; I just
didn't really want it to break if it didn't have to. But I agree that having
different defaults is definitely not intuitive. The majority of other use
cases are "append" anyway, so it's not a big pain for users who aren't
running demos or just trying things out.

Thanks for commenting,
Russ

On Wed, May 20, 2020 at 5:00 PM Ryan Blue <rb...@netflix.com> wrote:

> The context on this is that it was confusing that the mode changed, which
> introduced different behaviors for the same user code when moving from v1
> to v2. Burak pointed this out, and I agree that it's weird that if your
> dependency changes from v1 to v2, your compiled Spark job starts appending
> instead of erroring out when the table exists.
>
> The work-around is to implement a new trait, SupportsCatalogOptions, that
> allows you to extract a table identifier and catalog name from the options
> in the DataFrameReader. That way, you can re-route to your catalog so that
> Spark correctly uses a CreateTableAsSelect statement for ErrorIfExists
> mode.
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsCatalogOptions.java
>
> On Wed, May 20, 2020 at 2:50 PM Russell Spitzer <russell.spit...@gmail.com>
> wrote:
>
>> While the ScalaDocs for DataFrameWriter say:
>>
>> /**
>>  * Specifies the behavior when data or table already exists. Options include:
>>  * <ul>
>>  * <li>`SaveMode.Overwrite`: overwrite the existing data.</li>
>>  * <li>`SaveMode.Append`: append the data.</li>
>>  * <li>`SaveMode.Ignore`: ignore the operation (i.e. no-op).</li>
>>  * <li>`SaveMode.ErrorIfExists`: throw an exception at runtime.</li>
>>  * </ul>
>>  * <p>
>>  * When writing to data source v1, the default option is `ErrorIfExists`.
>>  * When writing to data source v2, the default option is `Append`.
>>  *
>>  * @since 1.4.0
>>  */
>>
>> as far as I can tell, using DataFrameWriter with a TableProvider-based
>> DataSource V2 still defaults to ErrorIfExists, which breaks existing
>> code, since DSv2 cannot support ErrorIfExists mode. I noticed in the
>> history of DataFrameWriter there were versions which differentiated
>> between DSv2 and DSv1 and set the mode accordingly, but this seems to no
>> longer be the case. Was this intentional? I feel like if the default
>> were based on the source, then upgrading code from DSv1 to DSv2 would be
>> much easier for users.
>>
>> I'm currently testing this on RC2.
>>
>> Any thoughts?
>>
>> Thanks for your time as usual,
>> Russ
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
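For anyone who wants to try the SupportsCatalogOptions work-around Ryan
describes above, a minimal sketch of an implementation might look like this
(ExampleSource, example_catalog, and the option names are illustrative
stand-ins, not the actual Cassandra connector code):

    import java.util

    import org.apache.spark.sql.connector.catalog.{Identifier, SupportsCatalogOptions, Table, TableProvider}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.types.StructType
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    // Hypothetical source, not the real connector.
    class ExampleSource extends TableProvider with SupportsCatalogOptions {

      // Pull the identifier out of the options passed to the DataFrameReader/Writer
      // so Spark can route the write through the catalog and plan a
      // CreateTableAsSelect for ErrorIfExists instead of failing.
      override def extractIdentifier(options: CaseInsensitiveStringMap): Identifier =
        Identifier.of(Array(options.get("keyspace")), options.get("table"))

      // Must match a catalog registered via spark.sql.catalog.<name>.
      override def extractCatalog(options: CaseInsensitiveStringMap): String =
        "example_catalog"

      // Remaining TableProvider methods stubbed out for brevity.
      override def inferSchema(options: CaseInsensitiveStringMap): StructType = ???
      override def getTable(
          schema: StructType,
          partitioning: Array[Transform],
          properties: util.Map[String, String]): Table = ???
    }

With something like that in place (and spark.sql.catalog.example_catalog set
to a TableCatalog implementation), a plain df.write.format(...).options(...).save()
in the default ErrorIfExists mode should be planned as a CreateTableAsSelect
against that catalog rather than failing at runtime.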