I don't feel any differently than I did on the thread linked above: I think treating "External" as a table option is still the safest way to go about things. For the Cassandra catalog, this option wouldn't appear on our whitelist of allowed options, the same as "path" and other options that don't apply to Cassandra.
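To make the whitelist idea concrete, here is a rough sketch in Python. It is not the connector's actual code: the function, the exception, and the allowed option names are all made up for illustration, and it assumes EXTERNAL would reach the catalog as an option keyed "external".

```python
# Hypothetical allow-list for a Cassandra-style catalog. These option names
# are placeholders for illustration, not the connector's real options.
ALLOWED_OPTIONS = {"keyspace", "table", "ttl"}


class UnsupportedOptionError(ValueError):
    """Raised when a CREATE TABLE request carries options the catalog rejects."""


def validate_table_options(options):
    """Return the options with lower-cased keys, or raise if any key is not allowed.

    Assumes EXTERNAL and LOCATION arrive as ordinary options named
    "external" and "path"; neither is on the allow-list, so both are rejected.
    """
    normalized = {key.lower(): value for key, value in options.items()}
    rejected = sorted(key for key in normalized if key not in ALLOWED_OPTIONS)
    if rejected:
        raise UnsupportedOptionError(
            "Unsupported table options: " + ", ".join(rejected)
        )
    return normalized
```

With this approach the catalog needs no special case for EXTERNAL: it fails the same way "path" does, with an error that names the offending option.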
On Tue, Oct 6, 2020 at 3:54 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> I would summarize both the problem and the current state differently.
>
> Currently, Spark parses the EXTERNAL keyword for compatibility with Hive
> SQL, but Spark’s built-in catalog doesn’t allow creating a table with
> EXTERNAL unless LOCATION is also present. *This “hidden feature” breaks
> compatibility with Hive SQL* because all combinations of EXTERNAL and
> LOCATION are valid in Hive, but creating an external table with a default
> location is not allowed by Spark. Note that Spark must still handle these
> tables because it shares a metastore with Hive, which can still create them.
>
> Now catalogs can be plugged in, the question is whether to pass the fact
> that EXTERNAL was in the CREATE TABLE statement to the v2 catalog
> handling a create command, or to suppress it and apply Spark’s rule that
> LOCATION must be present.
>
> If it is not passed to the catalog, then a Hive catalog cannot implement
> the behavior of Hive SQL, even though Spark added the keyword for Hive
> compatibility. The Spark catalog can interpret EXTERNAL however Spark
> chooses to, but I think it is a poor choice to force different behavior on
> other catalogs.
>
> Wenchen has also argued that the purpose of this is to standardize
> behavior across catalogs. But hiding EXTERNAL would not accomplish that
> goal. Whether to physically delete data is a choice that is up to the
> catalog. Some catalogs have no “external” concept and will always drop data
> when a table is dropped. The ability to keep underlying data files is
> specific to a few catalogs, and whether that is controlled by EXTERNAL,
> the LOCATION clause, or something else is still up to the catalog
> implementation.
>
> I don’t think that there is a good reason to force catalogs to break
> compatibility with Hive SQL, while making it appear as though DDL is
> compatible. Because removing EXTERNAL would be a breaking change to the
> SQL parser, I think the best option is to pass it to v2 catalogs so the
> catalog can decide how to handle it.
>
> rb
>
> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Hi all,
>>
>> I'd like to start a discussion thread about this topic, as it blocks an
>> important feature that we target for Spark 3.1: unify the CREATE TABLE SQL
>> syntax.
>>
>> A bit more background for CREATE EXTERNAL TABLE: it's kind of a hidden
>> feature in Spark for Hive compatibility.
>>
>> When you write native CREATE TABLE syntax such as `CREATE EXTERNAL TABLE
>> ... USING parquet`, the parser fails and tells you that EXTERNAL can't
>> be specified.
>>
>> When we write Hive CREATE TABLE syntax, the EXTERNAL can be specified if
>> LOCATION clause or path option is present. For example, `CREATE EXTERNAL
>> TABLE ... STORED AS parquet` is not allowed as there is no LOCATION
>> clause or path option. This is not 100% Hive compatible.
>>
>> As we are unifying the CREATE TABLE SQL syntax, one problem is how to
>> deal with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as it
>> was, or we can officially support it.
>>
>> Please let us know your thoughts:
>> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do? Have
>> you used it in production before? For what use cases?
>> 2. As a catalog developer, how are you going to implement EXTERNAL TABLE?
>> It seems to me that it only makes sense for file source, as the table
>> directory can be managed. I'm not sure how to interpret EXTERNAL in
>> catalogs like jdbc, cassandra, etc.
>>
>> For more details, please refer to the long discussion in
>> https://github.com/apache/spark/pull/28026
>>
>> Thanks,
>> Wenchen
>
> --
> Ryan Blue
> Software Engineer
> Netflix