They are based on a physical column; the column is real. It is only the function that exists in the data source.
For example: SELECT ttl(a), ttl(b) FROM ks.tab

On Tue, Sep 4, 2018 at 11:16 PM Reynold Xin <r...@databricks.com> wrote:

> Russell, your special columns wouldn't actually work with option 1, because Spark would have to fail them in analysis without an actual physical column.
>
> On Tue, Sep 4, 2018 at 9:12 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
>> I'm a big fan of 1 as well. I had to implement something similar using custom expressions, and it was a bit more work than it should be. In particular, our use case is that columns have certain metadata (ttl, writetime) which exist not as separate columns but as special values that can be surfaced.
>>
>> I still don't have a good solution for the same thing at write time, though, since the problem is a bit asymmetric for us: while you can read the metadata from any particular cell, on write you specify it for the whole row.
>>
>> On Tue, Sep 4, 2018 at 11:04 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>
>>> Thanks for posting the summary. I'm strongly in favor of option 1.
>>>
>>> I think the API footprint is fairly small, but worth it. Not only does it make sources easier to implement by handling parsing, it also makes sources more reliable because Spark handles validation the same way across sources.
>>>
>>> A good example is making sure that the referenced columns exist in the table, which should be done using the case sensitivity of the analyzer. Spark would pass normalized column names that match the case of the declared columns, to ensure that there isn't a problem if Spark is case insensitive but the source doesn't implement case insensitivity. The source wouldn't have to know about Spark's case sensitivity settings at all.
>>>
>>> On Tue, Sep 4, 2018 at 7:46 PM Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> Ryan, Michael and I discussed this offline today. Some notes here:
>>>>
>>>> His use case is to support partitioning data by derived columns rather than physical columns, because he didn't want his users to keep adding the "date" column when in reality it is purely derived from some timestamp column. We reached consensus that this is a great use case and something we should support.
>>>>
>>>> We are still debating how to do this at the API level. Two options:
>>>>
>>>> *Option 1.* Create a smaller-surface, parallel Expression library and use that for specifying partition columns. The bare-minimum class hierarchy would look like:
>>>>
>>>> trait Expression
>>>>
>>>> class NamedFunction(name: String, args: Seq[Expression]) extends Expression
>>>>
>>>> class Literal(value: Any) extends Expression
>>>>
>>>> class ColumnReference(name: String) extends Expression
>>>>
>>>> These classes don't define how the expressions are evaluated; it would be up to the data sources to interpret them. As an example, for a table partitioned by date(ts), Spark would pass the following to the underlying data source:
>>>>
>>>> NamedFunction("date", ColumnReference("timestamp") :: Nil)
>>>>
>>>> *Option 2.* Spark passes strings over to the data sources. For the above example, Spark simply passes the string "date(ts)" over.
>>>>
>>>> The pros/cons of 1 vs 2 are basically the inverse of each other. Option 1 creates more rigid structure, with extra complexity in API design. Option 2 is less structured but more flexible.
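To make the two options concrete, here is a minimal sketch in Scala of how a data source might consume what each option hands it. The case classes mirror the hierarchy quoted above; the interpreter itself (PartitionSpecSketch, describe, parsePartitionString) is hypothetical and not part of the proposal.

    object PartitionSpecSketch {

      // The minimal hierarchy from the message above, as case classes so the
      // source can pattern-match on it.
      sealed trait Expression
      case class NamedFunction(name: String, args: Seq[Expression]) extends Expression
      case class ColumnReference(name: String) extends Expression
      case class Literal(value: Any) extends Expression

      // Option 1: the source walks a small, pre-parsed tree; Spark has already
      // checked the column references, so only the function name needs checking.
      def describe(expr: Expression): String = expr match {
        case NamedFunction(fn, args) => s"$fn(${args.map(describe).mkString(", ")})"
        case ColumnReference(col)    => col
        case Literal(v)              => v.toString
      }

      // Option 2: the source receives a raw string such as "date(ts)" and has to
      // parse and validate everything itself.
      def parsePartitionString(raw: String): (String, Seq[String]) = {
        val call = """(\w+)\((.*)\)""".r
        raw.trim match {
          case call(fn, args) => (fn, args.split(",").map(_.trim).toSeq)
          case column         => ("identity", Seq(column))
        }
      }

      // describe(NamedFunction("date", ColumnReference("timestamp") :: Nil)) == "date(timestamp)"
      // parsePartitionString("date(ts)")                                     == ("date", Seq("ts"))
    }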
>>>> Option 1 gives Spark the opportunity to enforce that column references are valid (though not the actual function names), whereas with option 2 validation would be entirely up to the data sources.
>>>>
>>>> On Wed, Aug 15, 2018 at 2:27 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>>> I think I found a good solution to the problem of using Expression in the TableCatalog API and in the DeleteSupport API.
>>>>>
>>>>> For DeleteSupport, there is already a stable and public subset of Expression, named Filter, that can be used to pass filters. The reason why DeleteSupport would use Expression is to support more complex expressions, like to_date(ts) = '2018-08-15' being translated to ts >= 1534316400000000 AND ts < 1534402800000000. But this can be done in Spark instead of in the data sources, so I think DeleteSupport should use Filter instead. I updated the DeleteSupport PR #21308 <https://github.com/apache/spark/pull/21308> with these changes.
>>>>>
>>>>> Also, I agree that the DataSourceV2 API should not expose Expression, so I opened SPARK-25127 <https://issues.apache.org/jira/browse/SPARK-25127> to track removing SupportsPushDownCatalystFilter.
>>>>>
>>>>> For TableCatalog, I took a similar approach instead of introducing a parallel Expression API: I created a PartitionTransform API (like Filter) that communicates the transformation function, function parameters like the number of buckets, and column references. I updated the TableCatalog PR #21306 <https://github.com/apache/spark/pull/21306> to use PartitionTransform instead of Expression, and I updated the text of the SPIP doc <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d>.
>>>>>
>>>>> I also raised a concern about needing to wait for Spark to add support for new expressions (now partition transforms). To get around this, I added an apply transform that passes the name of a function and an input column. That way, users can still pass transforms that Spark doesn't know about to data sources by name: apply("source_function", "colName").
>>>>>
>>>>> Please have a look at the updated pull requests and SPIP doc and comment!
>>>>>
>>>>> rb
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
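For the PartitionTransform approach in Ryan's earlier message, the following is a minimal sketch in Scala of roughly the shape such an API could take: a transform name, parameters such as the number of buckets, column references, and an apply escape hatch for functions Spark doesn't know about. All names here are hypothetical; the real definitions live in the linked TableCatalog PR and may differ.

    // A sketch only: transform name + parameters + column references, with an
    // "apply" escape hatch for source-specific functions.
    sealed trait PartitionTransform {
      def name: String
      def references: Seq[String] // columns the transform reads
    }

    case class DateTransform(column: String) extends PartitionTransform {
      val name = "date"
      def references: Seq[String] = Seq(column)
    }

    case class BucketTransform(numBuckets: Int, column: String) extends PartitionTransform {
      val name = "bucket"
      def references: Seq[String] = Seq(column)
    }

    // Escape hatch: pass an arbitrary function name straight through to the source.
    case class ApplyTransform(name: String, column: String) extends PartitionTransform {
      def references: Seq[String] = Seq(column)
    }

    // e.g. partitioning by date(ts), bucket(16, id), and a source-specific function:
    // Seq(DateTransform("ts"), BucketTransform(16, "id"), ApplyTransform("source_function", "colName"))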