Re: SQL language vs DataFrame API

Michael Armbrust Wed, 09 Dec 2015 10:44:25 -0800

I think that it is generally good to have parity when the functionality is
useful.  However, in some cases features exist only to maintain
compatibility with other systems.  For example, CACHE TABLE is eager because
Shark's cache table was, while df.cache() is lazy because Spark's cache is.
Does that mean that we need to add some eager caching mechanism to
dataframes to have parity?  Probably not; users can just call .count() if
they want to force materialization.
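
To make the lazy/eager distinction concrete, here is a rough toy model in
plain Python.  It is not Spark code and does not reflect how Spark implements
caching; the ToyFrame class and its cache_eagerly() helper are hypothetical
names made up for this sketch.  It only shows that a lazy cache() does no
work until an action such as count() runs, and that eager behaviour can be
had by calling count() right after cache().

```python
# Toy model only: NOT Spark code. ToyFrame and cache_eagerly are
# hypothetical names used to illustrate lazy vs. eager caching.
class ToyFrame:
    def __init__(self, compute):
        self._compute = compute        # zero-argument function yielding rows
        self._cache_requested = False  # set by cache(), nothing is computed yet
        self._cached_rows = None       # filled in by the first action after cache()
        self.compute_calls = 0         # how many times the rows were recomputed

    def _rows(self):
        # Serve from the cache if it has already been materialized.
        if self._cached_rows is not None:
            return self._cached_rows
        self.compute_calls += 1
        rows = list(self._compute())
        if self._cache_requested:
            self._cached_rows = rows
        return rows

    def cache(self):
        # Lazy: only records the intent to cache.
        self._cache_requested = True
        return self

    def count(self):
        # An action: forces the rows to be computed (and cached if requested).
        return len(self._rows())

    def cache_eagerly(self):
        # Eager caching expressed as cache() followed by an action.
        self.cache()
        self.count()
        return self


lazy = ToyFrame(lambda: range(5)).cache()
work_before_action = lazy.compute_calls   # nothing computed yet
lazy.count()
lazy.count()                              # second count is served from the cache
work_after_actions = lazy.compute_calls

eager = ToyFrame(lambda: range(5)).cache_eagerly()
work_after_eager_cache = eager.compute_calls
```

In Spark terms, the analogue of cache_eagerly() is simply df.cache()
followed by df.count(); on the SQL side, CACHE TABLE is the eager form and
CACHE LAZY TABLE the lazy one.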

Regarding the differences between HiveQL and the SQLParser, I think we
should get rid of the SQL parser.  It's kind of a hack that I built just so
that there was some SQL story for people who didn't compile with Hive.
Moving forward, I'd like to see the distinction between the HiveContext and
SQLContext removed and we can standardize on a single parser.  For this
reason I'd be opposed to spending a lot of dev/reviewer time on adding
features there.

On Wed, Dec 9, 2015 at 8:34 AM, Cristian O <[email protected]>
wrote:

> Hi,
>
> I was wondering what the "official" view is on feature parity between the
> SQL and DF APIs. Docs are pretty sparse on the SQL front, and it seems that
> some features are, at various times, supported in only one of the Spark SQL
> dialect, the HiveQL dialect and the DF API. DF.cube(), DISTRIBUTE BY and
> CACHE LAZY are some examples.
>
> Is there an explicit goal of having consistent support for all features in
> both DF and SQL ?
>
> Thanks,
> Cristian
>
