Re: SQL language vs DataFrame API

Michael Armbrust Wed, 09 Dec 2015 16:30:59 -0800

Yeah, I would like to address any actual gaps in functionality that are
present.


On Wed, Dec 9, 2015 at 4:24 PM, Cristian Opris <[email protected]>
wrote:

> I'm asking because it's important in larger projects to be able to stick
> to a particular programming style. Some people are more comfortable with
> SQL, others might find the DF API more suitable, but it's important to
> have full expressivity in both, so that a project can adopt one approach
> rather than having to mix and match to achieve full functionality.
>
> On 9 December 2015 at 19:41, Xiao Li <[email protected]> wrote:
>
>> That sounds great! When it is decided, please let us know and we can add
>> more features and make it ANSI SQL compliant.
>>
>> Thank you!
>>
>> Xiao Li
>>
>>
>> 2015-12-09 11:31 GMT-08:00 Michael Armbrust <[email protected]>:
>>
>>> I don't plan to abandon HiveQL compatibility, but I'd like to see us
>>> move towards something with more SQL compliance (perhaps just newer
>>> versions of the HiveQL parser).  Exactly which parser will do that for us
>>> is under investigation.
>>>
>>> On Wed, Dec 9, 2015 at 11:02 AM, Xiao Li <[email protected]> wrote:
>>>
>>>> Hi, Michael,
>>>>
>>>> Does that mean SQLContext will be built on HiveQL in the near future?
>>>>
>>>> Thanks,
>>>>
>>>> Xiao Li
>>>>
>>>>
>>>> 2015-12-09 10:36 GMT-08:00 Michael Armbrust <[email protected]>:
>>>>
>>>>> I think that it is generally good to have parity when the
>>>>> functionality is useful.  However, in some cases various features are
>>>>> there just to maintain compatibility with other systems.  For example,
>>>>> CACHE TABLE is eager because Shark's cache table was.  df.cache() is
>>>>> lazy because Spark's cache is.  Does that mean that we need to add some
>>>>> eager caching mechanism to dataframes to have parity?  Probably not;
>>>>> users can just call .count() if they want to force materialization.
>>>>>
>>>>> Regarding the differences between HiveQL and the SQLParser, I think we
>>>>> should get rid of the SQL parser.  It's kind of a hack that I built just
>>>>> so that there was some SQL story for people who didn't compile with
>>>>> Hive.  Moving forward, I'd like to see the distinction between the
>>>>> HiveContext and SQLContext removed so that we can standardize on a
>>>>> single parser.  For this reason I'd be opposed to spending a lot of
>>>>> dev/reviewer time on adding features there.
>>>>>
>>>>> On Wed, Dec 9, 2015 at 8:34 AM, Cristian O <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I was wondering what the "official" view is on feature parity between
>>>>>> the SQL and DF APIs. Docs are pretty sparse on the SQL front, and it
>>>>>> seems that some features are supported, at various times, in only one
>>>>>> of the Spark SQL dialect, the HiveQL dialect, and the DF API.
>>>>>> DF.cube(), DISTRIBUTE BY, and CACHE LAZY are some examples.
>>>>>>
>>>>>> Is there an explicit goal of having consistent support for all
>>>>>> features in both DF and SQL?
>>>>>>
>>>>>> Thanks,
>>>>>> Cristian
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
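
The caching point quoted above (CACHE TABLE is eager, df.cache() is lazy,
and .count() can force materialization) can be illustrated with a toy
model. This is only a sketch in plain Python and does not use Spark: the
names ToyFrame and cache_eagerly are hypothetical and stand in for a
DataFrame and for an eager cache command, assuming the semantics described
in the thread.

```python
# Toy model only: plain Python, not Spark's API.
# ToyFrame and cache_eagerly are hypothetical names used for illustration.


class ToyFrame:
    """Stands in for a DataFrame whose rows are expensive to compute."""

    def __init__(self, compute):
        self._compute = compute        # function that produces the rows
        self._cache_requested = False  # set by cache(), which does no work
        self._cached_rows = None       # filled in by the first action
        self.computations = 0          # how many times rows were computed

    def _rows(self):
        # Reuse the cached rows if a previous action already stored them.
        if self._cached_rows is not None:
            return self._cached_rows
        self.computations += 1
        rows = list(self._compute())
        if self._cache_requested:
            self._cached_rows = rows
        return rows

    def cache(self):
        """Lazy, in the spirit of df.cache(): only marks data for caching."""
        self._cache_requested = True
        return self

    def count(self):
        """An action: computes the rows, filling the cache if requested."""
        return len(self._rows())

    def is_materialized(self):
        return self._cached_rows is not None


def cache_eagerly(frame):
    """Eager, in the spirit of CACHE TABLE: mark, then force with count()."""
    frame.cache().count()
    return frame


lazy = ToyFrame(lambda: range(5)).cache()
assert not lazy.is_materialized()  # cache() alone computed nothing
lazy.count()                       # the first action fills the cache
assert lazy.is_materialized()
lazy.count()
assert lazy.computations == 1      # the second count() reused the cache

eager = cache_eagerly(ToyFrame(lambda: range(5)))
assert eager.is_materialized()     # cached as soon as the call returns
```

In this model the eager variant is just the lazy one followed by an
action, which is the argument made above for not adding a separate eager
caching mechanism to dataframes.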
