Re: Spark SQL support for sub-queries

Michał Zieliński Fri, 26 Feb 2016 01:55:17 -0800

Spark has great documentation
<https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.package>
and guides <https://spark.apache.org/docs/latest/programming-guide.html>:


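lit and col (both in the org.apache.spark.sql.functions object) are here
<https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.package>
getInt is a method on Row, documented here
<https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.Row>
apply(0) is just a method on Array, which is what collect returns (here
<https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.DataFrame>).
So apply(0) picks the first Row of the collected result and getInt(0) reads
the first column of that Row as an Int.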
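
To make the chain of calls concrete, here is a minimal sketch that splits it
into separate steps. It assumes a spark-shell session in which d is the
DataFrame from earlier in this thread, with an integer column named id. The
names rows, firstRow, maxId and max_id are only illustrative. The join at the
end is an alternative I am suggesting, not something from the earlier
messages.

```scala
import org.apache.spark.sql.functions.{col, lit, max}

// Assumption: `d` is the DataFrame from the thread and has an int column "id".

// collect() brings the aggregated result to the driver as an Array[Row].
// An aggregation without groupBy yields exactly one Row.
val rows = d.agg(max(col("id"))).collect()

// apply(0) is ordinary Scala Array indexing; rows(0) is the same thing.
val firstRow = rows.apply(0)

// getInt(0) reads column 0 of that Row as an Int.
// Note: on an empty table max(id) is null and this call fails.
val m: Int = firstRow.getInt(0)

// Filter the original DataFrame on the collected value.
d.filter(col("id") === lit(m)).show()

// Alternative that keeps everything in DataFrames: join against the
// one-row aggregate instead of collecting it to the driver.
val maxId = d.agg(max(col("id")).as("max_id"))
d.join(maxId, col("id") === col("max_id")).drop("max_id").show()
```

The collect version costs one extra job but leaves a plain Int to work with.
The join version avoids the round trip to the driver and simply returns no
rows when the table is empty.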

On 26 February 2016 at 10:47, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Thanks Michael. Great
>
>  d.filter(col("id") === lit(m)).show
>
> BTW, where are all these methods like lit etc. documented? Also, I guess any
> action call like apply(0) or getInt(0) refers to the "current" parameter?
>
> Regards
>
> On 26 February 2016 at 09:42, Michał Zieliński <
> zielinski.mich...@gmail.com> wrote:
>
>> You need to collect the value.
>>
>> val m: Int = d.agg(max($"id")).collect.apply(0).getInt(0)
>> d.filter(col("id") === lit(m))
>>
>> On 26 February 2016 at 09:41, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> Can this be done using DFs?
>>>
>>>
>>>
>>> scala> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>>
>>> scala> val d = HiveContext.table("test.dummy")
>>> d: org.apache.spark.sql.DataFrame = [id: int, clustered: int, scattered:
>>> int, randomised: int, random_string: string, small_vc: string, padding:
>>> string]
>>>
>>> scala>  var m = d.agg(max($"id"))
>>> m: org.apache.spark.sql.DataFrame = [max(id): int]
>>>
>>> How can I join these two? In other words I want to get all rows with id
>>> = m here?
>>>
>>> d.filter($"id" = m)  ?
>>>
>>> Thanks
>>>
>>> On 25/02/2016 22:58, Mohammad Tariq wrote:
>>>
>>> AFAIK, this isn't supported yet. A ticket
>>> <https://issues.apache.org/jira/browse/SPARK-4226> is in progress
>>> though.
>>>
>>>
>>>
>>>
>>> Tariq, Mohammad
>>> about.me/mti
>>>
>>>
>>>
>>> On Fri, Feb 26, 2016 at 4:16 AM, Mich Talebzadeh <
>>> mich.talebza...@cloudtechnologypartners.co.uk> wrote:
>>>
>>>>
>>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> I guess the following confirms that Spark does not support sub-queries
>>>>
>>>>
>>>>
>>>> val d = HiveContext.table("test.dummy")
>>>>
>>>> d.registerTempTable("tmp")
>>>>
>>>> HiveContext.sql("select * from tmp where id IN (select max(id) from
>>>> tmp)")
>>>>
>>>> It crashes
>>>>
>>>> The SQL works OK in Hive itself on the underlying table!
>>>>
>>>> select * from dummy where id IN (select max(id) from dummy);
>>>>
>>>>
>>>>
>>>> Thanks
>>>>
>>>
>>
>
