Re: Spark SQL support for sub-queries

Mich Talebzadeh Fri, 26 Feb 2016 12:03:57 -0800

Good stuff

I decided to do some boundary value analysis by getting records where the
ID (unique value) is IN (min() and max()


Unfortanely Hive SQL does not yet support more than one level of sub-query.
For example this operation is perfectly valid in Oracle

select * from dummy where id IN (select MIN(ID) from dummy) OR id IN
(select MAX(ID) from dummy);

It comes back with two rows

In Hive SQL you get the following error

select * from dummy where id IN (select MIN(ID) from dummy) OR id IN
(select MAX(ID) from dummy);
FAILED: SemanticException [Error 10249]: Line 1:66 Unsupported SubQuery
Expression 'id': Only 1 SubQuery expression is supported.

So the solution I found was to use UNION in Hive

SELECT * FROM dummy WHERE id IN (SELECT MIN(ID) FROM dummy)
UNION
SELECT * FROM dummy WHERE id IN (SELECT MAX(ID) FROM dummy)
ORDER BY id

It took 2 min, 6 sec to return two rows

In FP I decided to do this

val d = HiveContext.table("test.dummy")
//Obtain boundary values
val minValue: Int = d.agg(min(col("id"))).collect.apply(0).getInt(0)
val maxValue: Int = d.agg(max(col("id"))).collect.apply(0).getInt(0)
d.filter( col("id") === lit(minValue) || col("id") ===
lit(maxValue)).orderBy(col("id")).show

It returns the rows back in 1 min. Now I am not sure whether the FP code is
most optimised but still wins

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 26 February 2016 at 14:51, Yin Yang <[email protected]> wrote:

> Since collect is involved, the approach would be slower compared to the
> SQL Mich gave in his first email.
>
> On Fri, Feb 26, 2016 at 1:42 AM, Michał Zieliński <
> [email protected]> wrote:
>
>> You need to collect the value.
>>
>> val m: Int = d.agg(max($"id")).collect.apply(0).getInt(0)
>> d.filter(col("id") === lit(m))
>>
>> On 26 February 2016 at 09:41, Mich Talebzadeh <[email protected]>
>> wrote:
>>
>>> Can this be done using DFs?
>>>
>>>
>>>
>>> scala> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>>
>>> scala> val d = HiveContext.table("test.dummy")
>>> d: org.apache.spark.sql.DataFrame = [id: int, clustered: int, scattered:
>>> int, randomised: int, random_string: string, small_vc: string, padding:
>>> string]
>>>
>>> scala>  var m = d.agg(max($"id"))
>>> m: org.apache.spark.sql.DataFrame = [max(id): int]
>>>
>>> How can I join these two? In other words I want to get all rows with id
>>> = m here?
>>>
>>> d.filter($"id" = m)  ?
>>>
>>> Thanks
>>>
>>> On 25/02/2016 22:58, Mohammad Tariq wrote:
>>>
>>> AFAIK, this isn't supported yet. A ticket
>>> <https://issues.apache.org/jira/browse/SPARK-4226> is in progress
>>> though.
>>>
>>>
>>>
>>> [image: http://] <http://about.me/mti>
>>>
>>> Tariq, Mohammad
>>> about.me/mti
>>> [image: http://]
>>>
>>>
>>>
>>> On Fri, Feb 26, 2016 at 4:16 AM, Mich Talebzadeh <
>>> [email protected]> wrote:
>>>
>>>>
>>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> I guess the following confirms that Spark does bot support sub-queries
>>>>
>>>>
>>>>
>>>> val d = HiveContext.table("test.dummy")
>>>>
>>>> d.registerTempTable("tmp")
>>>>
>>>> HiveContext.sql("select * from tmp where id IN (select max(id) from
>>>> tmp)")
>>>>
>>>> It crashes
>>>>
>>>> The SQL works OK in Hive itself on the underlying table!
>>>>
>>>> select * from dummy where id IN (select max(id) from dummy);
>>>>
>>>>
>>>>
>>>> Thanks
>>>>
>>>
>>
>

Re: Spark SQL support for sub-queries

Reply via email to