Good stuff
I decided to do some boundary value analysis by getting records where the
ID (unique value) is IN (min() and max()
Unfortanely Hive SQL does not yet support more than one level of sub-query.
For example this operation is perfectly valid in Oracle
select * from dummy where id IN (select MIN(ID) from dummy) OR id IN
(select MAX(ID) from dummy);
It comes back with two rows
In Hive SQL you get the following error
select * from dummy where id IN (select MIN(ID) from dummy) OR id IN
(select MAX(ID) from dummy);
FAILED: SemanticException [Error 10249]: Line 1:66 Unsupported SubQuery
Expression 'id': Only 1 SubQuery expression is supported.
So the solution I found was to use UNION in Hive
SELECT * FROM dummy WHERE id IN (SELECT MIN(ID) FROM dummy)
UNION
SELECT * FROM dummy WHERE id IN (SELECT MAX(ID) FROM dummy)
ORDER BY id
It took 2 min, 6 sec to return two rows
In FP I decided to do this
val d = HiveContext.table("test.dummy")
//Obtain boundary values
val minValue: Int = d.agg(min(col("id"))).collect.apply(0).getInt(0)
val maxValue: Int = d.agg(max(col("id"))).collect.apply(0).getInt(0)
d.filter( col("id") === lit(minValue) || col("id") ===
lit(maxValue)).orderBy(col("id")).show
It returns the rows back in 1 min. Now I am not sure whether the FP code is
most optimised but still wins
HTH
Dr Mich Talebzadeh
LinkedIn *
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
http://talebzadehmich.wordpress.com
On 26 February 2016 at 14:51, Yin Yang <[email protected]> wrote:
> Since collect is involved, the approach would be slower compared to the
> SQL Mich gave in his first email.
>
> On Fri, Feb 26, 2016 at 1:42 AM, Michał Zieliński <
> [email protected]> wrote:
>
>> You need to collect the value.
>>
>> val m: Int = d.agg(max($"id")).collect.apply(0).getInt(0)
>> d.filter(col("id") === lit(m))
>>
>> On 26 February 2016 at 09:41, Mich Talebzadeh <[email protected]>
>> wrote:
>>
>>> Can this be done using DFs?
>>>
>>>
>>>
>>> scala> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>>
>>> scala> val d = HiveContext.table("test.dummy")
>>> d: org.apache.spark.sql.DataFrame = [id: int, clustered: int, scattered:
>>> int, randomised: int, random_string: string, small_vc: string, padding:
>>> string]
>>>
>>> scala> var m = d.agg(max($"id"))
>>> m: org.apache.spark.sql.DataFrame = [max(id): int]
>>>
>>> How can I join these two? In other words I want to get all rows with id
>>> = m here?
>>>
>>> d.filter($"id" = m) ?
>>>
>>> Thanks
>>>
>>> On 25/02/2016 22:58, Mohammad Tariq wrote:
>>>
>>> AFAIK, this isn't supported yet. A ticket
>>> <https://issues.apache.org/jira/browse/SPARK-4226> is in progress
>>> though.
>>>
>>>
>>>
>>> [image: http://] <http://about.me/mti>
>>>
>>> Tariq, Mohammad
>>> about.me/mti
>>> [image: http://]
>>>
>>>
>>>
>>> On Fri, Feb 26, 2016 at 4:16 AM, Mich Talebzadeh <
>>> [email protected]> wrote:
>>>
>>>>
>>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> I guess the following confirms that Spark does bot support sub-queries
>>>>
>>>>
>>>>
>>>> val d = HiveContext.table("test.dummy")
>>>>
>>>> d.registerTempTable("tmp")
>>>>
>>>> HiveContext.sql("select * from tmp where id IN (select max(id) from
>>>> tmp)")
>>>>
>>>> It crashes
>>>>
>>>> The SQL works OK in Hive itself on the underlying table!
>>>>
>>>> select * from dummy where id IN (select max(id) from dummy);
>>>>
>>>>
>>>>
>>>> Thanks
>>>>
>>>
>>
>