Hive Query(Top N)

Ravisankar Mani Sun, 19 Jul 2015 23:53:56 -0700

Hi everyone

I have tried to  to achieve hierarchical based (index mode) top n creation
using spark query. it taken more time when i execute following query


Select SUM(`adventurepersoncontacts`.`contactid`)  AS
`adventurepersoncontacts_contactid` ,
`adventurepersoncontacts`.`fullname` AS `adventurepersoncontacts_fullname`
FROM  `default`.`adventurepersoncontacts` AS adventurepersoncontacts
 JOIN (
 SELECT
  `F_0`.`fullname_0_0` AS `fullname_0_0`,
  ROW_NUMBER() OVER( ORDER BY `F_0`.`Measure_0` DESC)  AS `R_N_0`
  FROM(
   SELECT
    `adventurepersoncontacts`.`fullname` AS `fullname_0_0`,
    SUM(adventurepersoncontacts.contactid) AS `Measure_0`
    FROM `default`.`adventurepersoncontacts` AS adventurepersoncontacts
    GROUP BY `adventurepersoncontacts`.`fullname`
   )`F_0`
) `T_0` on ((`adventurepersoncontacts`.`fullname` = `T_0`.`fullname_0_0` )
AND (`T_0`.`R_N_0` <= 5)) GROUP BY `adventurepersoncontacts`.`fullname

In mentioned query, I have set row index in every group according to the
aggregation type. Row_number calculation according to the aggregation like
Row_Number() Over( order by sum( column name) order by DESC) not directly
supported in spark . So i have using sub query like

 SELECT
  `F_0`.`fullname_0_0` AS `fullname_0_0`,
  ROW_NUMBER() OVER( ORDER BY `F_0`.`Measure_0` DESC)  AS `R_N_0`
  FROM(
   SELECT
    `adventurepersoncontacts`.`fullname` AS `fullname_0_0`,
    SUM(adventurepersoncontacts.contactid) AS `Measure_0`
    FROM `default`.`adventurepersoncontacts` AS adventurepersoncontacts
    GROUP BY `adventurepersoncontacts`.`fullname`
   )`F_0`
) `T_0`

The execution time was getting slowed when using following way. In case i
removed  main group by( remove  aggregation) method , it getting very fast
execution.
Refer following query.  I mentioned this query only for slow execution when
i include main group by


 Select  `adventurepersoncontacts`.`contactid`  AS
`adventurepersoncontacts_contactid` ,
`adventurepersoncontacts`.`fullname` AS `adventurepersoncontacts_fullname`
FROM  `default`.`adventurepersoncontacts` AS adventurepersoncontacts
 JOIN (
 SELECT
  `F_0`.`fullname_0_0` AS `fullname_0_0`,
  ROW_NUMBER() OVER( ORDER BY `F_0`.`Measure_0` DESC)  AS `R_N_0`
  FROM(
   SELECT
    `adventurepersoncontacts`.`fullname` AS `fullname_0_0`,
    SUM(adventurepersoncontacts.contactid) AS `Measure_0`
    FROM `default`.`adventurepersoncontacts` AS adventurepersoncontacts
    GROUP BY `adventurepersoncontacts`.`fullname`
   )`F_0`
) `T_0` on ((`adventurepersoncontacts`.`fullname` = `T_0`.`fullname_0_0` )
AND (`T_0`.`R_N_0` <= 5))


I have found other way for getting this hierarchical  query.

I have created a temp table ( The table refers  inner sub query(refer
previous query) )

 "CREATE external table IF NOT EXISTS  temp_table  AS SELECT
SUM(adventurepersoncontacts.contactid) as contactid,
adventurepersoncontacts.fullname as fullname FROM
`default`.`adventurepersoncontacts` AS adventurepersoncontacts GROUP BY
`adventurepersoncontacts`.`fullname`";



"SELECT  SUM(`adventurepersoncontacts`.`contactid`) AS
`adventurepersoncontacts_contactid` ,`adventurepersoncontacts`.`fullname`
AS `adventurepersoncontacts_fullname` FROM
`default`.`adventurepersoncontacts` AS adventurepersoncontacts JOIN (SELECT
`fullname` AS `fullname_0_0`,ROW_NUMBER() over( order by `contactid` desc)
AS `R_N_0` FROM default.temp_table) `T_0` on
((`adventurepersoncontacts`.`fullname` = `T_0`.`fullname_0_0` ) AND
(`T_0`.`R_N_0` <= 2)) GROUP BY `adventurepersoncontacts`.`fullname`";

This is other way. But this execution time was delayed when using create
table using select statement


Please help any other  optimized way to achieve my requirement.


Regards,
Ravi

Hive Query(Top N)

Reply via email to