Hi everyone
I have tried to to achieve hierarchical based (index mode) top n creation
using spark query. it taken more time when i execute following query
Select SUM(`adventurepersoncontacts`.`contactid`) AS
`adventurepersoncontacts_contactid` ,
`adventurepersoncontacts`.`fullname` AS `adventurepersoncontacts_fullname`
FROM `default`.`adventurepersoncontacts` AS adventurepersoncontacts
JOIN (
SELECT
`F_0`.`fullname_0_0` AS `fullname_0_0`,
ROW_NUMBER() OVER( ORDER BY `F_0`.`Measure_0` DESC) AS `R_N_0`
FROM(
SELECT
`adventurepersoncontacts`.`fullname` AS `fullname_0_0`,
SUM(adventurepersoncontacts.contactid) AS `Measure_0`
FROM `default`.`adventurepersoncontacts` AS adventurepersoncontacts
GROUP BY `adventurepersoncontacts`.`fullname`
)`F_0`
) `T_0` on ((`adventurepersoncontacts`.`fullname` = `T_0`.`fullname_0_0` )
AND (`T_0`.`R_N_0` <= 5)) GROUP BY `adventurepersoncontacts`.`fullname
In mentioned query, I have set row index in every group according to the
aggregation type. Row_number calculation according to the aggregation like
Row_Number() Over( order by sum( column name) order by DESC) not directly
supported in spark . So i have using sub query like
SELECT
`F_0`.`fullname_0_0` AS `fullname_0_0`,
ROW_NUMBER() OVER( ORDER BY `F_0`.`Measure_0` DESC) AS `R_N_0`
FROM(
SELECT
`adventurepersoncontacts`.`fullname` AS `fullname_0_0`,
SUM(adventurepersoncontacts.contactid) AS `Measure_0`
FROM `default`.`adventurepersoncontacts` AS adventurepersoncontacts
GROUP BY `adventurepersoncontacts`.`fullname`
)`F_0`
) `T_0`
The execution time was getting slowed when using following way. In case i
removed main group by( remove aggregation) method , it getting very fast
execution.
Refer following query. I mentioned this query only for slow execution when
i include main group by
Select `adventurepersoncontacts`.`contactid` AS
`adventurepersoncontacts_contactid` ,
`adventurepersoncontacts`.`fullname` AS `adventurepersoncontacts_fullname`
FROM `default`.`adventurepersoncontacts` AS adventurepersoncontacts
JOIN (
SELECT
`F_0`.`fullname_0_0` AS `fullname_0_0`,
ROW_NUMBER() OVER( ORDER BY `F_0`.`Measure_0` DESC) AS `R_N_0`
FROM(
SELECT
`adventurepersoncontacts`.`fullname` AS `fullname_0_0`,
SUM(adventurepersoncontacts.contactid) AS `Measure_0`
FROM `default`.`adventurepersoncontacts` AS adventurepersoncontacts
GROUP BY `adventurepersoncontacts`.`fullname`
)`F_0`
) `T_0` on ((`adventurepersoncontacts`.`fullname` = `T_0`.`fullname_0_0` )
AND (`T_0`.`R_N_0` <= 5))
I have found other way for getting this hierarchical query.
I have created a temp table ( The table refers inner sub query(refer
previous query) )
"CREATE external table IF NOT EXISTS temp_table AS SELECT
SUM(adventurepersoncontacts.contactid) as contactid,
adventurepersoncontacts.fullname as fullname FROM
`default`.`adventurepersoncontacts` AS adventurepersoncontacts GROUP BY
`adventurepersoncontacts`.`fullname`";
"SELECT SUM(`adventurepersoncontacts`.`contactid`) AS
`adventurepersoncontacts_contactid` ,`adventurepersoncontacts`.`fullname`
AS `adventurepersoncontacts_fullname` FROM
`default`.`adventurepersoncontacts` AS adventurepersoncontacts JOIN (SELECT
`fullname` AS `fullname_0_0`,ROW_NUMBER() over( order by `contactid` desc)
AS `R_N_0` FROM default.temp_table) `T_0` on
((`adventurepersoncontacts`.`fullname` = `T_0`.`fullname_0_0` ) AND
(`T_0`.`R_N_0` <= 2)) GROUP BY `adventurepersoncontacts`.`fullname`";
This is other way. But this execution time was delayed when using create
table using select statement
Please help any other optimized way to achieve my requirement.
Regards,
Ravi