Hi All,

I am having performance issue with data skew of the distinct statement in
Hive
<http://stackoverflow.com/questions/37894023/understanding-the-data-skew-of-the-countdistinct-statement-in-hive>.
See below query with DISTINCT operator.
*Original Query : *

SELECT DISTINCT
                 SD.REGION
,SD.HEADEND
,SD.NETWORK
,SD.RETAILUNITCODE
,SD.LOGTIMEDATE
,SD.SPOTKEY
,SD.CRE_DT
,CASE
WHEN SD.LOGTIMEDATE IS NULL
THEN 'Y'
ELSE 'N'
END AS DROP_REASON
,ROW_NUMBER() OVER (
ORDER BY NULL
) AS ETL_ROW_ID
FROM INTER_ETL AS SD;

Table *INTER_ETL *used for query is big enough.
>From the logs , it seems that data skew for specific set of values ,
causing one of reducer have to do all the job. I tried to achieve the same
through GROUP BY still having the same issue.  Help me to understand the
issue and resolution.
*Query with Distinct V2 :*

CREATE TEMPORARY TABLE ETL_TMP AS
> SELECT DISTINCT dt.*
> FROM (
> SELECT SD.REGION
> ,SD.HEADEND
> ,SD.NETWORK
> ,SD.RETAILUNITCODE
> ,SD.LOGTIMEDATE
> ,SD.SPOTKEY
> ,SD.CRE_DT
> ,CASE
> WHEN SD.LOGTIMEDATE IS NULL
> THEN 'Y'
> ELSE 'N'
> END AS DROP_REASON
> ,ROW_NUMBER() OVER (
> ORDER BY NULL
> ) AS ETL_ROW_ID
> FROM INTER_ETL AS SD
> ) AS dt;

Logs:

INFO : Map 1: 107/107 Reducer 2: 417(+1)/418 Reducer 3: 0(+56)/418
> INFO : Map 1: 107/107 Reducer 2: 417(+1)/418 Reducer 3: 0(+56)/418
> INFO : Map 1: 107/107 Reducer 2: 417(+1)/418 Reducer 3: 0(+56)/418
> INFO : Map 1: 107/107 Reducer 2: 417(+1)/418 Reducer 3: 0(+56)/418
> INFO : Map 1: 107/107 *Reducer 2: 417(+1)/418* Reducer 3: 0(+418)/418



*Query With Group By:*

CREATE TEMPORARY TABLE ETL_TMP AS
> SELECT REGION
> ,HEADEND
> ,NETWORK
> ,RETAILUNITCODE
> ,LOGTIMEDATE
> ,SPOTKEY
> ,CRE_DT
> ,DROP_REASON
> ,ETL_ROW_ID
> FROM (
> SELECT SD.REGION
> ,SD.HEADEND
> ,SD.NETWORK
> ,SD.RETAILUNITCODE
> ,SD.LOGTIMEDATE
> ,SD.SPOTKEY
> ,SD.CRE_DT
> ,CASE
> WHEN SD.LOGTIMEDATE IS NULL
> THEN 'Y'
> ELSE 'N'
> END AS DROP_REASON
> ,ROW_NUMBER() OVER (
> ORDER BY NULL
> ) AS ETL_ROW_ID
> FROM INTER_ETL AS SD
> ) AS dt
> GROUP BY
>          REGION
> ,HEADEND
> ,NETWORK
> ,RETAILUNITCODE
> ,LOGTIMEDATE
> ,SPOTKEY
> ,CRE_DT
> ,DROP_REASON
> ,ETL_ROW_ID;


Logs:

INFO : Map 1: 818/818 Reducer 2: 417(+1)/418 Reducer 3: 0(+418)/418
> INFO : Map 1: 818/818 Reducer 2: 417(+1)/418 Reducer 3: 0(+418)/418
> INFO : Map 1: 818/818 Reducer 2: 417(+1)/418 Reducer 3: 0(+418)/418
> INFO : Map 1: 818/818 Reducer 2: 417(+1)/418 Reducer 3: 0(+418)/418
> INFO : Map 1: 818/818 *Reducer 2: 417(+1)/418* Reducer 3: 0(+418)/418


*Table details :*

Beeline > dfs -ls /apps/hive/warehouse/PRD_DB.db/INTER_ETL ;
+--------------------------------------------------------------------------------------------------------------------------------------------+--+
|                                                                 DFS
Output                                                                 |
+--------------------------------------------------------------------------------------------------------------------------------------------+--+
| Found 15 items
                                                                  |
| -rwxrwxrwx   3 Z56034 hdfs 2075244899 2016-06-28 10:24
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000000_0  |
| -rwxrwxrwx   3 Z56034 hdfs 2090030620 2016-06-28 10:24
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000001_0  |
| -rwxrwxrwx   3 Z56034 hdfs 2025516774 2016-06-28 10:24
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000002_0  |
| -rwxrwxrwx   3 Z56034 hdfs 1986848213 2016-06-28 10:24
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000003_0  |
| -rwxrwxrwx   3 Z56034 hdfs 2018883723 2016-06-28 10:24
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000004_0  |
| -rwxrwxrwx   3 Z56034 hdfs 1984690335 2016-06-28 10:24
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000005_0  |
| -rwxrwxrwx   3 Z56034 hdfs 1987494444 2016-06-28 10:24
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000006_0  |
| -rwxrwxrwx   3 Z56034 hdfs 1974674515 2016-06-28 10:24
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000007_0  |
| -rwxrwxrwx   3 Z56034 hdfs 1963720218 2016-06-28 10:24
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000008_0  |
| -rwxrwxrwx   3 Z56034 hdfs 1965892384 2016-06-28 10:24
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000009_0  |
| -rwxrwxrwx   3 Z56034 hdfs 1974272622 2016-06-28 10:24
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000010_0  |
| -rwxrwxrwx   3 Z56034 hdfs 1971948208 2016-06-28 10:24
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000011_0  |
| -rwxrwxrwx   3 Z56034 hdfs 1968141886 2016-06-28 10:24
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000012_0  |
| -rwxrwxrwx   3 Z56034 hdfs 1970930771 2016-06-28 10:24
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000013_0  |
| -rwxrwxrwx   3 Z56034 hdfs  192820628 2016-06-28 10:23
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000014_0  |
+--------------------------------------------------------------------------------------------------------------------------------------------+--+
 'numFiles'='15',

 'numRows'='108363614',



Regards
Sanjiv Singh
Mob :  +091 9990-447-339

Reply via email to