[ https://issues.apache.org/jira/browse/HIVE-17499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Johannes Mayer updated HIVE-17499: ---------------------------------- Description: The cube Operator returns duplicate rows, when it shouldnt. I ran the same query in Pig and got the correct result. (see the example below) {code:sql} insert overwrite table thesis.clickstream_export PARTITION (ds_year = '2016' , ds_month = '04' , ds_day = '01') select year(ds), month(ds), day(ds), c8, c11, count(*) from thesis.clickstream_landing where ds = '2016-04-01' group by year(ds), month(ds), day(ds), c8, c11 With Cube; {code} Then I check for duplicates: {code:sql} select year, month, day, country, city, count (*) from thesis.clickstream_export where ds_year = '2016' and ds_month = '04' and ds_day = '01' group by year, month, day, country, city having count(*) > 1; {code} The result is: year month day country city _c5 null null null null null 4 null null 1 null null 4 null 4 null null null 4 null 4 1 null null 4 2016 null null null null 4 2016 null 1 null null 4 2016 4 null null null 4 2016 4 1 null null 4 When i do the same thing in Pig, everything is fine: {code:none} DATA = LOAD 'thesis.clickstream_landing' USING org.apache.hive.hcatalog.pig.HCatLoader(); FILTERED = FOREACH DATA GENERATE GetYear(ToDate(ds, 'yyyy-MM-dd')) AS year, GetMonth(ToDate(ds, 'yyyy-MM-dd')) AS month, GetDay(ToDate(ds, 'yyyy-MM-dd')) AS day, c8 AS country, c11 AS city; CUBED = CUBE FILTERED BY CUBE(year, month, day, country, city); D = FOREACH CUBED GENERATE FLATTEN(group) AS (year, month, day, country, city), COUNT_STAR(cube) As click_count; STORE D INTO 'thesis.clickstream_export' USING org.apache.hive.hcatalog.pig.HCatStorer('ds_year=2016, ds_month=04, ds_day=02'); {code} Then again I check for duplicates: {code:sql} select year, month, day, country, city, count (*) from thesis.clickstream_export where ds_year = '2016' and ds_month = '04' and ds_day = '02' group by year, month, day, country, city having count(*) > 1; {code} And the result is empty as it should be. was: The cube Operator returns duplicate rows, when it shouldnt. I ran the same query in Pig and got the correct result. (see the example below) {code:sql} insert overwrite table thesis.clickstream_export PARTITION (ds_year = '2016' , ds_month = '04' , ds_day = '01') select year(ds), month(ds), day(ds), c8, c11, count(*) from thesis.clickstream_landing where ds = '2016-04-01' group by year(ds), month(ds), day(ds), c8, c11 With Cube; {code} Then I check for duplicates: {code:sql} select year, month, day, country, city, count (*) from thesis.clickstream_export where ds_year = '2016' and ds_month = '04' and ds_day = '01' group by year, month, day, country, city having count(*) > 1; {code} The result is: year month day country city _c5 null null null null null 4 null null 1 null null 4 null 4 null null null 4 null 4 1 null null 4 2016 null null null null 4 2016 null 1 null null 4 2016 4 null null null 4 2016 4 1 null null 4 When i do the same thing in Pig, everything is fine: {code:pig} DATA = LOAD 'thesis.clickstream_landing' USING org.apache.hive.hcatalog.pig.HCatLoader(); FILTERED = FOREACH DATA GENERATE GetYear(ToDate(ds, 'yyyy-MM-dd')) AS year, GetMonth(ToDate(ds, 'yyyy-MM-dd')) AS month, GetDay(ToDate(ds, 'yyyy-MM-dd')) AS day, c8 AS country, c11 AS city; CUBED = CUBE FILTERED BY CUBE(year, month, day, country, city); D = FOREACH CUBED GENERATE FLATTEN(group) AS (year, month, day, country, city), COUNT_STAR(cube) As click_count; STORE D INTO 'thesis.clickstream_export' USING org.apache.hive.hcatalog.pig.HCatStorer('ds_year=2016, ds_month=04, ds_day=02'); {code} Then again I check for duplicates: {code:sql} select year, month, day, country, city, count (*) from thesis.clickstream_export where ds_year = '2016' and ds_month = '04' and ds_day = '02' group by year, month, day, country, city having count(*) > 1; {code} And the result is empty as it should be. > Hive Cube Operator returns duplicate rows > ----------------------------------------- > > Key: HIVE-17499 > URL: https://issues.apache.org/jira/browse/HIVE-17499 > Project: Hive > Issue Type: Bug > Affects Versions: 2.1.0 > Environment: Hortonworks HDP 2.6.0.1 > Reporter: Johannes Mayer > Priority: Critical > > The cube Operator returns duplicate rows, when it shouldnt. I ran the same > query in Pig and got the correct result. (see the example below) > {code:sql} > insert overwrite table thesis.clickstream_export PARTITION (ds_year = '2016' > , ds_month = '04' , ds_day = '01') > select year(ds), month(ds), day(ds), c8, c11, count(*) > from thesis.clickstream_landing > where ds = '2016-04-01' > group by year(ds), month(ds), day(ds), c8, c11 > With Cube; > {code} > Then I check for duplicates: > {code:sql} > select year, month, day, country, city, count (*) from > thesis.clickstream_export > where ds_year = '2016' and ds_month = '04' and ds_day = '01' > group by year, month, day, country, city > having count(*) > 1; > {code} > The result is: > year month day country city _c5 > null null null null null 4 > null null 1 null null 4 > null 4 null null null 4 > null 4 1 null null 4 > 2016 null null null null 4 > 2016 null 1 null null 4 > 2016 4 null null null 4 > 2016 4 1 null null 4 > When i do the same thing in Pig, everything is fine: > {code:none} > DATA = LOAD 'thesis.clickstream_landing' USING > org.apache.hive.hcatalog.pig.HCatLoader(); > FILTERED = FOREACH DATA GENERATE GetYear(ToDate(ds, 'yyyy-MM-dd')) AS year, > GetMonth(ToDate(ds, 'yyyy-MM-dd')) AS month, GetDay(ToDate(ds, 'yyyy-MM-dd')) > AS day, c8 AS country, c11 AS city; > CUBED = CUBE FILTERED BY CUBE(year, month, day, country, city); > D = FOREACH CUBED GENERATE FLATTEN(group) AS (year, month, day, country, > city), COUNT_STAR(cube) As click_count; > STORE D INTO 'thesis.clickstream_export' USING > org.apache.hive.hcatalog.pig.HCatStorer('ds_year=2016, ds_month=04, > ds_day=02'); > {code} > Then again I check for duplicates: > {code:sql} > select year, month, day, country, city, count (*) from > thesis.clickstream_export > where ds_year = '2016' and ds_month = '04' and ds_day = '02' > group by year, month, day, country, city > having count(*) > 1; > {code} > And the result is empty as it should be. -- This message was sent by Atlassian JIRA (v6.4.14#64029)