[ https://issues.apache.org/jira/browse/HIVE-24245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chiran Ravani updated HIVE-24245: --------------------------------- Description: Vectorized PTF for count and distinct over partition is broken. It produces incorrect results. Below is the test case. {code} CREATE TABLE bigd781b_new ( id int, txt1 string, txt2 string, cda_date int, cda_job_name varchar(12)); INSERT INTO bigd781b_new VALUES (1,'2010005759','7164335675012038',20200528,'load1'), (2,'2010005759','7164335675012038',20200528,'load2'); {code} Running below query produces incorrect results {code} SELECT txt1, txt2, count(distinct txt1) over(partition by txt1) as n, count(distinct txt2) over(partition by txt2) as m FROM bigd781b_new {code} as below. {code} +-------------+-------------------+----+----+ | txt1 | txt2 | n | m | +-------------+-------------------+----+----+ | 2010005759 | 7164335675012038 | 2 | 2 | | 2010005759 | 7164335675012038 | 2 | 2 | +-------------+-------------------+----+----+ {code} While the correct output would be {code} +-------------+-------------------+----+----+ | txt1 | txt2 | n | m | +-------------+-------------------+----+----+ | 2010005759 | 7164335675012038 | 1 | 1 | | 2010005759 | 7164335675012038 | 1 | 1 | +-------------+-------------------+----+----+ {code} The problem does not appear after setting below property set hive.vectorized.execution.ptf.enabled=false; was: Vectorized PTF for count and distinct over partition is broken. It produces incorrect results. Below is the test case. {code} CREATE TABLE bigd781b_new ( id int, txt1 string, txt2 string, cda_date int, cda_job_name varchar(12)); INSERT INTO bigd781b_new VALUES (1,'2010005759','7164335675012038',20200528,'load1'), (2,'2010005759','7164335675012038',20200528,'load2'); {code} Running below query produces incorrect results {code} SELECT txt1, txt2, count(distinct txt1) over(partition by txt1) as n, count(distinct txt2) over(partition by txt2) as m FROM bigd781b_new WHERE cda_date = 20200528 and ( txt2 = '7164335675012038'); {code} as below. {code} +-------------+-------------------+----+----+ | txt1 | txt2 | n | m | +-------------+-------------------+----+----+ | 2010005759 | 7164335675012038 | 2 | 2 | | 2010005759 | 7164335675012038 | 2 | 2 | +-------------+-------------------+----+----+ {code} While the correct output would be {code} +-------------+-------------------+----+----+ | txt1 | txt2 | n | m | +-------------+-------------------+----+----+ | 2010005759 | 7164335675012038 | 1 | 1 | | 2010005759 | 7164335675012038 | 1 | 1 | +-------------+-------------------+----+----+ {code} The problem does not appear after setting below property set hive.vectorized.execution.ptf.enabled=false; > Vectorized PTF with count and distinct over partition producing incorrect > results. > ---------------------------------------------------------------------------------- > > Key: HIVE-24245 > URL: https://issues.apache.org/jira/browse/HIVE-24245 > Project: Hive > Issue Type: Bug > Components: Hive, PTF-Windowing, Vectorization > Affects Versions: 3.1.0, 3.1.2 > Reporter: Chiran Ravani > Priority: Critical > > Vectorized PTF for count and distinct over partition is broken. It produces > incorrect results. > Below is the test case. > {code} > CREATE TABLE bigd781b_new ( > id int, > txt1 string, > txt2 string, > cda_date int, > cda_job_name varchar(12)); > INSERT INTO bigd781b_new VALUES > (1,'2010005759','7164335675012038',20200528,'load1'), > (2,'2010005759','7164335675012038',20200528,'load2'); > {code} > Running below query produces incorrect results > {code} > SELECT > txt1, > txt2, > count(distinct txt1) over(partition by txt1) as n, > count(distinct txt2) over(partition by txt2) as m > FROM bigd781b_new > {code} > as below. > {code} > +-------------+-------------------+----+----+ > | txt1 | txt2 | n | m | > +-------------+-------------------+----+----+ > | 2010005759 | 7164335675012038 | 2 | 2 | > | 2010005759 | 7164335675012038 | 2 | 2 | > +-------------+-------------------+----+----+ > {code} > While the correct output would be > {code} > +-------------+-------------------+----+----+ > | txt1 | txt2 | n | m | > +-------------+-------------------+----+----+ > | 2010005759 | 7164335675012038 | 1 | 1 | > | 2010005759 | 7164335675012038 | 1 | 1 | > +-------------+-------------------+----+----+ > {code} > The problem does not appear after setting below property > set hive.vectorized.execution.ptf.enabled=false; -- This message was sent by Atlassian Jira (v8.3.4#803005)