Illya Yalovyy created HIVE-11583:
------------------------------------
Summary: When a PTF is used over large partitions, the result can be
corrupted
Key: HIVE-11583
URL: https://issues.apache.org/jira/browse/HIVE-11583
Project: Hive
Issue Type: Bug
Components: PTF-Windowing
Affects Versions: 1.2.1, 1.2.0, 1.0.0, 0.13.1, 0.14.0, 0.14.1
Environment: Hadoop 2.6 + Apache hive built from trunk
Reporter: Illya Yalovyy
Priority: Critical
Dataset:
The window has 50001 records (2 blocks on disk and 1 block in memory).
The second block is larger than 32 MB, so it spans 2 splits.
Result:
When the last block is read back from disk, only the first split is actually loaded;
the second split is missed. The total record count of the result dataset is correct,
but some records are missing and others are duplicated.
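The failure mode above can be sketched abstractly: a block that spans multiple splits is read back as if it always fit in one split, so trailing splits are silently dropped. The class and method names below are hypothetical illustrations of that pattern, not Hive's actual PTF row-container code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a spilled block whose records are stored
// across several splits on disk (names are illustrative only).
public class SplitReadSketch {

    // Buggy read-back: assumes every block fits in one split,
    // so any trailing splits of the block are silently skipped.
    static List<Integer> readBuggy(List<List<Integer>> splitsOfBlock) {
        List<Integer> out = new ArrayList<>();
        out.addAll(splitsOfBlock.get(0)); // only the first split is loaded
        return out;
    }

    // Correct read-back: iterate over every split of the block.
    static List<Integer> readAll(List<List<Integer>> splitsOfBlock) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> split : splitsOfBlock) {
            out.addAll(split);
        }
        return out;
    }

    public static void main(String[] args) {
        // A block of 4 records stored as 2 splits of 2 records each.
        List<List<Integer>> splits = List.of(List.of(1, 2), List.of(3, 4));
        System.out.println(readBuggy(splits).size()); // 2: second split missed
        System.out.println(readAll(splits).size());   // 4: all records read
    }
}
```

Because the in-memory block and the first on-disk splits are re-served in place of the missing records, the total row count can still match while per-group counts drift, matching the symptom reported below.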
Example:
{code:sql}
CREATE TABLE ptf_big_src (
id INT,
key STRING,
grp STRING,
value STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '../../data/files/ptf_3blocks.txt.gz' OVERWRITE INTO
TABLE ptf_big_src;
SELECT grp, COUNT(1) cnt FROM ptf_big_src GROUP BY grp ORDER BY cnt DESC;
--
-- A 25000
-- B 20000
-- C 5001
--
CREATE TABLE ptf_big_trg AS SELECT *, row_number() OVER (PARTITION BY key ORDER
BY grp) grp_num FROM ptf_big_src;
SELECT grp, COUNT(1) cnt FROM ptf_big_trg GROUP BY grp ORDER BY cnt DESC;
--
-- A 34296
-- B 15704
-- C 1
--
{code}
The counts by 'grp' are incorrect!
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)