[jira] [Updated] (IMPALA-14084) Unexpected row count reduction when using UNION ALL with CTEs in Impala 4.0.0

SEUNGJIN KWON (Jira) Mon, 19 May 2025 19:09:24 -0700


     [ 
https://issues.apache.org/jira/browse/IMPALA-14084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


SEUNGJIN KWON updated IMPALA-14084:
-----------------------------------
    Description: 
{*}Description{*}:
In Impala version 4.0.0, we are observing an unexpected reduction in the number 
of rows returned when performing a {{UNION ALL}} operation between two Common 
Table Expressions (CTEs). Specifically:
 * {{CTE1}} returns approximately 70,000 rows.

 * {{CTE2}} returns approximately 1,000,000 rows (with a {{LIMIT 1000000}} 
clause to restrict the number of rows due to the large dataset).

When executing the following query:
 
{code:java}
-- create database if not exists as_adventure;

-- Step 1: Create Tables
-- Table 1: Main user activity logs
CREATE TABLE user_logs (
  user_id STRING,
  event_type STRING,
  event_time TIMESTAMP
)
STORED AS PARQUET;-- Table 2: Known fraudulent events (as strings)
CREATE TABLE fraud_reference (
  user_id STRING,
  event_type STRING,
  event_time_str STRING
)
STORED AS PARQUET; 

-- Step 2: Insert Sample Data
-- Fraud cases
INSERT INTO fraud_reference VALUES
('user001', 'LOGIN', '2024/01/01 12:00:00'),
('user002', 'TRANSFER', '2024/01/02 13:30:00');-- User activity logs (both 
fraud and non-fraud)
INSERT INTO user_logs VALUES
('user001', 'LOGIN', TIMESTAMP '2024-01-01 12:00:00'),
('user002', 'TRANSFER', TIMESTAMP '2024-01-02 13:30:00'),
('user003', 'LOGIN', TIMESTAMP '2024-01-03 08:15:00'),
('user004', 'TRANSFER', TIMESTAMP '2024-01-03 09:45:00'),
('user005', 'LOGIN', TIMESTAMP '2024-01-04 10:00:00'),
('user006', 'TRANSFER', TIMESTAMP '2024-01-05 11:00:00');


-- Step 3: Run the Problematic Query
WITH CTE1 AS (
  SELECT t.*
  FROM user_logs t
  INNER JOIN (
    SELECT
      user_id,
      CAST(event_type AS STRING) AS event_type,
      TO_TIMESTAMP(event_time_str, 'yyyy/MM/dd HH:mm:ss') AS event_time
    FROM fraud_reference
  ) fr
  ON t.user_id = fr.user_id
     AND t.event_type = fr.event_type
     AND t.event_time = fr.event_time
),CTE2 AS (
  SELECT t.*
  FROM user_logs t
  LEFT ANTI JOIN (
    SELECT
      user_id,
      CAST(event_type AS STRING) AS event_type,
      TO_TIMESTAMP(event_time_str, 'yyyy/MM/dd HH:mm:ss') AS event_time
    FROM fraud_reference
  ) fr
  ON t.user_id = fr.user_id
     AND t.event_type = fr.event_type
     AND t.event_time = fr.event_time
  WHERE rand() < 0.9
  LIMIT 1000000
)-- UNION ALL: causes unexpected row loss

SELECT *, 1 AS label FROM CTE1
UNION ALL
SELECT *, 0 AS label FROM CTE2;{code}
 
{{SELECT *, 1 AS label FROM CTE1}}
{{UNION ALL}}
{{SELECT *, 0 AS label FROM CTE2;}}

We expect to retrieve around 1,070,000 rows. However, the actual result 
contains only about 1,000,600 rows. 

{*}Additional Information{*}:
 * Both {{CTE1}} and {{CTE2}} have identical schemas and column counts.

 * Removing the {{LIMIT}} clause from {{CTE2}} does not resolve the issue.

 * Executing the {{SELECT}} statements for {{CTE1}} and {{CTE2}} individually 
returns the expected number of rows.

 * This behavior is reminiscent of the issue described in IMPALA-10182, which 
was reportedly fixed in version 4.0.0. However, we are experiencing similar 
symptoms.

{*}Steps to Reproduce{*}:
 # Define two CTEs ({{{}CTE1{}}} and {{{}CTE2{}}}) with identical schemas.

 # Ensure {{CTE1}} returns ~70,000 rows and {{CTE2}} returns ~1,000,000 rows.

 # Apply a {{LIMIT 1000000}} clause to {{{}CTE2{}}}.

 # Execute a {{UNION ALL}} between the two CTEs, adding a distinguishing label 
to each ({{{}1{}}} for {{{}CTE1{}}}, {{0}} for {{{}CTE2{}}}).

 # Observe the total number of rows returned.

{*}Expected Result{*}:
Approximately 1,070,000 rows.

{*}Actual Result{*}:
 * CTE2 correctly returns 1,000,000 rows

 * {{CTE1}} unexpectedly returns *only ~600 rows*

 * The final {{UNION ALL}} output contains {*}~1,000,600 rows{*}, which is 
*missing ~69,400 rows* from the CTE1

{*}Environment{*}:
 * Impala Version: 4.0.0.7.1.9.3-4 RELEASE

 * Execution Interface: [Specify if using Impala Shell, Hue, JDBC, etc.]
 * Cluster Configuration: [Provide details if relevant]

I would appreciate any insights into whether this is a regression of the 
previously resolved issue or a new bug. Please let me know if further 
information is required.

  was:
{*}Description{*}:
In Impala version 4.0.0, we are observing an unexpected reduction in the number 
of rows returned when performing a {{UNION ALL}} operation between two Common 
Table Expressions (CTEs). Specifically:
 * {{CTE1}} returns approximately 70,000 rows.

 * {{CTE2}} returns approximately 1,000,000 rows (with a {{LIMIT 1000000}} 
clause to restrict the number of rows due to the large dataset).

When executing the following query:
 
{code:java}
-- create database if not exists as_adventure;

-- Step 1: Create Tables
-- Table 1: Main user activity logs
CREATE TABLE user_logs (
  user_id STRING,
  event_type STRING,
  event_time TIMESTAMP
)
STORED AS PARQUET;-- Table 2: Known fraudulent events (as strings)
CREATE TABLE fraud_reference (
  user_id STRING,
  event_type STRING,
  event_time_str STRING
)
STORED AS PARQUET; 

-- Step 2: Insert Sample Data
-- Fraud cases
INSERT INTO fraud_reference VALUES
('user001', 'LOGIN', '2024/01/01 12:00:00'),
('user002', 'TRANSFER', '2024/01/02 13:30:00');-- User activity logs (both 
fraud and non-fraud)
INSERT INTO user_logs VALUES
('user001', 'LOGIN', TIMESTAMP '2024-01-01 12:00:00'),
('user002', 'TRANSFER', TIMESTAMP '2024-01-02 13:30:00'),
('user003', 'LOGIN', TIMESTAMP '2024-01-03 08:15:00'),
('user004', 'TRANSFER', TIMESTAMP '2024-01-03 09:45:00'),
('user005', 'LOGIN', TIMESTAMP '2024-01-04 10:00:00'),
('user006', 'TRANSFER', TIMESTAMP '2024-01-05 11:00:00');


-- Step 3: Run the Problematic Query
WITH CTE1 AS (
  SELECT t.*
  FROM user_logs t
  INNER JOIN (
    SELECT
      user_id,
      CAST(event_type AS STRING) AS event_type,
      TO_TIMESTAMP(event_time_str, 'yyyy/MM/dd HH:mm:ss') AS event_time
    FROM fraud_reference
  ) fr
  ON t.user_id = fr.user_id
     AND t.event_type = fr.event_type
     AND t.event_time = fr.event_time
),CTE2 AS (
  SELECT t.*
  FROM user_logs t
  LEFT ANTI JOIN (
    SELECT
      user_id,
      CAST(event_type AS STRING) AS event_type,
      TO_TIMESTAMP(event_time_str, 'yyyy/MM/dd HH:mm:ss') AS event_time
    FROM fraud_reference
  ) fr
  ON t.user_id = fr.user_id
     AND t.event_type = fr.event_type
     AND t.event_time = fr.event_time
  WHERE rand() < 0.9
  LIMIT 1000000
)-- UNION ALL: causes unexpected row loss

SELECT *, 1 AS label FROM CTE1
UNION ALL
SELECT *, 0 AS label FROM CTE2;{code}
 
{{SELECT *, 1 AS label FROM CTE1}}
{{UNION ALL}}
{{SELECT *, 0 AS label FROM CTE2;}}

We expect to retrieve around 1,070,000 rows. However, the actual result 
contains only about 1,000,600 rows. 

{*}Additional Information{*}:
 * Both {{CTE1}} and {{CTE2}} have identical schemas and column counts.

 * Removing the {{LIMIT}} clause from {{CTE2}} does not resolve the issue.

 * Executing the {{SELECT}} statements for {{CTE1}} and {{CTE2}} individually 
returns the expected number of rows.

 * This behavior is reminiscent of the issue described in IMPALA-10182, which 
was reportedly fixed in version 4.0.0. However, we are experiencing similar 
symptoms.

{*}Steps to Reproduce{*}:
 # Define two CTEs ({{{}CTE1{}}} and {{{}CTE2{}}}) with identical schemas.

 # Ensure {{CTE1}} returns ~70,000 rows and {{CTE2}} returns ~1,000,000 rows.

 # Apply a {{LIMIT 1000000}} clause to {{{}CTE2{}}}.

 # Execute a {{UNION ALL}} between the two CTEs, adding a distinguishing label 
to each ({{{}1{}}} for {{{}CTE1{}}}, {{0}} for {{{}CTE2{}}}).

 # Observe the total number of rows returned.

{*}Expected Result{*}:
Approximately 1,070,000 rows.

{*}Actual Result{*}:
Approximately 1,000,600 rows.

{*}Environment{*}:
 * Impala Version: 4.0.0.7.1.9.3-4 RELEASE

 * Execution Interface: [Specify if using Impala Shell, Hue, JDBC, etc.]
 * Cluster Configuration: [Provide details if relevant]

I would appreciate any insights into whether this is a regression of the 
previously resolved issue or a new bug. Please let me know if further 
information is required.


> Unexpected row count reduction when using UNION ALL with CTEs in Impala 4.0.0
> -----------------------------------------------------------------------------
>
>                 Key: IMPALA-14084
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14084
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 4.0.0
>            Reporter: SEUNGJIN KWON
>            Priority: Blocker
>             Fix For: Impala 4.0.0
>
>
> {*}Description{*}:
> In Impala version 4.0.0, we are observing an unexpected reduction in the 
> number of rows returned when performing a {{UNION ALL}} operation between two 
> Common Table Expressions (CTEs). Specifically:
>  * {{CTE1}} returns approximately 70,000 rows.
>  * {{CTE2}} returns approximately 1,000,000 rows (with a {{LIMIT 1000000}} 
> clause to restrict the number of rows due to the large dataset).
> When executing the following query:
>  
> {code:java}
> -- create database if not exists as_adventure;
> -- Step 1: Create Tables
> -- Table 1: Main user activity logs
> CREATE TABLE user_logs (
>   user_id STRING,
>   event_type STRING,
>   event_time TIMESTAMP
> )
> STORED AS PARQUET;-- Table 2: Known fraudulent events (as strings)
> CREATE TABLE fraud_reference (
>   user_id STRING,
>   event_type STRING,
>   event_time_str STRING
> )
> STORED AS PARQUET; 
> -- Step 2: Insert Sample Data
> -- Fraud cases
> INSERT INTO fraud_reference VALUES
> ('user001', 'LOGIN', '2024/01/01 12:00:00'),
> ('user002', 'TRANSFER', '2024/01/02 13:30:00');-- User activity logs (both 
> fraud and non-fraud)
> INSERT INTO user_logs VALUES
> ('user001', 'LOGIN', TIMESTAMP '2024-01-01 12:00:00'),
> ('user002', 'TRANSFER', TIMESTAMP '2024-01-02 13:30:00'),
> ('user003', 'LOGIN', TIMESTAMP '2024-01-03 08:15:00'),
> ('user004', 'TRANSFER', TIMESTAMP '2024-01-03 09:45:00'),
> ('user005', 'LOGIN', TIMESTAMP '2024-01-04 10:00:00'),
> ('user006', 'TRANSFER', TIMESTAMP '2024-01-05 11:00:00');
> -- Step 3: Run the Problematic Query
> WITH CTE1 AS (
>   SELECT t.*
>   FROM user_logs t
>   INNER JOIN (
>     SELECT
>       user_id,
>       CAST(event_type AS STRING) AS event_type,
>       TO_TIMESTAMP(event_time_str, 'yyyy/MM/dd HH:mm:ss') AS event_time
>     FROM fraud_reference
>   ) fr
>   ON t.user_id = fr.user_id
>      AND t.event_type = fr.event_type
>      AND t.event_time = fr.event_time
> ),CTE2 AS (
>   SELECT t.*
>   FROM user_logs t
>   LEFT ANTI JOIN (
>     SELECT
>       user_id,
>       CAST(event_type AS STRING) AS event_type,
>       TO_TIMESTAMP(event_time_str, 'yyyy/MM/dd HH:mm:ss') AS event_time
>     FROM fraud_reference
>   ) fr
>   ON t.user_id = fr.user_id
>      AND t.event_type = fr.event_type
>      AND t.event_time = fr.event_time
>   WHERE rand() < 0.9
>   LIMIT 1000000
> )-- UNION ALL: causes unexpected row loss
> SELECT *, 1 AS label FROM CTE1
> UNION ALL
> SELECT *, 0 AS label FROM CTE2;{code}
>  
> {{SELECT *, 1 AS label FROM CTE1}}
> {{UNION ALL}}
> {{SELECT *, 0 AS label FROM CTE2;}}
> We expect to retrieve around 1,070,000 rows. However, the actual result 
> contains only about 1,000,600 rows. 
> {*}Additional Information{*}:
>  * Both {{CTE1}} and {{CTE2}} have identical schemas and column counts.
>  * Removing the {{LIMIT}} clause from {{CTE2}} does not resolve the issue.
>  * Executing the {{SELECT}} statements for {{CTE1}} and {{CTE2}} individually 
> returns the expected number of rows.
>  * This behavior is reminiscent of the issue described in IMPALA-10182, which 
> was reportedly fixed in version 4.0.0. However, we are experiencing similar 
> symptoms.
> {*}Steps to Reproduce{*}:
>  # Define two CTEs ({{{}CTE1{}}} and {{{}CTE2{}}}) with identical schemas.
>  # Ensure {{CTE1}} returns ~70,000 rows and {{CTE2}} returns ~1,000,000 rows.
>  # Apply a {{LIMIT 1000000}} clause to {{{}CTE2{}}}.
>  # Execute a {{UNION ALL}} between the two CTEs, adding a distinguishing 
> label to each ({{{}1{}}} for {{{}CTE1{}}}, {{0}} for {{{}CTE2{}}}).
>  # Observe the total number of rows returned.
> {*}Expected Result{*}:
> Approximately 1,070,000 rows.
> {*}Actual Result{*}:
>  * CTE2 correctly returns 1,000,000 rows
>  * {{CTE1}} unexpectedly returns *only ~600 rows*
>  * The final {{UNION ALL}} output contains {*}~1,000,600 rows{*}, which is 
> *missing ~69,400 rows* from the CTE1
> {*}Environment{*}:
>  * Impala Version: 4.0.0.7.1.9.3-4 RELEASE
>  * Execution Interface: [Specify if using Impala Shell, Hue, JDBC, etc.]
>  * Cluster Configuration: [Provide details if relevant]
> I would appreciate any insights into whether this is a regression of the 
> previously resolved issue or a new bug. Please let me know if further 
> information is required.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (IMPALA-14084) Unexpected row count reduction when using UNION ALL with CTEs in Impala 4.0.0

Reply via email to