SEUNGJIN KWON created IMPALA-14084:
--------------------------------------

             Summary: Unexpected row count reduction when using UNION ALL with 
CTEs in Impala 4.0.0
                 Key: IMPALA-14084
                 URL: https://issues.apache.org/jira/browse/IMPALA-14084
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
    Affects Versions: Impala 4.0.0
            Reporter: SEUNGJIN KWON
             Fix For: Impala 4.0.0


{*}Description{*}:
In Impala version 4.0.0, we are observing an unexpected reduction in the number 
of rows returned when performing a {{UNION ALL}} operation between two Common 
Table Expressions (CTEs). Specifically:
 * {{CTE1}} returns approximately 70,000 rows.

 * {{CTE2}} returns approximately 1,000,000 rows (with a {{LIMIT 1000000}} 
clause to restrict the number of rows due to the large dataset).

When executing the following query:
 
{code:java}
-- create database if not exists as_adventure;

-- Step 1: Create Tables
-- Table 1: Main user activity logs
CREATE TABLE user_logs (
  user_id STRING,
  event_type STRING,
  event_time TIMESTAMP
)
STORED AS PARQUET;-- Table 2: Known fraudulent events (as strings)
CREATE TABLE fraud_reference (
  user_id STRING,
  event_type STRING,
  event_time_str STRING
)
STORED AS PARQUET; 

-- Step 2: Insert Sample Data
-- Fraud cases
INSERT INTO fraud_reference VALUES
('user001', 'LOGIN', '2024/01/01 12:00:00'),
('user002', 'TRANSFER', '2024/01/02 13:30:00');-- User activity logs (both 
fraud and non-fraud)
INSERT INTO user_logs VALUES
('user001', 'LOGIN', TIMESTAMP '2024-01-01 12:00:00'),
('user002', 'TRANSFER', TIMESTAMP '2024-01-02 13:30:00'),
('user003', 'LOGIN', TIMESTAMP '2024-01-03 08:15:00'),
('user004', 'TRANSFER', TIMESTAMP '2024-01-03 09:45:00'),
('user005', 'LOGIN', TIMESTAMP '2024-01-04 10:00:00'),
('user006', 'TRANSFER', TIMESTAMP '2024-01-05 11:00:00');


-- Step 3: Run the Problematic Query
WITH CTE1 AS (
  SELECT t.*
  FROM user_logs t
  INNER JOIN (
    SELECT
      user_id,
      CAST(event_type AS STRING) AS event_type,
      TO_TIMESTAMP(event_time_str, 'yyyy/MM/dd HH:mm:ss') AS event_time
    FROM fraud_reference
  ) fr
  ON t.user_id = fr.user_id
     AND t.event_type = fr.event_type
     AND t.event_time = fr.event_time
),CTE2 AS (
  SELECT t.*
  FROM user_logs t
  LEFT ANTI JOIN (
    SELECT
      user_id,
      CAST(event_type AS STRING) AS event_type,
      TO_TIMESTAMP(event_time_str, 'yyyy/MM/dd HH:mm:ss') AS event_time
    FROM fraud_reference
  ) fr
  ON t.user_id = fr.user_id
     AND t.event_type = fr.event_type
     AND t.event_time = fr.event_time
  WHERE rand() < 0.9
  LIMIT 1000000
)-- UNION ALL: causes unexpected row loss
SELECT *, 1 AS label FROM CTE1
UNION ALL
SELECT *, 0 AS label FROM CTE2;{code}
 
{{SELECT *, 1 AS label FROM CTE1}}
{{UNION ALL}}
{{SELECT *, 0 AS label FROM CTE2;}}
We expect to retrieve around 1,070,000 rows. However, the actual result 
contains only about 600 rows.

{*}Additional Information{*}:
 * Both {{CTE1}} and {{CTE2}} have identical schemas and column counts.

 * Removing the {{LIMIT}} clause from {{CTE2}} does not resolve the issue.

 * Executing the {{SELECT}} statements for {{CTE1}} and {{CTE2}} individually 
returns the expected number of rows.

 * This behavior is reminiscent of the issue described in IMPALA-10182, which 
was reportedly fixed in version 4.0.0. However, we are experiencing similar 
symptoms.

{*}Steps to Reproduce{*}:
 # Define two CTEs ({{{}CTE1{}}} and {{{}CTE2{}}}) with identical schemas.

 # Ensure {{CTE1}} returns ~70,000 rows and {{CTE2}} returns ~1,000,000 rows.

 # Apply a {{LIMIT 1000000}} clause to {{{}CTE2{}}}.

 # Execute a {{UNION ALL}} between the two CTEs, adding a distinguishing label 
to each ({{{}1{}}} for {{{}CTE1{}}}, {{0}} for {{{}CTE2{}}}).

 # Observe the total number of rows returned.

{*}Expected Result{*}:
Approximately 1,070,000 rows.

{*}Actual Result{*}:
Approximately 600 rows.

{*}Environment{*}:
 * Impala Version: 4.0.0.7.1.9.3-4 RELEASE

 * Execution Interface: [Specify if using Impala Shell, Hue, JDBC, etc.]

 * Cluster Configuration: [Provide details if relevant]

{*}Attachments{*}:
 * Query execution plans

 * Sample data (if applicable)

 * Logs or error messages (if any)

We would appreciate any insights into whether this is a regression of the 
previously resolved issue or a new bug. Please let us know if further 
information is required.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to