SEUNGJIN KWON created IMPALA-14084:
--------------------------------------
Summary: Unexpected row count reduction when using UNION ALL with
CTEs in Impala 4.0.0
Key: IMPALA-14084
URL: https://issues.apache.org/jira/browse/IMPALA-14084
Project: IMPALA
Issue Type: Bug
Components: Frontend
Affects Versions: Impala 4.0.0
Reporter: SEUNGJIN KWON
Fix For: Impala 4.0.0
{*}Description{*}:
In Impala version 4.0.0, we are observing an unexpected reduction in the number
of rows returned when performing a {{UNION ALL}} operation between two Common
Table Expressions (CTEs). Specifically:
* {{CTE1}} returns approximately 70,000 rows.
* {{CTE2}} returns approximately 1,000,000 rows (with a {{LIMIT 1000000}}
clause to restrict the number of rows due to the large dataset).
When executing the following query:
{code:java}
-- create database if not exists as_adventure;
-- Step 1: Create Tables
-- Table 1: Main user activity logs
CREATE TABLE user_logs (
user_id STRING,
event_type STRING,
event_time TIMESTAMP
)
STORED AS PARQUET;-- Table 2: Known fraudulent events (as strings)
CREATE TABLE fraud_reference (
user_id STRING,
event_type STRING,
event_time_str STRING
)
STORED AS PARQUET;
-- Step 2: Insert Sample Data
-- Fraud cases
INSERT INTO fraud_reference VALUES
('user001', 'LOGIN', '2024/01/01 12:00:00'),
('user002', 'TRANSFER', '2024/01/02 13:30:00');-- User activity logs (both
fraud and non-fraud)
INSERT INTO user_logs VALUES
('user001', 'LOGIN', TIMESTAMP '2024-01-01 12:00:00'),
('user002', 'TRANSFER', TIMESTAMP '2024-01-02 13:30:00'),
('user003', 'LOGIN', TIMESTAMP '2024-01-03 08:15:00'),
('user004', 'TRANSFER', TIMESTAMP '2024-01-03 09:45:00'),
('user005', 'LOGIN', TIMESTAMP '2024-01-04 10:00:00'),
('user006', 'TRANSFER', TIMESTAMP '2024-01-05 11:00:00');
-- Step 3: Run the Problematic Query
WITH CTE1 AS (
SELECT t.*
FROM user_logs t
INNER JOIN (
SELECT
user_id,
CAST(event_type AS STRING) AS event_type,
TO_TIMESTAMP(event_time_str, 'yyyy/MM/dd HH:mm:ss') AS event_time
FROM fraud_reference
) fr
ON t.user_id = fr.user_id
AND t.event_type = fr.event_type
AND t.event_time = fr.event_time
),CTE2 AS (
SELECT t.*
FROM user_logs t
LEFT ANTI JOIN (
SELECT
user_id,
CAST(event_type AS STRING) AS event_type,
TO_TIMESTAMP(event_time_str, 'yyyy/MM/dd HH:mm:ss') AS event_time
FROM fraud_reference
) fr
ON t.user_id = fr.user_id
AND t.event_type = fr.event_type
AND t.event_time = fr.event_time
WHERE rand() < 0.9
LIMIT 1000000
)-- UNION ALL: causes unexpected row loss
SELECT *, 1 AS label FROM CTE1
UNION ALL
SELECT *, 0 AS label FROM CTE2;{code}
{{SELECT *, 1 AS label FROM CTE1}}
{{UNION ALL}}
{{SELECT *, 0 AS label FROM CTE2;}}
We expect to retrieve around 1,070,000 rows. However, the actual result
contains only about 600 rows.
{*}Additional Information{*}:
* Both {{CTE1}} and {{CTE2}} have identical schemas and column counts.
* Removing the {{LIMIT}} clause from {{CTE2}} does not resolve the issue.
* Executing the {{SELECT}} statements for {{CTE1}} and {{CTE2}} individually
returns the expected number of rows.
* This behavior is reminiscent of the issue described in IMPALA-10182, which
was reportedly fixed in version 4.0.0. However, we are experiencing similar
symptoms.
{*}Steps to Reproduce{*}:
# Define two CTEs ({{{}CTE1{}}} and {{{}CTE2{}}}) with identical schemas.
# Ensure {{CTE1}} returns ~70,000 rows and {{CTE2}} returns ~1,000,000 rows.
# Apply a {{LIMIT 1000000}} clause to {{{}CTE2{}}}.
# Execute a {{UNION ALL}} between the two CTEs, adding a distinguishing label
to each ({{{}1{}}} for {{{}CTE1{}}}, {{0}} for {{{}CTE2{}}}).
# Observe the total number of rows returned.
{*}Expected Result{*}:
Approximately 1,070,000 rows.
{*}Actual Result{*}:
Approximately 600 rows.
{*}Environment{*}:
* Impala Version: 4.0.0.7.1.9.3-4 RELEASE
* Execution Interface: [Specify if using Impala Shell, Hue, JDBC, etc.]
* Cluster Configuration: [Provide details if relevant]
{*}Attachments{*}:
* Query execution plans
* Sample data (if applicable)
* Logs or error messages (if any)
We would appreciate any insights into whether this is a regression of the
previously resolved issue or a new bug. Please let us know if further
information is required.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]