[
https://issues.apache.org/jira/browse/IMPALA-14084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
SEUNGJIN KWON updated IMPALA-14084:
-----------------------------------
Description:
{*}Description{*}:
In Impala version 4.0.0, we are observing an unexpected reduction in the number
of rows returned when performing a {{UNION ALL}} operation between two Common
Table Expressions (CTEs). Specifically:
* {{CTE1}} returns approximately 70,000 rows.
* {{CTE2}} returns approximately 1,000,000 rows (with a {{LIMIT 1000000}}
clause to restrict the number of rows due to the large dataset).
When executing the following query:
{code:java}
-- create database if not exists as_adventure;
-- Step 1: Create Tables
-- Table 1: Main user activity logs
CREATE TABLE user_logs (
user_id STRING,
event_type STRING,
event_time TIMESTAMP
)
STORED AS PARQUET;-- Table 2: Known fraudulent events (as strings)
CREATE TABLE fraud_reference (
user_id STRING,
event_type STRING,
event_time_str STRING
)
STORED AS PARQUET;
-- Step 2: Insert Sample Data
-- Fraud cases
INSERT INTO fraud_reference VALUES
('user001', 'LOGIN', '2024/01/01 12:00:00'),
('user002', 'TRANSFER', '2024/01/02 13:30:00');-- User activity logs (both
fraud and non-fraud)
INSERT INTO user_logs VALUES
('user001', 'LOGIN', TIMESTAMP '2024-01-01 12:00:00'),
('user002', 'TRANSFER', TIMESTAMP '2024-01-02 13:30:00'),
('user003', 'LOGIN', TIMESTAMP '2024-01-03 08:15:00'),
('user004', 'TRANSFER', TIMESTAMP '2024-01-03 09:45:00'),
('user005', 'LOGIN', TIMESTAMP '2024-01-04 10:00:00'),
('user006', 'TRANSFER', TIMESTAMP '2024-01-05 11:00:00');
-- Step 3: Run the Problematic Query
WITH CTE1 AS (
SELECT t.*
FROM user_logs t
INNER JOIN (
SELECT
user_id,
CAST(event_type AS STRING) AS event_type,
TO_TIMESTAMP(event_time_str, 'yyyy/MM/dd HH:mm:ss') AS event_time
FROM fraud_reference
) fr
ON t.user_id = fr.user_id
AND t.event_type = fr.event_type
AND t.event_time = fr.event_time
),CTE2 AS (
SELECT t.*
FROM user_logs t
LEFT ANTI JOIN (
SELECT
user_id,
CAST(event_type AS STRING) AS event_type,
TO_TIMESTAMP(event_time_str, 'yyyy/MM/dd HH:mm:ss') AS event_time
FROM fraud_reference
) fr
ON t.user_id = fr.user_id
AND t.event_type = fr.event_type
AND t.event_time = fr.event_time
WHERE rand() < 0.9
LIMIT 1000000
)-- UNION ALL: causes unexpected row loss
SELECT *, 1 AS label FROM CTE1
UNION ALL
SELECT *, 0 AS label FROM CTE2;{code}
{{SELECT *, 1 AS label FROM CTE1}}
{{UNION ALL}}
{{SELECT *, 0 AS label FROM CTE2;}}
We expect to retrieve around 1,070,000 rows. However, the actual result
contains only about 1,000,600 rows.
{*}Additional Information{*}:
* Both {{CTE1}} and {{CTE2}} have identical schemas and column counts.
* Removing the {{LIMIT}} clause from {{CTE2}} does not resolve the issue.
* Executing the {{SELECT}} statements for {{CTE1}} and {{CTE2}} individually
returns the expected number of rows.
* This behavior is reminiscent of the issue described in IMPALA-10182, which
was reportedly fixed in version 4.0.0. However, we are experiencing similar
symptoms.
{*}Steps to Reproduce{*}:
# Define two CTEs ({{{}CTE1{}}} and {{{}CTE2{}}}) with identical schemas.
# Ensure {{CTE1}} returns ~70,000 rows and {{CTE2}} returns ~1,000,000 rows.
# Apply a {{LIMIT 1000000}} clause to {{{}CTE2{}}}.
# Execute a {{UNION ALL}} between the two CTEs, adding a distinguishing label
to each ({{{}1{}}} for {{{}CTE1{}}}, {{0}} for {{{}CTE2{}}}).
# Observe the total number of rows returned.
{*}Expected Result{*}:
Approximately 1,070,000 rows.
{*}Actual Result{*}:
* CTE2 correctly returns 1,000,000 rows
* {{CTE1}} unexpectedly returns *only ~600 rows*
* The final {{UNION ALL}} output contains {*}~1,000,600 rows{*}, which is
*missing ~69,400 rows* from the CTE1
{*}Environment{*}:
* Impala Version: 4.0.0.7.1.9.3-4 RELEASE
* Execution Interface: [Specify if using Impala Shell, Hue, JDBC, etc.]
* Cluster Configuration: [Provide details if relevant]
I would appreciate any insights into whether this is a regression of the
previously resolved issue or a new bug. Please let me know if further
information is required.
was:
{*}Description{*}:
In Impala version 4.0.0, we are observing an unexpected reduction in the number
of rows returned when performing a {{UNION ALL}} operation between two Common
Table Expressions (CTEs). Specifically:
* {{CTE1}} returns approximately 70,000 rows.
* {{CTE2}} returns approximately 1,000,000 rows (with a {{LIMIT 1000000}}
clause to restrict the number of rows due to the large dataset).
When executing the following query:
{code:java}
-- create database if not exists as_adventure;
-- Step 1: Create Tables
-- Table 1: Main user activity logs
CREATE TABLE user_logs (
user_id STRING,
event_type STRING,
event_time TIMESTAMP
)
STORED AS PARQUET;-- Table 2: Known fraudulent events (as strings)
CREATE TABLE fraud_reference (
user_id STRING,
event_type STRING,
event_time_str STRING
)
STORED AS PARQUET;
-- Step 2: Insert Sample Data
-- Fraud cases
INSERT INTO fraud_reference VALUES
('user001', 'LOGIN', '2024/01/01 12:00:00'),
('user002', 'TRANSFER', '2024/01/02 13:30:00');-- User activity logs (both
fraud and non-fraud)
INSERT INTO user_logs VALUES
('user001', 'LOGIN', TIMESTAMP '2024-01-01 12:00:00'),
('user002', 'TRANSFER', TIMESTAMP '2024-01-02 13:30:00'),
('user003', 'LOGIN', TIMESTAMP '2024-01-03 08:15:00'),
('user004', 'TRANSFER', TIMESTAMP '2024-01-03 09:45:00'),
('user005', 'LOGIN', TIMESTAMP '2024-01-04 10:00:00'),
('user006', 'TRANSFER', TIMESTAMP '2024-01-05 11:00:00');
-- Step 3: Run the Problematic Query
WITH CTE1 AS (
SELECT t.*
FROM user_logs t
INNER JOIN (
SELECT
user_id,
CAST(event_type AS STRING) AS event_type,
TO_TIMESTAMP(event_time_str, 'yyyy/MM/dd HH:mm:ss') AS event_time
FROM fraud_reference
) fr
ON t.user_id = fr.user_id
AND t.event_type = fr.event_type
AND t.event_time = fr.event_time
),CTE2 AS (
SELECT t.*
FROM user_logs t
LEFT ANTI JOIN (
SELECT
user_id,
CAST(event_type AS STRING) AS event_type,
TO_TIMESTAMP(event_time_str, 'yyyy/MM/dd HH:mm:ss') AS event_time
FROM fraud_reference
) fr
ON t.user_id = fr.user_id
AND t.event_type = fr.event_type
AND t.event_time = fr.event_time
WHERE rand() < 0.9
LIMIT 1000000
)-- UNION ALL: causes unexpected row loss
SELECT *, 1 AS label FROM CTE1
UNION ALL
SELECT *, 0 AS label FROM CTE2;{code}
{{SELECT *, 1 AS label FROM CTE1}}
{{UNION ALL}}
{{SELECT *, 0 AS label FROM CTE2;}}
We expect to retrieve around 1,070,000 rows. However, the actual result
contains only about 1,000,600 rows.
{*}Additional Information{*}:
* Both {{CTE1}} and {{CTE2}} have identical schemas and column counts.
* Removing the {{LIMIT}} clause from {{CTE2}} does not resolve the issue.
* Executing the {{SELECT}} statements for {{CTE1}} and {{CTE2}} individually
returns the expected number of rows.
* This behavior is reminiscent of the issue described in IMPALA-10182, which
was reportedly fixed in version 4.0.0. However, we are experiencing similar
symptoms.
{*}Steps to Reproduce{*}:
# Define two CTEs ({{{}CTE1{}}} and {{{}CTE2{}}}) with identical schemas.
# Ensure {{CTE1}} returns ~70,000 rows and {{CTE2}} returns ~1,000,000 rows.
# Apply a {{LIMIT 1000000}} clause to {{{}CTE2{}}}.
# Execute a {{UNION ALL}} between the two CTEs, adding a distinguishing label
to each ({{{}1{}}} for {{{}CTE1{}}}, {{0}} for {{{}CTE2{}}}).
# Observe the total number of rows returned.
{*}Expected Result{*}:
Approximately 1,070,000 rows.
{*}Actual Result{*}:
Approximately 1,000,600 rows.
{*}Environment{*}:
* Impala Version: 4.0.0.7.1.9.3-4 RELEASE
* Execution Interface: [Specify if using Impala Shell, Hue, JDBC, etc.]
* Cluster Configuration: [Provide details if relevant]
I would appreciate any insights into whether this is a regression of the
previously resolved issue or a new bug. Please let me know if further
information is required.
> Unexpected row count reduction when using UNION ALL with CTEs in Impala 4.0.0
> -----------------------------------------------------------------------------
>
> Key: IMPALA-14084
> URL: https://issues.apache.org/jira/browse/IMPALA-14084
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 4.0.0
> Reporter: SEUNGJIN KWON
> Priority: Blocker
> Fix For: Impala 4.0.0
>
>
> {*}Description{*}:
> In Impala version 4.0.0, we are observing an unexpected reduction in the
> number of rows returned when performing a {{UNION ALL}} operation between two
> Common Table Expressions (CTEs). Specifically:
> * {{CTE1}} returns approximately 70,000 rows.
> * {{CTE2}} returns approximately 1,000,000 rows (with a {{LIMIT 1000000}}
> clause to restrict the number of rows due to the large dataset).
> When executing the following query:
>
> {code:java}
> -- create database if not exists as_adventure;
> -- Step 1: Create Tables
> -- Table 1: Main user activity logs
> CREATE TABLE user_logs (
> user_id STRING,
> event_type STRING,
> event_time TIMESTAMP
> )
> STORED AS PARQUET;-- Table 2: Known fraudulent events (as strings)
> CREATE TABLE fraud_reference (
> user_id STRING,
> event_type STRING,
> event_time_str STRING
> )
> STORED AS PARQUET;
> -- Step 2: Insert Sample Data
> -- Fraud cases
> INSERT INTO fraud_reference VALUES
> ('user001', 'LOGIN', '2024/01/01 12:00:00'),
> ('user002', 'TRANSFER', '2024/01/02 13:30:00');-- User activity logs (both
> fraud and non-fraud)
> INSERT INTO user_logs VALUES
> ('user001', 'LOGIN', TIMESTAMP '2024-01-01 12:00:00'),
> ('user002', 'TRANSFER', TIMESTAMP '2024-01-02 13:30:00'),
> ('user003', 'LOGIN', TIMESTAMP '2024-01-03 08:15:00'),
> ('user004', 'TRANSFER', TIMESTAMP '2024-01-03 09:45:00'),
> ('user005', 'LOGIN', TIMESTAMP '2024-01-04 10:00:00'),
> ('user006', 'TRANSFER', TIMESTAMP '2024-01-05 11:00:00');
> -- Step 3: Run the Problematic Query
> WITH CTE1 AS (
> SELECT t.*
> FROM user_logs t
> INNER JOIN (
> SELECT
> user_id,
> CAST(event_type AS STRING) AS event_type,
> TO_TIMESTAMP(event_time_str, 'yyyy/MM/dd HH:mm:ss') AS event_time
> FROM fraud_reference
> ) fr
> ON t.user_id = fr.user_id
> AND t.event_type = fr.event_type
> AND t.event_time = fr.event_time
> ),CTE2 AS (
> SELECT t.*
> FROM user_logs t
> LEFT ANTI JOIN (
> SELECT
> user_id,
> CAST(event_type AS STRING) AS event_type,
> TO_TIMESTAMP(event_time_str, 'yyyy/MM/dd HH:mm:ss') AS event_time
> FROM fraud_reference
> ) fr
> ON t.user_id = fr.user_id
> AND t.event_type = fr.event_type
> AND t.event_time = fr.event_time
> WHERE rand() < 0.9
> LIMIT 1000000
> )-- UNION ALL: causes unexpected row loss
> SELECT *, 1 AS label FROM CTE1
> UNION ALL
> SELECT *, 0 AS label FROM CTE2;{code}
>
> {{SELECT *, 1 AS label FROM CTE1}}
> {{UNION ALL}}
> {{SELECT *, 0 AS label FROM CTE2;}}
> We expect to retrieve around 1,070,000 rows. However, the actual result
> contains only about 1,000,600 rows.
> {*}Additional Information{*}:
> * Both {{CTE1}} and {{CTE2}} have identical schemas and column counts.
> * Removing the {{LIMIT}} clause from {{CTE2}} does not resolve the issue.
> * Executing the {{SELECT}} statements for {{CTE1}} and {{CTE2}} individually
> returns the expected number of rows.
> * This behavior is reminiscent of the issue described in IMPALA-10182, which
> was reportedly fixed in version 4.0.0. However, we are experiencing similar
> symptoms.
> {*}Steps to Reproduce{*}:
> # Define two CTEs ({{{}CTE1{}}} and {{{}CTE2{}}}) with identical schemas.
> # Ensure {{CTE1}} returns ~70,000 rows and {{CTE2}} returns ~1,000,000 rows.
> # Apply a {{LIMIT 1000000}} clause to {{{}CTE2{}}}.
> # Execute a {{UNION ALL}} between the two CTEs, adding a distinguishing
> label to each ({{{}1{}}} for {{{}CTE1{}}}, {{0}} for {{{}CTE2{}}}).
> # Observe the total number of rows returned.
> {*}Expected Result{*}:
> Approximately 1,070,000 rows.
> {*}Actual Result{*}:
> * CTE2 correctly returns 1,000,000 rows
> * {{CTE1}} unexpectedly returns *only ~600 rows*
> * The final {{UNION ALL}} output contains {*}~1,000,600 rows{*}, which is
> *missing ~69,400 rows* from the CTE1
> {*}Environment{*}:
> * Impala Version: 4.0.0.7.1.9.3-4 RELEASE
> * Execution Interface: [Specify if using Impala Shell, Hue, JDBC, etc.]
> * Cluster Configuration: [Provide details if relevant]
> I would appreciate any insights into whether this is a regression of the
> previously resolved issue or a new bug. Please let me know if further
> information is required.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]