gammacomputer opened a new issue, #19270:
URL: https://github.com/apache/druid/issues/19270
### Affected Version
36.0.0
### Description
A timeseries query with a non-ALL granularity (e.g., DAY) over an
UnnestDataSource effectively hangs when the query interval is very wide (e.g.,
ETERNITY). The equivalent groupBy query returns instantly.
This affects SQL queries using FLOOR(__time TO DAY) (or similar time floor
expressions like TIME_FLOOR or DATE_TRUNC probably) combined with CROSS JOIN
UNNEST, as the SQL planner converts these into timeseries queries with DAY
granularity.
An LLM found the following root cause for the issue (this should be
verified, I do not have enough knowledge to judge if it's correct):
CursorGranularizer.create() clips the query interval using
timeBoundaryInspector when available. For unnest datasources where the
inspector is absent or the segment boundaries are wide, the ETERNITY interval
is not clipped, causing granularity.getIterable() to generate billions of
day-sized buckets. The engine then iterates over all of them, even though
skipEmptyBuckets is set -- it still calls advanceToBucket() for each one.
The SQL planner's canUseQueryGranularity() check (DruidQuery.java) is meant
to prevent this for non-table datasources by rejecting ETERNITY intervals.
However, ExecutionVertex.isTableBased() traverses through UnnestDataSource down
to the underlying TableDataSource, returning true and bypassing the safety
check.
Reproduction
Submit this native query to /druid/v2 -- no datasource creation needed:
{
"queryType": "timeseries",
"dataSource": {
"type": "unnest",
"base": {
"type": "inline",
"columnNames": ["__time", "vals"],
"columnTypes": ["LONG", "ARRAY<LONG>"],
"rows": [
[1704067200000, [10, 20, 30]],
[1704153600000, [40, 50, 60]]
]
},
"virtualColumn": {
"type": "expression",
"name": "unnested",
"expression": "\"vals\"",
"outputType": "ARRAY<LONG>"
}
},
"intervals":
"-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z",
"granularity": "DAY",
"aggregations": [
{ "type": "count", "name": "cnt" }
],
"context": {
"skipEmptyBuckets": true,
"timeout": 10000
}
}
Expected: Returns results (2 day buckets, 3 rows each).
Actual: Times out.
Narrowing intervals to "2024-01-01/2024-01-03" returns instantly. The
equivalent groupBy query also returns instantly.
SQL Reproduction
-- This times out:
SELECT
FLOOR(__time TO DAY) AS aggregation,
AVG(CAST(statistic AS DOUBLE)) AS statistic
FROM (
SELECT __time, vals AS arr
FROM (
VALUES
(TIMESTAMP '2024-01-01', ARRAY[10, 20, 30]),
(TIMESTAMP '2024-01-02', ARRAY[40, 50, 60])
) AS t (__time, vals)
) CROSS JOIN UNNEST(arr) AS d (statistic)
GROUP BY FLOOR(__time TO DAY)
ORDER BY aggregation
-- This works instantly (no time floor, uses groupBy):
SELECT
__time AS aggregation,
AVG(CAST(statistic AS DOUBLE)) AS statistic
FROM (
SELECT __time, vals AS arr
FROM (
VALUES
(TIMESTAMP '2024-01-01', ARRAY[10, 20, 30]),
(TIMESTAMP '2024-01-02', ARRAY[40, 50, 60])
) AS t (__time, vals)
) CROSS JOIN UNNEST(arr) AS d (statistic)
GROUP BY __time
ORDER BY aggregation
The only difference is FLOOR(__time TO DAY) vs raw __time. The former
triggers a timeseries plan with DAY granularity; the latter triggers a groupBy
plan that completes immediately.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]