gammacomputer opened a new issue, #19270:
URL: https://github.com/apache/druid/issues/19270

   
   
   ### Affected Version
   
   36.0.0
   
   ### Description
   
   A timeseries query with a non-ALL granularity (e.g., DAY) over an 
UnnestDataSource effectively hangs when the query interval is very wide (e.g., 
ETERNITY). The equivalent groupBy query returns instantly.
   
   This affects SQL queries using FLOOR(__time TO DAY) (or similar time floor 
expressions like TIME_FLOOR or DATE_TRUNC probably) combined with CROSS JOIN 
UNNEST, as the SQL planner converts these into timeseries queries with DAY 
granularity.
   
   An LLM found the following root cause for the issue (this should be 
verified, I do not have enough knowledge to judge if it's correct):
   
   CursorGranularizer.create() clips the query interval using 
timeBoundaryInspector when available. For unnest datasources where the 
inspector is absent or the segment boundaries are wide, the ETERNITY interval
      is not clipped, causing granularity.getIterable() to generate billions of 
day-sized buckets. The engine then iterates over all of them, even though 
skipEmptyBuckets is set -- it still calls advanceToBucket() for each one.
   
   The SQL planner's canUseQueryGranularity() check (DruidQuery.java) is meant 
to prevent this for non-table datasources by rejecting ETERNITY intervals. 
However, ExecutionVertex.isTableBased() traverses through UnnestDataSource down 
to the underlying TableDataSource, returning true and bypassing the safety 
check.
   
   Reproduction
   
   Submit this native query to /druid/v2 -- no datasource creation needed:
   
     {
       "queryType": "timeseries",
       "dataSource": {
         "type": "unnest",
         "base": {
           "type": "inline",
           "columnNames": ["__time", "vals"],
           "columnTypes": ["LONG", "ARRAY<LONG>"],
           "rows": [
             [1704067200000, [10, 20, 30]],
             [1704153600000, [40, 50, 60]]
           ]
         },
         "virtualColumn": {
           "type": "expression",
           "name": "unnested",
           "expression": "\"vals\"",
           "outputType": "ARRAY<LONG>"
         }
       },
       "intervals": 
"-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z",
       "granularity": "DAY",
       "aggregations": [
         { "type": "count", "name": "cnt" }
       ],
       "context": {
         "skipEmptyBuckets": true,
         "timeout": 10000
       }
     }
   
   Expected: Returns results (2 day buckets, 3 rows each).
   Actual: Times out.
   
   Narrowing intervals to "2024-01-01/2024-01-03" returns instantly. The 
equivalent groupBy query also returns instantly.
   
   SQL Reproduction
   
     -- This times out:
     SELECT
       FLOOR(__time TO DAY) AS aggregation,
       AVG(CAST(statistic AS DOUBLE)) AS statistic
     FROM (
         SELECT __time, vals AS arr
         FROM (
           VALUES
             (TIMESTAMP '2024-01-01', ARRAY[10, 20, 30]),
             (TIMESTAMP '2024-01-02', ARRAY[40, 50, 60])
         ) AS t (__time, vals)
     ) CROSS JOIN UNNEST(arr) AS d (statistic)
     GROUP BY FLOOR(__time TO DAY)
     ORDER BY aggregation
   
     -- This works instantly (no time floor, uses groupBy):
     SELECT
       __time AS aggregation,
       AVG(CAST(statistic AS DOUBLE)) AS statistic
     FROM (
         SELECT __time, vals AS arr
         FROM (
           VALUES
             (TIMESTAMP '2024-01-01', ARRAY[10, 20, 30]),
             (TIMESTAMP '2024-01-02', ARRAY[40, 50, 60])
         ) AS t (__time, vals)
     ) CROSS JOIN UNNEST(arr) AS d (statistic)
     GROUP BY __time
     ORDER BY aggregation
   
   The only difference is FLOOR(__time TO DAY) vs raw __time. The former 
triggers a timeseries plan with DAY granularity; the latter triggers a groupBy 
plan that completes immediately.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to