GWphua opened a new issue, #19344:
URL: https://github.com/apache/druid/issues/19344

   ### Affected Version
   
   If SQL Null Handling feature is turned on:
   - Default since v28
   - Since v17 with feature flag enabled
   
   ### Description
   I am currently running v27 image on my clusters. As part of process to 
upgrade to later versions, I tried to test out the SQL Null-Compatible feature. 
   
   For the case where my table does not have any null values, I come across a 
performance degradation.
   
   The following benchmark is conducted for a cluster with the same resource. 
It is found that the Historical + Peon querying latency is a contributor for 
the difference in querying speed.
   
   #### Cache Disabled
   
   | Description             | V27 (JDK17, Null Disabled) | V27 (JDK17, Null 
Enabled) | V37 (JDK17) | V37 (JDK21) |
   | ----------------------- | -------------------------: | 
------------------------: | ----------: | ----------: |
   | Select 1 Random Row     |                      5.455 |                     
5.142 |       5.416 |       5.414 |
   | Select Count (1h)       |                      5.446 |                     
5.154 |       5.428 |       5.469 |
   | Cardinality (1h Unique) |                      6.600 |                     
6.056 |       6.003 |       5.808 |
   | Time-Series (12h PT5M)  |                      5.592 |                     
5.623 |       7.769 |       5.552 |
   | Heavy Scan (24h Agg)    |                     14.957 |                    
16.325 |      19.847 |      11.830 |
   | TopN Ranking (24h)      |                     12.159 |                    
17.245 |      18.903 |      16.707 |
   
   #### Cache Enabled
   
   | Description             | V27 (JDK17, Null Disabled) | V27 (JDK17, Null 
Enabled) | V37 (JDK17) | V37 (JDK21) |
   | ----------------------- | -------------------------: | 
------------------------: | ----------: | ----------: |
   | Select 1 Random Row     |                      5.318 |                     
5.198 |       5.288 |       5.433 |
   | Select Count (1h)       |                      5.347 |                     
5.215 |       5.317 |       5.504 |
   | Cardinality (1h Unique) |                      7.795 |                     
6.119 |       6.086 |       5.844 |
   | Time-Series (12h PT5M)  |                      5.643 |                     
5.362 |       5.392 |       5.550 |
   | Heavy Scan (24h Agg)    |                      5.441 |                     
5.686 |       5.412 |       5.581 |
   | TopN Ranking (24h)      |                      5.628 |                     
6.170 |       6.076 |       6.237 |
   
   
   <details>
   <summary>SQL Statements</summary>
   
   #### Select 1 Random Row
   
   ```sql
   SELECT __time, id
   FROM "<TABLE_NAME>"
   ORDER BY __time DESC
   LIMIT 1;
   ```
   
   #### Select Count (1h)
   
   ```sql
   SELECT COUNT(*)
   FROM "<TABLE_NAME>"
   WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR;
   ```
   
   #### Cardinality (1h Unique)
   
   ```sql
   SELECT COUNT(DISTINCT id)
   FROM "<TABLE_NAME>"
   WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR;
   ```
   
   #### Time-Series (12h PT5M)
   
   ```sql
   SELECT TIME_FLOOR(__time, 'PT5M') AS bucket_5m, idc, SUM("count") AS 
total_count
   FROM "<TABLE_NAME>"
   WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '12' HOUR
   GROUP BY 1, 2;
   ```
   
   #### Heavy Scan (24h Agg)
   
   ```sql
   SELECT idc, SUM("count") AS total_count, SUM(sum_rtt) / SUM("count") AS 
avg_rtt
   FROM "<TABLE_NAME>"
   WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
   GROUP BY 1;
   ```
   
   #### TopN Ranking (24h)
   
   ```sql
   SELECT id, SUM(sum_rtt) AS total_sum_rtt
   FROM "<TABLE_NAME>"
   WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
   GROUP BY 1
   ORDER BY 2 DESC
   LIMIT 100;
   ```
   </details>
   
   ### Cause Analysis
   When enabling null handling, we will need an extra check to deduce whether 
the current row in the column is null. While the `isNull` method is cheap, it 
is called for every single row.
   
   
   ```java
   // NullableNumericTopNColumnAggregatesProcessor.java
   
   if (hasNulls && selector.isNull()) { // V27
   if (selector.isNull()) { // V37
     // REDACTED for brevity
   } else {
     Aggregator[] valueAggregates = getValueAggregators(query, selector, 
cursor);
     for (Aggregator aggregator : valueAggregates) {
       aggregator.aggregate();
     }
   }
   ```
   
   Asking AI to estimate the latency for each `isNull` method, for a long 
value, the execution time is 1~3ns. This may be fast, but since the check is 
run for every row, this resulted in substantial increase for querying time. 
   
   For example, the total number of rows in 1 day for the TopN / Heavy Scan 
query is 2,147,995,191. If we estimate each operation to take 1~3ns, this 
matches expectations of the query being 2~6s faster. 
   
   ### Suggestions
   Regarding suggestions for the fix, please verify if these are feasible:
   1. Store something in the metadata of segments to indicate which columns 
will not have any null values? We can then provide non-null implementations 
that fast-track to processing.
   2. Some work on the compaction side, prioritize compacting segments with 
null values together if possible?
   3. Other suggestions / feedback?
   
   Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to