Re: [PR] feat: clustered segment benchmark and fix (druid)

via GitHub Tue, 23 Jun 2026 18:15:06 -0700


github-advanced-security[bot] commented on code in PR #19623:
URL: https://github.com/apache/druid/pull/19623#discussion_r3463931375



##########
benchmarks/src/test/java/org/apache/druid/benchmark/query/SqlBenchmarkDatasets.java:
##########
@@ -319,9 +328,113 @@
 
   public static BenchmarkSchema getSchema(String dataset)
   {
+    if (dataset.startsWith(CLUSTERING + "_")) {
+      return makeClusteringSchema(dataset);
+    }
     return DATASET_SCHEMAS.get(dataset);
   }
 
+  /**
+   * Builds the structured datasource (table) name for a clustering benchmark dataset, encoding the segment layout and
+   * the clustering parameters so {@link #getSchema(String)} can reconstruct the {@link BenchmarkSchema} on demand. The
+   * encoded params keep the shared benchmark setup + static-map architecture intact while still allowing the
+   * clustering cardinality and number of clustering columns to be swept as JMH parameters.
+   * <p>
+   * {@code layout} is one of {@code CLUSTERED}, {@code UNCLUSTERED}, or {@code TIME_ORDERED} (see
+   * {@link #makeClusteringSchema}).
+   */
+  public static String clusteringDatasource(String layout, int clusteringCardinality, int numClusteringColumns)
+  {
+    return StringUtils.format(
+        "%s_%s_c%d_k%d",
+        CLUSTERING,
+        layout,
+        clusteringCardinality,
+        numClusteringColumns
+    );
+  }
+
+  /**
+   * Builds a clustering {@link BenchmarkSchema} from a datasource name produced by
+   * {@link #clusteringDatasource(String, int, int)}. The same generated data is laid out three ways for comparison:
+   * <ul>
+   *   <li>{@code CLUSTERED}: V10 clustered base-table segment, partitioned into per-clustering-value groups, ordered
+   *       clustering columns &rarr; {@code __time} &rarr; non-clustering columns (the intended real-world layout);</li>
+   *   <li>{@code UNCLUSTERED}: V10 regular segment with the <em>same</em> clustering-first ordering but no clustered
+   *       grouping &mdash; isolates the cost/benefit of the clustered grouping mechanism at a fixed sort order;</li>
+   *   <li>{@code TIME_ORDERED}: regular (V9) segment with the production-style {@code __time}-first ordering
+   *       ({@code __time} &rarr; clustering columns &rarr; non-clustering columns) &mdash; the "what we run today"
+   *       baseline the clustered layout is being compared against.</li>
+   * </ul>
+   */
+  private static BenchmarkSchema makeClusteringSchema(String dataset)
+  {
+    // Encoded as clustering_<LAYOUT>_c<card>_k<k>. The layout may itself contain underscores (e.g. TIME_ORDERED), so
+    // parse the trailing c<card>/k<k> positionally and treat everything between the prefix and them as the layout.
+    final String[] parts = dataset.split("_");
+    if (parts.length < 4) {
+      throw new IAE("Malformed clustering datasource[%s]", dataset);
+    }
+    final int numClusteringColumns = Integer.parseInt(parts[parts.length - 1].substring(1));

Review Comment:
   ## CodeQL / Missing catch of NumberFormatException
   
   Potential uncaught 'java.lang.NumberFormatException'.
   
   [Show more details](https://github.com/apache/druid/security/code-scanning/11320)
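
One way to address this alert is to validate each trailing token and convert a `NumberFormatException` into the same "malformed datasource" error the method already throws for short names. The following is a minimal standalone sketch, not the PR's code: the class `ClusteringDatasourceParser`, the holder `Parsed`, and the helper `parseTagged` are hypothetical names, and plain `IllegalArgumentException` stands in for Druid's `IAE` so the example compiles without Druid on the classpath.

```java
import java.util.Arrays;

// Hypothetical standalone sketch; the names below are not part of the PR.
final class ClusteringDatasourceParser
{
  /** Hypothetical holder for the decoded pieces of a clustering datasource name. */
  static final class Parsed
  {
    final String layout;
    final int clusteringCardinality;
    final int numClusteringColumns;

    Parsed(String layout, int clusteringCardinality, int numClusteringColumns)
    {
      this.layout = layout;
      this.clusteringCardinality = clusteringCardinality;
      this.numClusteringColumns = numClusteringColumns;
    }
  }

  /** Parses names of the form clustering_<LAYOUT>_c<card>_k<k>. */
  static Parsed parse(String dataset)
  {
    final String[] parts = dataset.split("_");
    if (parts.length < 4) {
      throw new IllegalArgumentException("Malformed clustering datasource[" + dataset + "]");
    }
    // The last two tokens are positional; everything between the prefix and them is the layout.
    final int k = parseTagged(parts[parts.length - 1], 'k', dataset);
    final int card = parseTagged(parts[parts.length - 2], 'c', dataset);
    final String layout = String.join("_", Arrays.copyOfRange(parts, 1, parts.length - 2));
    return new Parsed(layout, card, k);
  }

  /** Checks the one-character tag, then parses the remainder as an int. */
  private static int parseTagged(String token, char tag, String dataset)
  {
    if (token.length() < 2 || token.charAt(0) != tag) {
      throw new IllegalArgumentException(
          "Malformed clustering datasource[" + dataset + "]: expected token starting with '" + tag + "'"
      );
    }
    try {
      return Integer.parseInt(token.substring(1));
    }
    catch (NumberFormatException e) {
      // Rethrow with the full datasource name so the failure is easy to trace.
      throw new IllegalArgumentException(
          "Malformed clustering datasource[" + dataset + "]: non-numeric value in [" + token + "]",
          e
      );
    }
  }
}
```

Under these assumptions, `ClusteringDatasourceParser.parse("clustering_TIME_ORDERED_c100_k2")` yields the layout `TIME_ORDERED` with cardinality 100 and 2 clustering columns, while a name such as `clustering_CLUSTERED_cX_k1` fails with a message that includes the offending datasource name. Since `NumberFormatException` already extends `IllegalArgumentException`, the wrapping mainly improves the error message rather than changing which callers can catch it.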



##########
benchmarks/src/test/java/org/apache/druid/benchmark/query/SqlBenchmarkDatasets.java:
##########
@@ -319,9 +328,113 @@
 
   public static BenchmarkSchema getSchema(String dataset)
   {
+    if (dataset.startsWith(CLUSTERING + "_")) {
+      return makeClusteringSchema(dataset);
+    }
     return DATASET_SCHEMAS.get(dataset);
   }
 
+  /**
+   * Builds the structured datasource (table) name for a clustering benchmark dataset, encoding the segment layout and
+   * the clustering parameters so {@link #getSchema(String)} can reconstruct the {@link BenchmarkSchema} on demand. The
+   * encoded params keep the shared benchmark setup + static-map architecture intact while still allowing the
+   * clustering cardinality and number of clustering columns to be swept as JMH parameters.
+   * <p>
+   * {@code layout} is one of {@code CLUSTERED}, {@code UNCLUSTERED}, or {@code TIME_ORDERED} (see
+   * {@link #makeClusteringSchema}).
+   */
+  public static String clusteringDatasource(String layout, int clusteringCardinality, int numClusteringColumns)
+  {
+    return StringUtils.format(
+        "%s_%s_c%d_k%d",
+        CLUSTERING,
+        layout,
+        clusteringCardinality,
+        numClusteringColumns
+    );
+  }
+
+  /**
+   * Builds a clustering {@link BenchmarkSchema} from a datasource name produced by
+   * {@link #clusteringDatasource(String, int, int)}. The same generated data is laid out three ways for comparison:
+   * <ul>
+   *   <li>{@code CLUSTERED}: V10 clustered base-table segment, partitioned into per-clustering-value groups, ordered
+   *       clustering columns &rarr; {@code __time} &rarr; non-clustering columns (the intended real-world layout);</li>
+   *   <li>{@code UNCLUSTERED}: V10 regular segment with the <em>same</em> clustering-first ordering but no clustered
+   *       grouping &mdash; isolates the cost/benefit of the clustered grouping mechanism at a fixed sort order;</li>
+   *   <li>{@code TIME_ORDERED}: regular (V9) segment with the production-style {@code __time}-first ordering
+   *       ({@code __time} &rarr; clustering columns &rarr; non-clustering columns) &mdash; the "what we run today"
+   *       baseline the clustered layout is being compared against.</li>
+   * </ul>
+   */
+  private static BenchmarkSchema makeClusteringSchema(String dataset)
+  {
+    // Encoded as clustering_<LAYOUT>_c<card>_k<k>. The layout may itself contain underscores (e.g. TIME_ORDERED), so
+    // parse the trailing c<card>/k<k> positionally and treat everything between the prefix and them as the layout.
+    final String[] parts = dataset.split("_");
+    if (parts.length < 4) {
+      throw new IAE("Malformed clustering datasource[%s]", dataset);
+    }
+    final int numClusteringColumns = Integer.parseInt(parts[parts.length - 1].substring(1));
+    final int clusteringCardinality = Integer.parseInt(parts[parts.length - 2].substring(1));

Review Comment:
   ## CodeQL / Missing catch of NumberFormatException
   
   Potential uncaught 'java.lang.NumberFormatException'.
   
   [Show more details](https://github.com/apache/druid/security/code-scanning/11321)



##########
processing/src/main/java/org/apache/druid/segment/projections/SingleGroupClusteringColumnSelectorFactory.java:
##########
@@ -0,0 +1,172 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.projections;
+
+import org.apache.druid.error.DruidException;
+import org.apache.druid.math.expr.Evals;
+import org.apache.druid.math.expr.ExprEval;
+import org.apache.druid.math.expr.ExpressionType;
+import org.apache.druid.query.dimension.DimensionSpec;
+import org.apache.druid.query.monomorphicprocessing.RuntimeShapeInspector;
+import org.apache.druid.segment.ColumnSelectorFactory;
+import org.apache.druid.segment.ColumnValueSelector;
+import org.apache.druid.segment.ConstantExprEvalSelector;
+import org.apache.druid.segment.DimensionSelector;
+import org.apache.druid.segment.RowIdSupplier;
+import org.apache.druid.segment.column.ColumnCapabilities;
+import org.apache.druid.segment.column.ColumnCapabilitiesImpl;
+import org.apache.druid.segment.column.ColumnType;
+import org.apache.druid.segment.column.RowSignature;
+import org.apache.druid.segment.column.ValueType;
+
+import javax.annotation.Nullable;
+
+/**
+ * Single-cluster-group counterpart of {@link ClusteringColumnSelectorFactory}, used when a query prunes a clustered
+ * {@link org.apache.druid.segment.QueryableIndexCursorFactory}). Non-clustering columns are delegated directly to
+ * the per-group factory; clustering columns, which are not physically stored in the per-group data, are surfaced as
+ * constants from the group's clustering tuple.
+ */
+public class SingleGroupClusteringColumnSelectorFactory implements ColumnSelectorFactory
+{
+  private final ColumnSelectorFactory delegate;
+  private final RowSignature clusteringColumns;
+  private final Object[] clusteringValues;
+
+  public SingleGroupClusteringColumnSelectorFactory(
+      ColumnSelectorFactory delegate,
+      RowSignature clusteringColumns,
+      Object[] clusteringValues
+  )
+  {
+    if (clusteringValues == null || clusteringValues.length != clusteringColumns.size()) {
+      throw DruidException.defensive(
+          "clusteringValues length [%s] must match clusteringColumns size [%s]",
+          clusteringValues == null ? "null" : clusteringValues.length,
+          clusteringColumns.size()
+      );
+    }
+    this.delegate = delegate;
+    this.clusteringColumns = clusteringColumns;
+    this.clusteringValues = clusteringValues;
+  }
+
+  @Override
+  public DimensionSelector makeDimensionSelector(DimensionSpec dimensionSpec)
+  {
+    final int idx = clusteringColumns.indexOf(dimensionSpec.getDimension());
+    if (idx < 0) {
+      return delegate.makeDimensionSelector(dimensionSpec);
+    }
+    final Object raw = clusteringValues[idx];
+    return DimensionSelector.constant(Evals.asString(raw), dimensionSpec.getExtractionFn());

Review Comment:
   ## CodeQL / Deprecated method or constructor invocation
   
   Invoking [DimensionSpec.getExtractionFn](1) should be avoided because it has been deprecated.
   
   [Show more details](https://github.com/apache/druid/security/code-scanning/11319)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


