Re: [PR] Add Kafka ingestion support for subset partitions [pinot]

via GitHub Mon, 23 Feb 2026 07:37:54 -0800


Copilot commented on code in PR #17587:
URL: https://github.com/apache/pinot/pull/17587#discussion_r2841534901



##########
pinot-plugins/pinot-stream-ingestion/pinot-kafka-base/src/main/java/org/apache/pinot/plugin/stream/kafka/KafkaPartitionSubsetUtils.java:
##########
@@ -0,0 +1,79 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.plugin.stream.kafka;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import javax.annotation.Nullable;
+import org.apache.commons.lang3.StringUtils;
+
+
+/**
+ * Utilities for parsing and validating Kafka partition subset configuration
+ * (stream.kafka.partition.ids) from stream config.
+ */
+public final class KafkaPartitionSubsetUtils {
+
+  private KafkaPartitionSubsetUtils() {
+  }
+
+  /**
+   * Reads the optional comma-separated partition ID list from the stream config map.
+   * Returns a sorted, deduplicated list for stable ordering when used for partition group metadata.
+   * Duplicate IDs in the config are silently removed to avoid duplicate processing of the same partition.
+   *
+   * @param streamConfigMap table stream config map (e.g. from
+   *                        {@link org.apache.pinot.spi.stream.StreamConfig#getStreamConfigsMap()})
+   * @return Sorted list of unique partition IDs when stream.kafka.partition.ids is set and non-empty;
+   *         null when not set or blank
+   * @throws IllegalArgumentException if the value contains invalid (non-integer) entries
+   */
+  @Nullable
+  public static List<Integer> getPartitionIdsFromConfig(Map<String, String> streamConfigMap) {
+    String key = KafkaStreamConfigProperties.constructStreamProperty(KafkaStreamConfigProperties.PARTITION_IDS);
+    String value = streamConfigMap.get(key);
+    if (StringUtils.isBlank(value)) {
+      return null;
+    }
+    String[] parts = value.split(",");
+    Set<Integer> idSet = new HashSet<>(parts.length);
+    for (String part : parts) {
+      String trimmed = part.trim();
+      if (trimmed.isEmpty()) {
+        continue;
+      }
+      try {
+        idSet.add(Integer.parseInt(trimmed));
+      } catch (NumberFormatException e) {
+        throw new IllegalArgumentException(
+            "Invalid " + key + " value: expected comma-separated integers, got '" + value + "'", e);
+      }
+    }
+    if (idSet.isEmpty()) {
+      return null;
+    }
+    List<Integer> ids = new ArrayList<>(idSet);
+    Collections.sort(ids);
+    return ids;
+  }

Review Comment:
   The method accepts negative partition IDs without validation. Kafka 
partition IDs are always non-negative integers. Consider adding validation to 
reject negative values and throw an IllegalArgumentException with a clear 
message, such as "Partition IDs must be non-negative". This will provide early 
feedback if a user mistakenly configures negative values like "-1,0,1".
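
A minimal sketch of the suggested check, written as a standalone class so it compiles without Pinot on the classpath. The class name `PartitionIdParsingSketch` and the method `parsePartitionIds` are hypothetical and not part of the PR; the sketch takes the raw config value directly instead of the stream config map, and uses a `TreeSet` in place of the `HashSet` plus `Collections.sort` in the diff.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical standalone sketch; not the PR's KafkaPartitionSubsetUtils.
final class PartitionIdParsingSketch {
  private PartitionIdParsingSketch() {
  }

  /** Parses a comma-separated ID list; returns null when the value is blank or holds no IDs. */
  static List<Integer> parsePartitionIds(String value) {
    if (value == null || value.trim().isEmpty()) {
      return null;
    }
    // TreeSet removes duplicates and keeps the IDs in ascending order.
    TreeSet<Integer> ids = new TreeSet<>();
    for (String part : value.split(",")) {
      String trimmed = part.trim();
      if (trimmed.isEmpty()) {
        continue;
      }
      int id;
      try {
        id = Integer.parseInt(trimmed);
      } catch (NumberFormatException e) {
        throw new IllegalArgumentException(
            "Expected comma-separated integers, got '" + value + "'", e);
      }
      // The extra check proposed in the review comment: reject negative IDs early.
      if (id < 0) {
        throw new IllegalArgumentException(
            "Partition IDs must be non-negative, got " + id + " in '" + value + "'");
      }
      ids.add(id);
    }
    return ids.isEmpty() ? null : new ArrayList<>(ids);
  }
}
```

With this shape, a value such as "-1,0,1" fails at parse time with a message naming the offending ID, rather than surfacing later during consumption.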



##########
pinot-plugins/pinot-stream-ingestion/pinot-kafka-4.0/src/main/java/org/apache/pinot/plugin/stream/kafka40/KafkaStreamMetadataProvider.java:
##########
@@ -58,18 +64,54 @@ public class KafkaStreamMetadataProvider extends KafkaPartitionLevelConnectionHa
     implements StreamMetadataProvider {
 
   private static final Logger LOGGER = LoggerFactory.getLogger(KafkaStreamMetadataProvider.class);
+  /** Whether this table consumes only a subset of topic partitions (from stream.kafka.partition.ids). */
+  private final boolean _partialPartitions;
+  /**
+   * Immutable partition ID subset from table config. Read once at construction; does not change during the
+   * provider's lifetime. To change the subset, update the table config and restart the consumer.
+   */
+  private final List<Integer> _partitionIdSubset;
 
   public KafkaStreamMetadataProvider(String clientId, StreamConfig streamConfig) {
     this(clientId, streamConfig, Integer.MIN_VALUE);
   }
 
   public KafkaStreamMetadataProvider(String clientId, StreamConfig streamConfig, int partition) {
     super(clientId, streamConfig, partition);
+    List<Integer> subset =
+        KafkaPartitionSubsetUtils.getPartitionIdsFromConfig(_config.getStreamConfigMap());
+    if (subset != null) {
+      // The partition subset comes from the table config and is expected to remain stable until config update.
+      _partialPartitions = true;
+      _partitionIdSubset = Collections.unmodifiableList(subset);
+    } else {
+      _partialPartitions = false;
+      _partitionIdSubset = Collections.emptyList();
+    }
+  }

Review Comment:
   The PR description mentions "validate configured IDs against topic 
metadata", but the implementation doesn't validate that the configured 
partition IDs actually exist in the Kafka topic. While implicit validation will 
occur when attempting to consume (resulting in runtime errors), early 
validation during metadata provider construction could provide clearer 
feedback. Consider adding validation that fetches topic partition metadata and 
verifies all configured partition IDs exist, or update the PR description if 
this validation was intentionally deferred to runtime.
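
One possible shape for the early validation, as a sketch only. It assumes the topic's partition IDs have already been fetched from topic metadata (for example from the consumer's `partitionsFor` result that appears elsewhere in this diff) and are passed in as a plain set, so the example has no Kafka dependency. The class name `PartitionSubsetValidationSketch`, the method `validateSubset`, and the choice of `IllegalArgumentException` are hypothetical and not taken from the PR.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical standalone sketch; not part of the PR.
final class PartitionSubsetValidationSketch {
  private PartitionSubsetValidationSketch() {
  }

  /**
   * Throws if any configured partition ID is absent from the topic.
   *
   * @param topic topic name, used only in the error message
   * @param configuredIds IDs parsed from stream.kafka.partition.ids
   * @param topicPartitionIds IDs reported by topic metadata (assumed to be fetched by the caller)
   */
  static void validateSubset(String topic, List<Integer> configuredIds, Set<Integer> topicPartitionIds) {
    // Collect the configured IDs that the topic does not have, sorted for a stable message.
    TreeSet<Integer> missing = new TreeSet<>(configuredIds);
    missing.removeAll(topicPartitionIds);
    if (!missing.isEmpty()) {
      throw new IllegalArgumentException(
          "Configured partition IDs " + missing + " do not exist in topic " + topic
              + "; available partition IDs: " + new TreeSet<>(topicPartitionIds));
    }
  }
}
```

Whether such a check belongs in the constructor is a design choice: it would require a metadata fetch at construction time, which can fail for reasons unrelated to the subset config.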



##########
pinot-plugins/pinot-stream-ingestion/pinot-kafka-3.0/src/main/java/org/apache/pinot/plugin/stream/kafka30/KafkaStreamMetadataProvider.java:
##########
@@ -58,18 +64,54 @@ public class KafkaStreamMetadataProvider extends KafkaPartitionLevelConnectionHa
     implements StreamMetadataProvider {
 
   private static final Logger LOGGER = LoggerFactory.getLogger(KafkaStreamMetadataProvider.class);
+  /** Whether this table consumes only a subset of topic partitions (from stream.kafka.partition.ids). */
+  private final boolean _partialPartitions;
+  /**
+   * Immutable partition ID subset from table config. Read once at construction; does not change during the
+   * provider's lifetime. To change the subset, update the table config and restart the consumer.
+   */
+  private final List<Integer> _partitionIdSubset;
 
   public KafkaStreamMetadataProvider(String clientId, StreamConfig streamConfig) {
     this(clientId, streamConfig, Integer.MIN_VALUE);
   }
 
   public KafkaStreamMetadataProvider(String clientId, StreamConfig streamConfig, int partition) {
     super(clientId, streamConfig, partition);
+    List<Integer> subset =
+        KafkaPartitionSubsetUtils.getPartitionIdsFromConfig(_config.getStreamConfigMap());
+    if (subset != null) {
+      // The partition subset comes from the table config and is expected to remain stable until config update.
+      _partialPartitions = true;
+      _partitionIdSubset = Collections.unmodifiableList(subset);
+    } else {
+      _partialPartitions = false;
+      _partitionIdSubset = Collections.emptyList();
+    }
+  }

Review Comment:
   The PR description mentions "validate configured IDs against topic 
metadata", but the implementation doesn't validate that the configured 
partition IDs actually exist in the Kafka topic. While implicit validation will 
occur when attempting to consume (resulting in runtime errors), early 
validation during metadata provider construction could provide clearer 
feedback. Consider adding validation that fetches topic partition metadata and 
verifies all configured partition IDs exist, or update the PR description if 
this validation was intentionally deferred to runtime.



##########
pinot-plugins/pinot-stream-ingestion/pinot-kafka-3.0/src/main/java/org/apache/pinot/plugin/stream/kafka30/KafkaStreamMetadataProvider.java:
##########
@@ -58,18 +64,54 @@ public class KafkaStreamMetadataProvider extends KafkaPartitionLevelConnectionHa
     implements StreamMetadataProvider {
 
   private static final Logger LOGGER = LoggerFactory.getLogger(KafkaStreamMetadataProvider.class);
+  /** Whether this table consumes only a subset of topic partitions (from stream.kafka.partition.ids). */
+  private final boolean _partialPartitions;
+  /**
+   * Immutable partition ID subset from table config. Read once at construction; does not change during the
+   * provider's lifetime. To change the subset, update the table config and restart the consumer.
+   */
+  private final List<Integer> _partitionIdSubset;
 
   public KafkaStreamMetadataProvider(String clientId, StreamConfig streamConfig) {
     this(clientId, streamConfig, Integer.MIN_VALUE);
   }
 
   public KafkaStreamMetadataProvider(String clientId, StreamConfig streamConfig, int partition) {
     super(clientId, streamConfig, partition);
+    List<Integer> subset =
+        KafkaPartitionSubsetUtils.getPartitionIdsFromConfig(_config.getStreamConfigMap());
+    if (subset != null) {
+      // The partition subset comes from the table config and is expected to remain stable until config update.
+      _partialPartitions = true;
+      _partitionIdSubset = Collections.unmodifiableList(subset);
+    } else {
+      _partialPartitions = false;
+      _partitionIdSubset = Collections.emptyList();
+    }
+  }
+
+  private List<PartitionInfo> getPartitionInfos(long timeoutMillis) {
+    List<PartitionInfo> partitionInfos = _consumer.partitionsFor(_topic, Duration.ofMillis(timeoutMillis));
+    if (CollectionUtils.isEmpty(partitionInfos)) {
+      throw new RuntimeException("Failed to fetch partition information for topic: " + _topic);
+    }
+    return partitionInfos;
+  }
+
+  private Set<Integer> toPartitionIdSet(List<PartitionInfo> partitionInfos) {
+    Set<Integer> partitionIds = Sets.newHashSetWithExpectedSize(partitionInfos.size());
+    for (PartitionInfo partitionInfo : partitionInfos) {
+      partitionIds.add(partitionInfo.partition());
+    }
+    return partitionIds;
   }
 

Review Comment:
   These two private methods (`getPartitionInfos` and `toPartitionIdSet`) are 
defined but never used in this class. The `fetchPartitionCount` and 
`fetchPartitionIds` methods call `fetchPartitionInfos` directly instead of 
using these helpers. Consider removing these unused methods to reduce code 
clutter, or if they were intended for future use, document why they're being 
kept.
   ```suggestion
   
   ```



##########
pinot-plugins/pinot-stream-ingestion/pinot-kafka-4.0/src/main/java/org/apache/pinot/plugin/stream/kafka40/KafkaStreamMetadataProvider.java:
##########
@@ -58,18 +64,54 @@ public class KafkaStreamMetadataProvider extends KafkaPartitionLevelConnectionHa
     implements StreamMetadataProvider {
 
   private static final Logger LOGGER = LoggerFactory.getLogger(KafkaStreamMetadataProvider.class);
+  /** Whether this table consumes only a subset of topic partitions (from stream.kafka.partition.ids). */
+  private final boolean _partialPartitions;
+  /**
+   * Immutable partition ID subset from table config. Read once at construction; does not change during the
+   * provider's lifetime. To change the subset, update the table config and restart the consumer.
+   */
+  private final List<Integer> _partitionIdSubset;
 
   public KafkaStreamMetadataProvider(String clientId, StreamConfig streamConfig) {
     this(clientId, streamConfig, Integer.MIN_VALUE);
   }
 
   public KafkaStreamMetadataProvider(String clientId, StreamConfig streamConfig, int partition) {
     super(clientId, streamConfig, partition);
+    List<Integer> subset =
+        KafkaPartitionSubsetUtils.getPartitionIdsFromConfig(_config.getStreamConfigMap());
+    if (subset != null) {
+      // The partition subset comes from the table config and is expected to remain stable until config update.
+      _partialPartitions = true;
+      _partitionIdSubset = Collections.unmodifiableList(subset);
+    } else {
+      _partialPartitions = false;
+      _partitionIdSubset = Collections.emptyList();
+    }
+  }
+
+  private List<PartitionInfo> getPartitionInfos(long timeoutMillis) {
+    List<PartitionInfo> partitionInfos = _consumer.partitionsFor(_topic, Duration.ofMillis(timeoutMillis));
+    if (CollectionUtils.isEmpty(partitionInfos)) {
+      throw new RuntimeException("Failed to fetch partition information for topic: " + _topic);
+    }
+    return partitionInfos;
+  }
+
+  private Set<Integer> toPartitionIdSet(List<PartitionInfo> partitionInfos) {
+    Set<Integer> partitionIds = Sets.newHashSetWithExpectedSize(partitionInfos.size());
+    for (PartitionInfo partitionInfo : partitionInfos) {
+      partitionIds.add(partitionInfo.partition());
+    }
+    return partitionIds;
   }
 

Review Comment:
   These two private methods (`getPartitionInfos` and `toPartitionIdSet`) are 
defined but never used in this class. The `fetchPartitionCount` and 
`fetchPartitionIds` methods call `fetchPartitionInfos` directly instead of 
using these helpers. Consider removing these unused methods to reduce code 
clutter, or if they were intended for future use, document why they're being 
kept.
   ```suggestion
   
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


