Copilot commented on code in PR #17587: URL: https://github.com/apache/pinot/pull/17587#discussion_r2841534901
########## pinot-plugins/pinot-stream-ingestion/pinot-kafka-base/src/main/java/org/apache/pinot/plugin/stream/kafka/KafkaPartitionSubsetUtils.java: ########## @@ -0,0 +1,79 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.pinot.plugin.stream.kafka; + +import java.util.ArrayList; +import java.util.Collections; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Set; +import javax.annotation.Nullable; +import org.apache.commons.lang3.StringUtils; + + +/** + * Utilities for parsing and validating Kafka partition subset configuration + * (stream.kafka.partition.ids) from stream config. + */ +public final class KafkaPartitionSubsetUtils { + + private KafkaPartitionSubsetUtils() { + } + + /** + * Reads the optional comma-separated partition ID list from the stream config map. + * Returns a sorted, deduplicated list for stable ordering when used for partition group metadata. + * Duplicate IDs in the config are silently removed to avoid duplicate processing of the same partition. + * + * @param streamConfigMap table stream config map (e.g. 
from + * {@link org.apache.pinot.spi.stream.StreamConfig#getStreamConfigsMap()}) + * @return Sorted list of unique partition IDs when stream.kafka.partition.ids is set and non-empty; + * null when not set or blank + * @throws IllegalArgumentException if the value contains invalid (non-integer) entries + */ + @Nullable + public static List<Integer> getPartitionIdsFromConfig(Map<String, String> streamConfigMap) { + String key = KafkaStreamConfigProperties.constructStreamProperty(KafkaStreamConfigProperties.PARTITION_IDS); + String value = streamConfigMap.get(key); + if (StringUtils.isBlank(value)) { + return null; + } + String[] parts = value.split(","); + Set<Integer> idSet = new HashSet<>(parts.length); + for (String part : parts) { + String trimmed = part.trim(); + if (trimmed.isEmpty()) { + continue; + } + try { + idSet.add(Integer.parseInt(trimmed)); + } catch (NumberFormatException e) { + throw new IllegalArgumentException( + "Invalid " + key + " value: expected comma-separated integers, got '" + value + "'", e); + } + } + if (idSet.isEmpty()) { + return null; + } + List<Integer> ids = new ArrayList<>(idSet); + Collections.sort(ids); + return ids; + } Review Comment: The method accepts negative partition IDs without validation. Kafka partition IDs are always non-negative integers. Consider adding validation to reject negative values and throw an IllegalArgumentException with a clear message, such as "Partition IDs must be non-negative". This will provide early feedback if a user mistakenly configures negative values like "-1,0,1". 
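A minimal sketch of the suggested non-negative check, for illustration only. It is not the code in the PR: the class name `PartitionIdParsingSketch` and the method `parsePartitionIds` are hypothetical, the key is passed in directly instead of being built with `KafkaStreamConfigProperties`, and plain JDK calls replace `StringUtils.isBlank` so the block stands alone (so `trim()` here only approximates `isBlank`). The exact error wording is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public final class PartitionIdParsingSketch {

  private PartitionIdParsingSketch() {
  }

  /**
   * Parses a comma-separated partition ID list and rejects negative values.
   * Returns a sorted, deduplicated list, or null when the value is null, blank,
   * or contains only empty entries (mirroring the behavior described in the PR's Javadoc).
   */
  public static List<Integer> parsePartitionIds(String key, String value) {
    if (value == null || value.trim().isEmpty()) {
      return null;
    }
    // TreeSet deduplicates and keeps the IDs in ascending order.
    TreeSet<Integer> ids = new TreeSet<>();
    for (String part : value.split(",")) {
      String trimmed = part.trim();
      if (trimmed.isEmpty()) {
        continue;
      }
      int id;
      try {
        id = Integer.parseInt(trimmed);
      } catch (NumberFormatException e) {
        throw new IllegalArgumentException(
            "Invalid " + key + " value: expected comma-separated integers, got '" + value + "'", e);
      }
      // Hypothetical extra check: Kafka partition IDs are never negative.
      if (id < 0) {
        throw new IllegalArgumentException(
            "Invalid " + key + " value: partition IDs must be non-negative, got " + id + " in '" + value + "'");
      }
      ids.add(id);
    }
    return ids.isEmpty() ? null : new ArrayList<>(ids);
  }
}
```

With a check like this, a value such as "-1,0,1" would fail at config parsing time instead of surfacing later during consumption.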
########## pinot-plugins/pinot-stream-ingestion/pinot-kafka-4.0/src/main/java/org/apache/pinot/plugin/stream/kafka40/KafkaStreamMetadataProvider.java: ########## @@ -58,18 +64,54 @@ public class KafkaStreamMetadataProvider extends KafkaPartitionLevelConnectionHa implements StreamMetadataProvider { private static final Logger LOGGER = LoggerFactory.getLogger(KafkaStreamMetadataProvider.class); + /** Whether this table consumes only a subset of topic partitions (from stream.kafka.partition.ids). */ + private final boolean _partialPartitions; + /** + * Immutable partition ID subset from table config. Read once at construction; does not change during the + * provider's lifetime. To change the subset, update the table config and restart the consumer. + */ + private final List<Integer> _partitionIdSubset; public KafkaStreamMetadataProvider(String clientId, StreamConfig streamConfig) { this(clientId, streamConfig, Integer.MIN_VALUE); } public KafkaStreamMetadataProvider(String clientId, StreamConfig streamConfig, int partition) { super(clientId, streamConfig, partition); + List<Integer> subset = + KafkaPartitionSubsetUtils.getPartitionIdsFromConfig(_config.getStreamConfigMap()); + if (subset != null) { + // The partition subset comes from the table config and is expected to remain stable until config update. + _partialPartitions = true; + _partitionIdSubset = Collections.unmodifiableList(subset); + } else { + _partialPartitions = false; + _partitionIdSubset = Collections.emptyList(); + } + } Review Comment: The PR description mentions "validate configured IDs against topic metadata", but the implementation doesn't validate that the configured partition IDs actually exist in the Kafka topic. While implicit validation will occur when attempting to consume (resulting in runtime errors), early validation during metadata provider construction could provide clearer feedback. 
Consider adding validation that fetches topic partition metadata and verifies all configured partition IDs exist, or update the PR description if this validation was intentionally deferred to runtime. ########## pinot-plugins/pinot-stream-ingestion/pinot-kafka-3.0/src/main/java/org/apache/pinot/plugin/stream/kafka30/KafkaStreamMetadataProvider.java: ########## @@ -58,18 +64,54 @@ public class KafkaStreamMetadataProvider extends KafkaPartitionLevelConnectionHa implements StreamMetadataProvider { private static final Logger LOGGER = LoggerFactory.getLogger(KafkaStreamMetadataProvider.class); + /** Whether this table consumes only a subset of topic partitions (from stream.kafka.partition.ids). */ + private final boolean _partialPartitions; + /** + * Immutable partition ID subset from table config. Read once at construction; does not change during the + * provider's lifetime. To change the subset, update the table config and restart the consumer. + */ + private final List<Integer> _partitionIdSubset; public KafkaStreamMetadataProvider(String clientId, StreamConfig streamConfig) { this(clientId, streamConfig, Integer.MIN_VALUE); } public KafkaStreamMetadataProvider(String clientId, StreamConfig streamConfig, int partition) { super(clientId, streamConfig, partition); + List<Integer> subset = + KafkaPartitionSubsetUtils.getPartitionIdsFromConfig(_config.getStreamConfigMap()); + if (subset != null) { + // The partition subset comes from the table config and is expected to remain stable until config update. + _partialPartitions = true; + _partitionIdSubset = Collections.unmodifiableList(subset); + } else { + _partialPartitions = false; + _partitionIdSubset = Collections.emptyList(); + } + } Review Comment: The PR description mentions "validate configured IDs against topic metadata", but the implementation doesn't validate that the configured partition IDs actually exist in the Kafka topic. 
While implicit validation will occur when attempting to consume (resulting in runtime errors), early validation during metadata provider construction could provide clearer feedback. Consider adding validation that fetches topic partition metadata and verifies all configured partition IDs exist, or update the PR description if this validation was intentionally deferred to runtime. ########## pinot-plugins/pinot-stream-ingestion/pinot-kafka-3.0/src/main/java/org/apache/pinot/plugin/stream/kafka30/KafkaStreamMetadataProvider.java: ########## @@ -58,18 +64,54 @@ public class KafkaStreamMetadataProvider extends KafkaPartitionLevelConnectionHa implements StreamMetadataProvider { private static final Logger LOGGER = LoggerFactory.getLogger(KafkaStreamMetadataProvider.class); + /** Whether this table consumes only a subset of topic partitions (from stream.kafka.partition.ids). */ + private final boolean _partialPartitions; + /** + * Immutable partition ID subset from table config. Read once at construction; does not change during the + * provider's lifetime. To change the subset, update the table config and restart the consumer. + */ + private final List<Integer> _partitionIdSubset; public KafkaStreamMetadataProvider(String clientId, StreamConfig streamConfig) { this(clientId, streamConfig, Integer.MIN_VALUE); } public KafkaStreamMetadataProvider(String clientId, StreamConfig streamConfig, int partition) { super(clientId, streamConfig, partition); + List<Integer> subset = + KafkaPartitionSubsetUtils.getPartitionIdsFromConfig(_config.getStreamConfigMap()); + if (subset != null) { + // The partition subset comes from the table config and is expected to remain stable until config update. 
+ _partialPartitions = true; + _partitionIdSubset = Collections.unmodifiableList(subset); + } else { + _partialPartitions = false; + _partitionIdSubset = Collections.emptyList(); + } + } + + private List<PartitionInfo> getPartitionInfos(long timeoutMillis) { + List<PartitionInfo> partitionInfos = _consumer.partitionsFor(_topic, Duration.ofMillis(timeoutMillis)); + if (CollectionUtils.isEmpty(partitionInfos)) { + throw new RuntimeException("Failed to fetch partition information for topic: " + _topic); + } + return partitionInfos; + } + + private Set<Integer> toPartitionIdSet(List<PartitionInfo> partitionInfos) { + Set<Integer> partitionIds = Sets.newHashSetWithExpectedSize(partitionInfos.size()); + for (PartitionInfo partitionInfo : partitionInfos) { + partitionIds.add(partitionInfo.partition()); + } + return partitionIds; } Review Comment: These two private methods (`getPartitionInfos` and `toPartitionIdSet`) are defined but never used in this class. The `fetchPartitionCount` and `fetchPartitionIds` methods call `fetchPartitionInfos` directly instead of using these helpers. Consider removing these unused methods to reduce code clutter, or if they were intended for future use, document why they're being kept. ```suggestion ``` ########## pinot-plugins/pinot-stream-ingestion/pinot-kafka-4.0/src/main/java/org/apache/pinot/plugin/stream/kafka40/KafkaStreamMetadataProvider.java: ########## @@ -58,18 +64,54 @@ public class KafkaStreamMetadataProvider extends KafkaPartitionLevelConnectionHa implements StreamMetadataProvider { private static final Logger LOGGER = LoggerFactory.getLogger(KafkaStreamMetadataProvider.class); + /** Whether this table consumes only a subset of topic partitions (from stream.kafka.partition.ids). */ + private final boolean _partialPartitions; + /** + * Immutable partition ID subset from table config. Read once at construction; does not change during the + * provider's lifetime. 
To change the subset, update the table config and restart the consumer. + */ + private final List<Integer> _partitionIdSubset; public KafkaStreamMetadataProvider(String clientId, StreamConfig streamConfig) { this(clientId, streamConfig, Integer.MIN_VALUE); } public KafkaStreamMetadataProvider(String clientId, StreamConfig streamConfig, int partition) { super(clientId, streamConfig, partition); + List<Integer> subset = + KafkaPartitionSubsetUtils.getPartitionIdsFromConfig(_config.getStreamConfigMap()); + if (subset != null) { + // The partition subset comes from the table config and is expected to remain stable until config update. + _partialPartitions = true; + _partitionIdSubset = Collections.unmodifiableList(subset); + } else { + _partialPartitions = false; + _partitionIdSubset = Collections.emptyList(); + } + } + + private List<PartitionInfo> getPartitionInfos(long timeoutMillis) { + List<PartitionInfo> partitionInfos = _consumer.partitionsFor(_topic, Duration.ofMillis(timeoutMillis)); + if (CollectionUtils.isEmpty(partitionInfos)) { + throw new RuntimeException("Failed to fetch partition information for topic: " + _topic); + } + return partitionInfos; + } + + private Set<Integer> toPartitionIdSet(List<PartitionInfo> partitionInfos) { + Set<Integer> partitionIds = Sets.newHashSetWithExpectedSize(partitionInfos.size()); + for (PartitionInfo partitionInfo : partitionInfos) { + partitionIds.add(partitionInfo.partition()); + } + return partitionIds; } Review Comment: These two private methods (`getPartitionInfos` and `toPartitionIdSet`) are defined but never used in this class. The `fetchPartitionCount` and `fetchPartitionIds` methods call `fetchPartitionInfos` directly instead of using these helpers. Consider removing these unused methods to reduce code clutter, or if they were intended for future use, document why they're being kept. ```suggestion ``` -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
