github-actions[bot] commented on code in PR #64638:
URL: https://github.com/apache/doris/pull/64638#discussion_r3434682078


##########
fe/fe-common/src/main/java/org/apache/doris/common/Config.java:
##########
@@ -3389,6 +3389,12 @@ public static int metaServiceRpcRetryTimes() {
                             + "to other BEs in cloud mode."})
     public static int rehash_tablet_after_be_dead_seconds = 3600;
 
+    @ConfField(mutable = true, masterOnly = true,
+            description = {
+                    "Whether to use rendezvous hashing for colocate bucket 
placement in cloud mode. "
+                            + "If false, use the legacy modulo placement."})
+    public static boolean enable_cloud_colocate_consistent_hash = true;

Review Comment:
   This config is mutable, but placement is read directly from `Config` for 
each `CloudReplica` while `OlapScanNode` builds scan ranges. If an admin flips 
it with `ADMIN SET ALL FRONTENDS CONFIG` while a colocate join is being 
planned, one scan node/bucket can be assigned with modulo and the matching 
table with HRW. `LoadBalanceScanWorkerSelector` then chooses the worker from 
the first scan node and `filterReplicaByWorkerInBucket` throws `Can not find 
tablet ...` for the second. Please either make the switch 
non-mutable/restart-only, or snapshot the placement mode once per 
statement/scan-range build and pass that snapshot through all placement calls.



##########
fe/fe-core/src/main/java/org/apache/doris/catalog/ColocateTableIndex.java:
##########
@@ -427,6 +427,11 @@ public GroupId getGroupNoLock(long tableId) {
         return table2Group.get(tableId);
     }
 
+    public int getBucketsNumNoLock(GroupId groupId) {
+        Preconditions.checkState(group2Schema.containsKey(groupId));

Review Comment:
   `group2Schema` is a plain `HashMap` protected by `ColocateTableIndex.lock` 
everywhere else. This no-lock read is called from `CloudReplica.getBackendId()` 
without the index read lock, while `removeTable()`, `addTableToGroup()`, and 
replay paths mutate `group2Schema` under the write lock. A concurrent 
drop/replay can make `containsKey` true and then `get` return null, or race 
with `HashMap` internals. The existing no-lock accessor is safe only because 
`table2Group` is a `ConcurrentMap`. Please use a locked accessor or make this 
map a concurrent/safely-published structure before reading it from the hot path.



##########
fe/fe-core/src/main/java/org/apache/doris/cloud/system/CloudSystemInfoService.java:
##########
@@ -97,8 +100,94 @@ public class CloudSystemInfoService extends 
SystemInfoService {
     // clusterId -> ComputeGroup
     protected Map<String, ComputeGroup> computeGroupIdToComputeGroup = new 
ConcurrentHashMap<>();

Review Comment:
   `colocatePlacementCache` has no lifecycle cleanup or bound. Every queried 
`(GroupId, clusterId)` stores a `long[bucketNum]`, but 
`ColocateTableIndex.removeTable()` removes `group2Schema`/`group2Tables` 
without clearing this cache, and dropped compute groups are not cleared either. 
A workload that repeatedly creates/drops colocate groups and queries them pins 
those arrays in `CloudSystemInfoService` for the FE lifetime. Please evict 
cache entries when a colocate group/compute group is removed, or use a bounded 
cache tied to the metadata lifecycle.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to