Hi Cassandra Community,

I’m on Cassandra *4.1.10* and am looking for clarification on how the community recommends handling an issue with range queries: coordinator-to-replica communication when the cluster topology changes.

*The Context:*

1. When executing a range query, the coordinator node splits the input range into multiple smaller sub-ranges defined by vnode boundaries (merging adjacent sub-ranges where possible as an optimization). It then sends a replica request for each of these sub-ranges.
2. There is a byte-size limit on the response to each of these internal replica requests (defaulting to 128MB, tunable via *internode_max_message_size* in cassandra.yaml).

*The Problem:*

Because the size of a sub-range response scales with the length of the sub-range, it naturally increases as the number of vnodes in the cluster decreases. Consequently, operations that reduce the overall number of vnodes, such as horizontally shrinking the cluster or migrating to a single-token architecture (e.g., from 16 vnodes to 1), can cause previously successful range queries to fail. The internal sub-range responses simply become too large and exceed the internode message size limit.

*Questions for the Community:*

When encountering this failure, what is the recommended mitigation strategy? I see two primary approaches, and I'd appreciate clarification on the trade-offs of each, as well as any alternatives I may have missed:

1. *Reduce the client-side page size:* Page size plays a crucial role in limiting the maximum size of internal range queries. However, reducing it increases the number of client-server round trips, since more pages are required to fetch the same data. Are there specific guidelines on balancing this trade-off?
2. *Increase the internode size limit:* If we adjust the limit in cassandra.yaml, what is considered "too big"? How much larger than the default 128MB is generally considered safe?
3. *Other alternatives:* Are there any other pitfalls, workarounds, or tricks I should be aware of when dealing with this specific edge case?

Thanks in advance for your insights!
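
P.S. To make the scaling argument concrete, here is a rough back-of-envelope sketch in Python. It is a simplified model, not Cassandra's actual logic: it assumes data is spread evenly over the ring, that the number of sub-ranges is roughly nodes × vnodes per node, that a sub-range response is capped by the page size in rows, and it ignores the merging of adjacent sub-ranges. The function names and all the numbers (row count, row size, node count, page sizes) are hypothetical and chosen only for illustration.

```python
# Rough model of how large one coordinator-to-replica sub-range response can get.
# Assumptions (not taken from Cassandra's source): even data distribution,
# sub-range count ~= nodes * vnodes_per_node, response capped by page size in rows.

# The 128MB default quoted above, treated here as 128 MiB.
LIMIT_BYTES = 128 * 1024 * 1024


def estimate_subrange_response_bytes(total_rows, avg_row_bytes, nodes,
                                     vnodes_per_node, page_size_rows):
    """Estimate the response size of a single sub-range request."""
    subranges = nodes * vnodes_per_node
    rows_per_subrange = total_rows / subranges
    # A replica should not need to return more rows than the page asks for.
    rows_returned = min(rows_per_subrange, page_size_rows)
    return rows_returned * avg_row_bytes


def exceeds_limit(response_bytes, limit_bytes=LIMIT_BYTES):
    """True if the estimated response would not fit in one internode message."""
    return response_bytes > limit_bytes


if __name__ == "__main__":
    # Hypothetical table: 1.92M rows of ~4 KiB each on a 6-node cluster.
    scenarios = [
        ("16 vnodes, page size 100k", 16, 100_000),
        ("1 token,   page size 100k", 1, 100_000),
        ("1 token,   page size 25k", 1, 25_000),
    ]
    for label, vnodes, page_size in scenarios:
        size = estimate_subrange_response_bytes(
            total_rows=1_920_000, avg_row_bytes=4096, nodes=6,
            vnodes_per_node=vnodes, page_size_rows=page_size)
        print(f"{label}: ~{size / 1e6:.1f} MB, exceeds limit: {exceeds_limit(size)}")
```

Under these assumptions, the same query fits within the limit at 16 vnodes, exceeds it after moving to a single token, and fits again once the page size is reduced, which is the trade-off I'm asking about in question 1.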
