I didn't explain clearly - I'm not requesting 20000 unknown keys (resulting in a full scan), I'm requesting 20000 specific rows by key. On Jun 10, 2014 6:02 PM, "DuyHai Doan" <doanduy...@gmail.com> wrote:
> Hello Jeremy
>
> Basically what you are doing is to ask Cassandra to do a distributed full
> scan on all the partitions across the cluster, it's normal that the nodes
> are somehow.... stressed.
>
> How did you make the query? Are you using Thrift or CQL3 API?
>
> Please note that there is another way to get all partition keys: SELECT
> DISTINCT <partition_key> FROM ..., more details here:
> www.datastax.com/dev/blog/cassandra-2-0-1-2-0-2-and-a-quick-peek-at-2-0-3
>
>
> I ran an application today that attempted to fetch 20,000+ unique row keys
> in one query against a set of completely empty column families. On a 4-node
> cluster (EC2 m1.large instances) with the recommended memory settings (2 GB
> heap), every single node immediately ran out of memory and became
> unresponsive, to the point where I had to kill -9 the cassandra processes.
>
> Now clearly this query is not the best idea in the world, but the effects
> of it are a bit disturbing. What could be going on here? Are there any
> other query pitfalls I should be aware of that have the potential to
> explode the entire cluster?
>
> -j
>
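For what it's worth, the usual way to avoid blowing up the coordinator with a huge multiget is to page the key list client-side and issue many small queries instead of one giant one. Here is a minimal sketch; the `chunked` helper, the batch size of 500, and the keyspace/table names in the comment are illustrative assumptions, not part of any driver API:

```python
# Sketch: page a large multiget client-side rather than sending all
# 20,000 keys in a single statement. Batch size 500 is an assumption;
# tune it for your row sizes and cluster.

def chunked(keys, size=500):
    """Yield successive slices of `keys` with at most `size` elements each."""
    for i in range(0, len(keys), size):
        yield keys[i:i + size]

# With a CQL3 client this would look roughly like (hypothetical
# keyspace/table names, session object assumed already connected):
#
#   for batch in chunked(all_keys, 500):
#       placeholders = ", ".join("%s" for _ in batch)
#       rows = session.execute(
#           "SELECT * FROM my_ks.my_cf WHERE key IN (%s)" % placeholders,
#           batch)
#
# Each request then stays small enough that no single coordinator has to
# buffer 20,000 rows' worth of responses in its heap at once.

keys = list(range(20000))
batches = list(chunked(keys, 500))
```

The trade-off is more round trips, but each one has bounded memory cost on the coordinator, which is exactly what a single 20,000-key IN query does not.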