Hello.

I have a schema that represents a filesystem and one example of a Super CF is:

CF FilesPerDir: (DIRNAME -> (FILENAME -> (attribute1: value1, attribute2: 
value2)))
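
When read back through pycassa, a single row keyed by DIRNAME looks roughly 
like this (the filenames and attribute values below are just placeholders):

    {'file_a.txt': {'attribute1': 'value1', 'attribute2': 'value2'},
     'file_b.txt': {'attribute1': 'value1', 'attribute2': 'value2'}}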

When a directory is moved, I have to fetch all files of that directory and 
its subdirectories. This means either one Cassandra query per dir, or a 
single multiget() for all the dirs involved. The multiget() is more 
efficient, but I'm having trouble limiting the size of the data returned so 
that I don't crash the Cassandra node.
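
Roughly, these are the two access patterns I'm comparing (the keyspace name, 
host and dirnames list are placeholders):

    import pycassa

    pool = pycassa.ConnectionPool('Filesystem', ['localhost:9160'])
    files_per_dir = pycassa.ColumnFamily(pool, 'FilesPerDir')

    # option 1: one round trip per directory
    for dirname in dirnames:
        files = files_per_dir.get(dirname)

    # option 2: a single multiget() for all the directories involved
    all_files = files_per_dir.multiget(dirnames)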

I'm using the pycassa client lib, and until now I have been doing 
per-directory get()s specifying a column_count. This effectively limits the 
size of the dataset, but I would like to perform a multiget() to fetch the 
contents of multiple dirs at a time. The problem is that column_count 
appears to apply per key, not globally across the result set. So if I set 
column_count = 10000, as I have now, but fetch 1000 dirs (rows) and each one 
happens to have 10000 files (columns), the result set is 1000 x 10000 
columns. Is there a better way to query for this data, or does multiget() 
deal with this through the "buffer_size"?
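
For reference, and continuing from the snippet above, this is roughly the 
call pattern I mean (10000 is the limit I use today; the buffer_size value 
is a guess, since I'm not sure that is what it is meant for):

    # current approach: the limit is enforced per directory row
    files = files_per_dir.get(dirname, column_count=10000)

    # what I'd like: one round trip for many dirs, but column_count
    # seems to apply per key, so the worst case is
    # len(dirnames) * 10000 columns coming back at once
    all_files = files_per_dir.multiget(dirnames, column_count=10000,
                                       buffer_size=100)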

Thanks,
André
