How often do you need to do this? How many rows in your column families?

If it's not a frequent operation, you can just page the data n rows at a time 
using nothing special, just C* and a driver.
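To make the paging idea concrete, here's a minimal sketch of key-based paging in pure Python. The `fetch_page` function is a hypothetical stand-in for a real driver range query (e.g. a Thrift `get_range_slices` call); the in-memory dict stands in for the column family.

```python
# Illustrative sketch only: fetch_page is a hypothetical stand-in for a
# driver's range query against C*; the dict stands in for a column family.

def fetch_page(rows, start_key, page_size):
    """Return up to page_size (key, value) pairs with key >= start_key,
    in key order -- mimics a range query against the cluster."""
    keys = sorted(k for k in rows if k >= start_key)[:page_size]
    return [(k, rows[k]) for k in keys]

def iterate_all(rows, page_size=2):
    """Page through the whole 'column family' n rows at a time.
    Each page starts just past the last key of the previous page."""
    start = ""
    while True:
        page = fetch_page(rows, start, page_size)
        if not page:
            break
        for key, value in page:
            yield key, value
        # Resume strictly after the last key we saw.
        start = page[-1][0] + "\x00"

if __name__ == "__main__":
    cf = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}
    print(list(iterate_all(cf)))  # all rows, fetched two at a time
```

The same loop shape works with any driver that supports range queries with a start key; only `fetch_page` changes.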

Another option: if you need an entire column family as your input, you can 
write a map/reduce job. There are example Hadoop map/reduce jobs in the 
examples folder included with Cassandra. Or, if you don't want to write an 
M/R job, you could look into Pig.
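For a feel of what such a job does, here's a toy map/reduce over column-family rows in pure Python, mimicking the shape of the word_count example shipped with Cassandra. This is illustrative only; a real job would read rows via Cassandra's Hadoop input format rather than from a dict.

```python
from collections import defaultdict

# Toy map/reduce over column-family rows. Purely illustrative: the dict
# stands in for rows a real Hadoop job would receive from Cassandra.

def map_phase(rows):
    """Mapper: emit (word, 1) for every word in every column value."""
    for _key, columns in rows.items():
        for value in columns.values():
            for word in value.split():
                yield word, 1

def reduce_phase(pairs):
    """Reducer: sum the counts per word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

if __name__ == "__main__":
    cf = {
        "row1": {"body": "hello world"},
        "row2": {"body": "hello cassandra"},
    }
    print(reduce_phase(map_phase(cf)))
```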

Your method sounds a bit crazy IMHO and I'd definitely recommend against it. 
Better to let the database (C*) do its thing. If you're really worried about 
having more than one sstable, you can run major compactions, but that's not 
recommended: it will take a while before a new sstable grows big enough to 
merge with the one big sstable it produces. So avoid that unless you really 
know what you're doing, which is what it sounds like you're proposing in 
point 3 ;)

From: "dong.yajun" <dongt...@gmail.com>
Reply-To: user@cassandra.apache.org
Date: Tuesday, January 29, 2013 9:02 PM
To: user@cassandra.apache.org
Subject: Is there any way to fetch all data efficiently from a column family?

hey List,

I'm considering a way to read all data from a column family. The following is 
my thinking:

1. Take a snapshot of a specific column family on all nodes in the cluster at 
the same time.

2. Copy those sstables from the Cassandra nodes to local disk.

3. Compact the sstables into a single one.

4. Parse the sstable into individual rows.

My problem is step 2: assuming a replication factor of 3, the amount of data I 
need to copy is 3 * (the number of bytes of all rows in this column family). 
Are there any proposals on this?
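One way to picture avoiding the RF-times blowup in step 2 is to copy each token range from exactly one of its replicas, so each row is transferred once instead of RF times. The sketch below is hypothetical (the node names and range layout are invented); it just shows a greedy one-replica-per-range assignment that also balances load across nodes.

```python
# Hypothetical sketch: assign each token range to exactly one of its
# replicas so each row is copied once, not RF times. Node names and
# ranges below are invented for illustration.

def assign_ranges(range_replicas):
    """range_replicas: {token_range: [replica nodes]}.
    Greedily pick the least-loaded replica for each range."""
    load = {}
    plan = {}
    for rng, replicas in sorted(range_replicas.items()):
        node = min(replicas, key=lambda n: load.get(n, 0))
        plan[rng] = node
        load[node] = load.get(node, 0) + 1
    return plan

if __name__ == "__main__":
    # RF = 3: every range has three replicas, but we copy from only one.
    ranges = {
        (0, 100): ["n1", "n2", "n3"],
        (100, 200): ["n2", "n3", "n1"],
        (200, 300): ["n3", "n1", "n2"],
    }
    print(assign_ranges(ranges))
```

With a plan like this, the total bytes copied is roughly the size of one copy of the data rather than RF copies.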

--
Rick Dong
