How often do you need to do this? How many rows are in your column family? If it's not a frequent operation, you can simply page through the data n rows at a time using nothing special but C* and a driver.
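As a rough illustration of that paging pattern, here is a minimal, self-contained sketch. It uses an in-memory dict in place of a real cluster, and `fetch_page`/`scan_all` are hypothetical names, not driver APIs; with an actual Cassandra driver you would issue successive range queries (e.g. by `token(key)`) or use the driver's built-in result paging instead.

```python
def fetch_page(table, start_key, page_size):
    """Return up to page_size (key, value) rows with key > start_key.

    Stands in for one driver round trip, e.g. a query shaped like
    "give me the next page_size rows after the last key I saw".
    The mock iterates keys in sorted order; a real cluster with a
    random partitioner would page in token order, not key order.
    """
    rows = [(k, v) for k, v in sorted(table.items())
            if start_key is None or k > start_key]
    return rows[:page_size]

def scan_all(table, page_size=2):
    """Page through every row, resuming each page from the last key seen."""
    last_key = None
    while True:
        page = fetch_page(table, last_key, page_size)
        if not page:
            break  # no rows left past last_key: scan is complete
        for key, value in page:
            yield key, value
        last_key = page[-1][0]  # next page starts after this key

# Example: a tiny mock column family keyed by row key.
cf = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}
print(list(scan_all(cf, page_size=2)))
```

The point is that each page's last key seeds the next query, so the full scan never holds more than one page in memory at a time.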
Another option, if you need an entire column family as input, is to write a map/reduce job. There are example Hadoop map/reduce jobs in the examples folder included with Cassandra. If you don't want to write an M/R job, you could look into Pig.

Your method sounds a bit crazy IMHO and I'd definitely recommend against it. Better to let the database (C*) do its thing. If you're worried about having more than one sstable, you can run a major compaction, but that's not recommended: afterwards it will take a long time before a new sstable grows big enough to merge with the one big sstable. So avoid that unless you really know what you're doing, which is what it sounds like you're proposing in point 3 ;)

From: "dong.yajun" <dongt...@gmail.com>
Reply-To: user@cassandra.apache.org
Date: Tuesday, January 29, 2013 9:02 PM
To: user@cassandra.apache.org
Subject: Is there any way to fetch all data efficiently from a column family?

hey List,

I'm considering a way to read all data from a column family. Here are my thoughts:

1. Make a snapshot of the column family in question on all nodes of the cluster at the same time.
2. Copy these sstables from the Cassandra nodes to local disk.
3. Compact these sstables into a single one.
4. Parse the sstable into individual rows.

My problem is with step 2: assuming the replication factor is 3, the amount of data I need to copy is 3 * (the number of bytes of all rows in this column family). Are there any proposals on this?

-- Rick Dong