Thanks Michael.

> How many rows in your column families?

About 5 million rows (500w); each row holds about 1 KB of data.
> How often do you need to do this?

Once a day.

> example Hadoop map/reduce jobs in the examples folder

Thanks, I have seen the source code; it uses the Thrift API as the
RecordReader to iterate over the rows, and I don't think that is a
high-performance method.

> you could look into Pig

Could you please give more details on Pig?

> So avoid that unless you really know what you're doing which is what ...

The point of that step is to purge the tombstones. Another option would be
to use a map/reduce job to do the purging without a major compaction.
(Rough sketches of both the paging and the map/reduce approaches are in
the P.S. below.)

Best,
Rick

On Wed, Jan 30, 2013 at 1:15 PM, Michael Kjellman <mkjell...@barracuda.com> wrote:

> How often do you need to do this? How many rows in your column families?
>
> If it's not a frequent operation, you can just page the data n rows at a
> time using nothing special but C* and a driver.
>
> Or, if you need an entire cf to be your input, you can write a map/reduce
> job. There are example Hadoop map/reduce jobs in the examples folder
> included with Cassandra. Or, if you don't want to write an M/R job, you
> could look into Pig.
>
> Your method sounds a bit crazy IMHO and I'd definitely recommend against
> it. Better to let the database (C*) do its thing. If you're super worried
> about having more than one sstable, you can run a major compaction, but
> that's not recommended, as it will take a while for a new sstable to get
> big enough to merge with the other big sstable. So avoid that unless you
> really know what you're doing, which is what it sounds like you're
> proposing in point 3 ;)
>
> From: "dong.yajun" <dongt...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Tuesday, January 29, 2013 9:02 PM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Is there any way to fetch all data efficiently from a column
> family?
>
> hey List,
>
> I am considering a way to read all the data from a column family; the
> following is my thinking:
>
> 1. Make a snapshot of the particular column family on all nodes in the
> cluster at the same time.
>
> 2. Copy these sstables from the Cassandra nodes to local disk.
>
> 3. Compact these sstables into a single one.
>
> 4. Parse the sstable into individual rows.
>
> My problem is step 2: assuming the replication factor is 3, the amount of
> data I need to copy is (3 * the number of bytes of all rows in this column
> family). Are there any proposals on this?
>
> --
> Rick Dong

--
Ric Dong
Newegg Ecommerce, MIS department
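P.S. A minimal sketch of the paging approach Michael describes, iterating
the full token range via CQL. This assumes the DataStax Java driver and
Murmur3Partitioner (long tokens); the contact point, keyspace, and table
names (my_keyspace, my_cf) are placeholders, so adapt to your own client
library and schema:

    import java.util.List;

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class PageAllRows {

        public static void main(String[] args) {
            // Placeholder contact point and keyspace.
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("my_keyspace");

            final int pageSize = 1000;
            Long lastToken = null;
            while (true) {
                // token(key) imposes a total order over all rows, so each
                // page can restart where the previous one left off.
                String cql = (lastToken == null)
                        ? "SELECT key, token(key) FROM my_cf LIMIT " + pageSize
                        : "SELECT key, token(key) FROM my_cf WHERE token(key) > "
                          + lastToken + " LIMIT " + pageSize;
                List<Row> rows = session.execute(cql).all();
                for (Row row : rows) {
                    lastToken = row.getLong(1); // long tokens => Murmur3Partitioner
                    // ... process the row here ...
                }
                if (rows.size() < pageSize) {
                    break; // final (short) page reached
                }
            }
            cluster.close();
        }
    }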
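P.P.S. And a skeleton of the map/reduce route, modeled loosely on the
hadoop_word_count example that ships in Cassandra's examples folder. Note
it still reads through ColumnFamilyInputFormat, i.e. over Thrift, which is
exactly the path whose performance I was unsure about. The cluster
address, keyspace/cf names, and output path are placeholders:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.SortedMap;

    import org.apache.cassandra.db.IColumn;
    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.thrift.SliceRange;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DumpColumnFamily {

        // Each map() call receives one row: its key and a slice of its columns.
        public static class RowMapper
                extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, Text> {
            @Override
            protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns,
                               Context context) throws IOException, InterruptedException {
                // ... emit whatever representation of the row you need ...
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "dump-column-family");
            job.setJarByClass(DumpColumnFamily.class);
            job.setMapperClass(RowMapper.class);
            job.setInputFormatClass(ColumnFamilyInputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileOutputFormat.setOutputPath(job, new Path(args[0]));

            Configuration conf = job.getConfiguration();
            // Placeholder cluster/keyspace/cf settings.
            ConfigHelper.setInputInitialAddress(conf, "127.0.0.1");
            ConfigHelper.setInputRpcPort(conf, "9160");
            ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner");
            ConfigHelper.setInputColumnFamily(conf, "my_keyspace", "my_cf");

            // Empty start/finish => all columns of each row, capped at 1000.
            SlicePredicate predicate = new SlicePredicate().setSlice_range(
                    new SliceRange(ByteBuffer.wrap(new byte[0]),
                                   ByteBuffer.wrap(new byte[0]), false, 1000));
            ConfigHelper.setInputSlicePredicate(conf, predicate);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }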