And finally, to make wide rows with C* and Hadoop even better, these problems have already been addressed by tickets such as (not an exhaustive list):

https://issues.apache.org/jira/browse/CASSANDRA-3264
https://issues.apache.org/jira/browse/CASSANDRA-2878

And a nicer, more up-to-date doc from the 1.1 branch, from DataStax:
http://www.datastax.com/docs/1.1/cluster_architecture/hadoop_integration

From: Michael Kjellman <mkjell...@barracuda.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, January 29, 2013 10:36 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Is there any way to fetch all data efficiently from a column family?

Yes, wide rows, but that doesn't seem horrible by any means. People have gotten by with Thrift for many, many years in the community. If you are running this once a day, latency shouldn't be a major concern, and I doubt the protocol is going to be your primary bottleneck.

To answer your question about Pig (http://pig.apache.org): "Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets." Pretty much, Pig lets you write Pig Latin to create MapReduce programs without writing an actual Java MapReduce program.

Here is a really old wiki article (which badly needs updating) about the various Hadoop support built into C*: http://wiki.apache.org/cassandra/HadoopSupport

On your last point: compaction does deal with tombstones, yes, but generally you only run minor compactions. A major compaction says: take every sstable for this column family and make one MASSIVE sstable from all the little sstables. That is different from standard C* operation. Map/Reduce doesn't purge anything and has nothing to do with compaction; it is just a reasonably sane way to iterate over a large amount of data stored in C*, and conveniently C* provides input and output formats for Hadoop, so you can do fun things like iterate over 500w (5 million) rows with 1k columns each.

Honestly, the best thing you can do is benchmark Hadoop and see how it works for your workload and specific project requirements.

Best,
Michael

From: "dong.yajun" <dongt...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, January 29, 2013 10:11 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Is there any way to fetch all data efficiently from a column family?

Thanks Michael.

> How many rows in your column families?
About 500w (5 million) rows; each row has about 1k of data.

> How often do you need to do this?
Once a day.

> example Hadoop map/reduce jobs in the examples folder
Thanks, I have seen the source code; it uses the Thrift API as the RecordReader to iterate over the rows, and I don't think that is a high-performance approach.

> you could look into Pig
Could you please describe Pig in more detail?

> So avoid that unless you really know what you're doing which is what ...
The point of that step is to purge the tombstones; another option is to use a map/reduce job to do the purging instead of a major compaction.

Best,
Rick.
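To make the input-format point above concrete, below is a minimal sketch of a Hadoop job that reads every row of a column family through ColumnFamilyInputFormat, modeled on the word_count example shipped with Cassandra 1.1. The keyspace, column family, address, and output path are placeholder assumptions, not values from this thread.

import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ScanColumnFamily
{
    public static class RowMapper
        extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable>
    {
        public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
            throws java.io.IOException, InterruptedException
        {
            // Called once per row; 'columns' holds the slice selected by the predicate below.
            context.write(new Text(ByteBufferUtil.string(key)), new LongWritable(columns.size()));
        }
    }

    public static void main(String[] args) throws Exception
    {
        Job job = new Job(new Configuration(), "scan-cf");
        job.setJarByClass(ScanColumnFamily.class);
        job.setMapperClass(RowMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/cf_scan_out")); // placeholder

        // Tell the input format where the cluster lives and what to read.
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "127.0.0.1");
        ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setInputPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "MyKeyspace", "MyColumnFamily");

        // Slice predicate: which columns of each row get handed to the mapper
        // (here: the first 1000 columns, in comparator order).
        SlicePredicate predicate = new SlicePredicate().setSlice_range(
            new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER,
                           ByteBufferUtil.EMPTY_BYTE_BUFFER, false, 1000));
        ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Hadoop splits the job by token range, so each mapper iterates only the rows local to one part of the ring; that is what makes this approach scale to the full column family.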
On Wed, Jan 30, 2013 at 1:15 PM, Michael Kjellman <mkjell...@barracuda.com> wrote:

How often do you need to do this? How many rows in your column families?

If it's not a frequent operation, you can just page the data n rows at a time using nothing special but C* and a driver. Or, if you need an entire column family to be your input, you can write a map/reduce job; there are example Hadoop map/reduce jobs in the examples folder included with Cassandra. Or, if you don't want to write an M/R job, you could look into Pig.

Your method sounds a bit crazy IMHO, and I'd definitely recommend against it. Better to let the database (C*) do its thing. If you're super worried about having more than one sstable, you can run a major compaction, but that's not recommended: afterwards it will take a while for a new sstable to grow big enough to merge with the one big sstable. So avoid that unless you really know what you're doing, which is what it sounds like you're proposing in point 3 ;)

From: "dong.yajun" <dongt...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, January 29, 2013 9:02 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Is there any way to fetch all data efficiently from a column family?

hey List,

I am considering a way to read all data from a column family. The following is my thinking:

1. Make a snapshot of a specific column family on all nodes in the cluster at the same time.
2. Copy these sstables from the Cassandra nodes to local disk.
3. Compact these sstables into a single one.
4. Parse the sstable into individual rows.

My problem is step 2: assuming the replication factor is 3, the amount of data I need to copy is 3 * (number of bytes of all rows in this column family). Are there any suggestions on this?

--
Rick Dong

--
Ric Dong
Newegg Ecommerce, MIS department
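For reference, the "page the data n rows at a time using nothing special but C* and a driver" approach suggested above is usually done with get_range_slices over the raw Thrift API: fetch a batch of rows, then start the next batch from the last key seen, skipping that key because the range bounds are inclusive. A minimal sketch against the Thrift-era (1.x) API; the host, port, keyspace, column family, and page size are placeholder assumptions:

import java.nio.ByteBuffer;
import java.util.List;

import org.apache.cassandra.thrift.*;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class PageAllRows
{
    public static void main(String[] args) throws Exception
    {
        TFramedTransport transport = new TFramedTransport(new TSocket("127.0.0.1", 9160));
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        transport.open();
        client.set_keyspace("MyKeyspace");

        ColumnParent parent = new ColumnParent("MyColumnFamily");
        // Up to 1000 columns per row, in comparator order.
        SlicePredicate predicate = new SlicePredicate().setSlice_range(
            new SliceRange(ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, 1000));

        ByteBuffer startKey = ByteBuffer.allocate(0); // empty = start of the ring
        int pageSize = 100;
        boolean skipFirst = false;                    // first page keeps its first row

        while (true)
        {
            KeyRange range = new KeyRange(pageSize)
                .setStart_key(startKey)
                .setEnd_key(ByteBuffer.allocate(0));
            List<KeySlice> page =
                client.get_range_slices(parent, predicate, range, ConsistencyLevel.ONE);

            for (int i = skipFirst ? 1 : 0; i < page.size(); i++)
            {
                KeySlice row = page.get(i);
                // process row.getKey() / row.getColumns() here
            }

            if (page.size() < pageSize)
                break;                                // last page reached
            startKey = page.get(page.size() - 1).key; // resume from the last key seen
            skipFirst = true;                         // that row was already processed
        }
        transport.close();
    }
}

Note that with RandomPartitioner the rows come back in token order, not key order, which is fine for a full scan: the pattern still visits every row exactly once.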