Thanks Michael. I will put together a benchmark using the Hadoop Map/Reduce examples in our cluster, and I will share any valuable findings. :)
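Roughly what I plan to run: a minimal, untested skeleton modeled on the word_count example shipped with Cassandra (1.1-era classes; the keyspace, column family, and output path below are placeholders):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CfRowCount {

    // Each map() call gets one row: the row key plus a slice of its columns.
    // Here we just emit (row key, column count); assumes UTF-8 row keys.
    public static class RowMapper
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable> {
        @Override
        protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text(ByteBufferUtil.string(key.duplicate())),
                      new LongWritable(columns.size()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "cf-row-count");
        job.setJarByClass(CfRowCount.class);
        job.setMapperClass(RowMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setInputFormatClass(ColumnFamilyInputFormat.class);

        // Point the input format at the cluster and column family (placeholder names).
        ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "127.0.0.1");
        ConfigHelper.setInputPartitioner(job.getConfiguration(),
                "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "MyKeyspace", "MyCf");

        // Ask for all columns of each row so wide rows stream in as one slice.
        SlicePredicate predicate = new SlicePredicate().setSlice_range(
                new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER,
                               ByteBufferUtil.EMPTY_BYTE_BUFFER, false, Integer.MAX_VALUE));
        ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);

        FileOutputFormat.setOutputPath(job, new Path("/tmp/cf-row-count"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}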
Best,

On Wed, Jan 30, 2013 at 2:39 PM, Michael Kjellman <mkjell...@barracuda.com> wrote:

> And finally, to make wide rows with C* and Hadoop even better, these
> problems have already been solved by tickets such as (not inclusive):
>
> https://issues.apache.org/jira/browse/CASSANDRA-3264
> https://issues.apache.org/jira/browse/CASSANDRA-2878
>
> And a nicer, more up-to-date doc for the 1.1 branch from Datastax:
> http://www.datastax.com/docs/1.1/cluster_architecture/hadoop_integration
>
> From: Michael Kjellman <mkjell...@barracuda.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Tuesday, January 29, 2013 10:36 PM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Is there any way to fetch all data efficiently from a column family?
>
> Yes, wide rows, but that doesn't seem horrible by any means. People have
> gotten by with Thrift for many, many years in the community. If you are
> running this once a day, latency doesn't sound like it should be a major
> concern, and I doubt the protocol is going to be your primary bottleneck.
>
> To answer your question about describing Pig:
> http://pig.apache.org -- "Apache Pig is a platform for analyzing large
> data sets that consists of a high-level language for expressing data
> analysis programs, coupled with infrastructure for evaluating these
> programs. The salient property of Pig programs is that their structure is
> amenable to substantial parallelization, which in turn enables them to
> handle very large data sets."
>
> Pretty much, Pig lets you write Pig Latin to create Map/Reduce programs
> without writing an actual Java Map/Reduce program.
>
> Here is a really old wiki article (which badly needs updating) about the
> various Hadoop support built into C*:
> http://wiki.apache.org/cassandra/HadoopSupport
>
> On your last point: compaction does deal with tombstones, yes, but
> generally you only run minor compactions. A major compaction says: take
> every sstable for this cf and make one MASSIVE sstable from all the
> little sstables. That is different from standard C* operation. Map/Reduce
> doesn't purge anything and has nothing to do with compaction. It is just
> a somewhat sane idea I thought of to let you iterate over a large amount
> of data stored in C*, and conveniently C* provides Input and Output
> formats for Hadoop so you can do fun things like iterate over 500w rows
> with 1k of data each.
>
> Honestly, the best thing you can do is benchmark Hadoop and see how it
> will work for your workload and specific project requirements.
>
> Best,
> Michael
>
> From: "dong.yajun" <dongt...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Tuesday, January 29, 2013 10:11 PM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Is there any way to fetch all data efficiently from a column family?
>
> Thanks Michael.
>
> > How many rows in your column families?
> About 500w (~5 million) rows, and each row has about 1 KB of data.
>
> > How often do you need to do this?
> Once a day.
>
> > example Hadoop map/reduce jobs in the examples folder
> Thanks, I have read the source code; it uses the Thrift API in the
> RecordReader to iterate over the rows, and I don't think that is a
> high-performance method.
>
> > you could look into Pig
> Could you please describe Pig in more detail?
>
> > So avoid that unless you really know what you're doing which is what ...
> The step is to purge the tombstones; another option is to use a
> map/reduce job to do the purging without running major compactions.
>
> Best,
> Rick.
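(For context on the iteration I mentioned above: whether it goes through the Hadoop RecordReader or a plain driver, fetching every row over Thrift bottoms out in a get_range_slices paging loop, roughly like the untested sketch below. Host, port, keyspace, column family, and page sizes are placeholders.)

import java.nio.ByteBuffer;
import java.util.List;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.KeyRange;
import org.apache.cassandra.thrift.KeySlice;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class PageAllRows {
    public static void main(String[] args) throws Exception {
        final int pageSize = 100; // rows per round trip; tune for your row width

        TFramedTransport transport = new TFramedTransport(new TSocket("127.0.0.1", 9160));
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        transport.open();
        client.set_keyspace("MyKeyspace"); // placeholder keyspace

        ColumnParent parent = new ColumnParent("MyCf"); // placeholder column family
        SlicePredicate predicate = new SlicePredicate().setSlice_range(
                new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER,
                               ByteBufferUtil.EMPTY_BYTE_BUFFER, false, 1000));

        ByteBuffer startKey = ByteBufferUtil.EMPTY_BYTE_BUFFER;
        boolean first = true;
        while (true) {
            // Walk the ring in token order, starting each page from the last key seen.
            KeyRange range = new KeyRange(pageSize)
                    .setStart_key(startKey)
                    .setEnd_key(ByteBufferUtil.EMPTY_BYTE_BUFFER);
            List<KeySlice> page = client.get_range_slices(
                    parent, predicate, range, ConsistencyLevel.ONE);
            for (int i = 0; i < page.size(); i++) {
                // After the first page, row 0 repeats the previous page's last key.
                if (!first && i == 0) continue;
                KeySlice row = page.get(i);
                // ... process row.key and row.getColumns() here ...
            }
            if (page.size() < pageSize) break; // ran off the end of the scan
            startKey = page.get(page.size() - 1).key;
            first = false;
        }
        transport.close();
    }
}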
> On Wed, Jan 30, 2013 at 1:15 PM, Michael Kjellman <mkjell...@barracuda.com> wrote:
>
>> How often do you need to do this? How many rows in your column families?
>>
>> If it's not a frequent operation, you can just page the data n rows at a
>> time using nothing special but C* and a driver.
>>
>> Another option is to write a map/reduce job if you need an entire cf to
>> be your input. There are example Hadoop map/reduce jobs in the examples
>> folder included with Cassandra. Or, if you don't want to write a M/R
>> job, you could look into Pig.
>>
>> Your method sounds a bit crazy IMHO and I'd definitely recommend against
>> it. Better to let the database (C*) do its thing. If you're super
>> worried about having more than 1 sstable, you can do major compactions,
>> but that's not recommended, as it will take a while to accumulate a new
>> sstable big enough to merge with the other big sstable. So avoid that
>> unless you really know what you're doing, which is what it sounds like
>> you're proposing in point 3 ;)
>>
>> From: "dong.yajun" <dongt...@gmail.com>
>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Date: Tuesday, January 29, 2013 9:02 PM
>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Subject: Is there any way to fetch all data efficiently from a column family?
>>
>> Hey List,
>>
>> I am considering a way to read all data from a column family; the
>> following is my thinking:
>>
>> 1. Make a snapshot of a particular column family on all nodes of the
>> cluster at the same time.
>>
>> 2. Copy those sstables from the Cassandra nodes to local disk.
>>
>> 3. Compact those sstables into a single one.
>>
>> 4. Parse the sstable into individual rows.
>>
>> My problem is step 2: assuming the replication factor is 3, the amount
>> of data I need to copy is (3 * the number of bytes of all rows in this
>> column family). Are there any proposals on this?
>>
>> --
>> *Rick Dong*
>>
>
>
> --
> *Ric Dong*
> Newegg Ecommerce, MIS department

--
*Ric Dong*
Newegg Ecommerce, MIS department
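P.S. A back-of-the-envelope check on step 2 with the numbers above: 500w rows at ~1 KB each is about 5 GB of logical data, so copying the sstables from every replica at RF 3 moves roughly 3 * 5 GB = 15 GB per run. And for step 4, the parsing already exists in the Cassandra tree as the sstable2json tool (class org.apache.cassandra.tools.SSTableExport). A minimal, untested sketch that drives it directly; the sstable path is a placeholder, and the class needs the node's cassandra.yaml visible so the column family's comparators resolve:

import org.apache.cassandra.tools.SSTableExport;

public class DumpRows {
    public static void main(String[] args) throws Exception {
        // Equivalent to running bin/sstable2json on one copied sstable:
        // prints every row (key plus its columns) as JSON on stdout.
        SSTableExport.main(new String[] {
                "/path/to/snapshot/MyCf-hd-1-Data.db" // placeholder path
        });
    }
}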