Thanks Michael. I will put together a benchmark using the Hadoop Map/Reduce examples in our cluster, and I will share any valuable findings. :)
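Roughly what I plan to run: a minimal, untested skeleton modeled on the word_count example shipped with Cassandra (1.1-era classes; the keyspace, column family, and output path below are placeholders):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CfRowCount {

    // Each map() call gets one row: the row key plus a slice of its columns.
    // Here we just emit (row key, column count); assumes UTF-8 row keys.
    public static class RowMapper
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable> {
        @Override
        protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text(ByteBufferUtil.string(key.duplicate())),
                      new LongWritable(columns.size()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "cf-row-count");
        job.setJarByClass(CfRowCount.class);
        job.setMapperClass(RowMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setInputFormatClass(ColumnFamilyInputFormat.class);

        // Point the input format at the cluster and column family (placeholder names).
        ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "127.0.0.1");
        ConfigHelper.setInputPartitioner(job.getConfiguration(),
                "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "MyKeyspace", "MyCf");

        // Ask for all columns of each row so wide rows stream in as one slice.
        SlicePredicate predicate = new SlicePredicate().setSlice_range(
                new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER,
                               ByteBufferUtil.EMPTY_BYTE_BUFFER, false, Integer.MAX_VALUE));
        ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);

        FileOutputFormat.setOutputPath(job, new Path("/tmp/cf-row-count"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}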
Best,

On Wed, Jan 30, 2013 at 2:39 PM, Michael Kjellman <mkjell...@barracuda.com> wrote:

> And finally, to make wide rows with C* and Hadoop even better, these
> problems have already been solved by tickets such as (not inclusive):
>
> https://issues.apache.org/jira/browse/CASSANDRA-3264
> https://issues.apache.org/jira/browse/CASSANDRA-2878
>
> And a nicer, more up-to-date doc for the 1.1 branch from Datastax:
> http://www.datastax.com/docs/1.1/cluster_architecture/hadoop_integration
>
> From: Michael Kjellman <mkjell...@barracuda.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Tuesday, January 29, 2013 10:36 PM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Is there any way to fetch all data efficiently from a column family?
>
> Yes, wide rows, but that doesn't seem horrible by any means. People have
> gotten by with Thrift for many, many years in the community. If you are
> running this once a day, latency doesn't sound like it should be a major
> concern, and I doubt the protocol is going to be your primary bottleneck.
>
> To answer your question about describing Pig:
> http://pig.apache.org -- "Apache Pig is a platform for analyzing large
> data sets that consists of a high-level language for expressing data
> analysis programs, coupled with infrastructure for evaluating these
> programs. The salient property of Pig programs is that their structure is
> amenable to substantial parallelization, which in turn enables them to
> handle very large data sets."
>
> Pretty much, Pig lets you write Pig Latin to create Map/Reduce programs
> without writing an actual Java Map/Reduce program.
>
> Here is a really old wiki article (which badly needs updating) about the
> various Hadoop support built into C*:
> http://wiki.apache.org/cassandra/HadoopSupport
>
> On your last point: compaction does deal with tombstones, yes, but
> generally you only run minor compactions. A major compaction says: take
> every sstable for this cf and make one MASSIVE sstable from all the
> little sstables. That is different from standard C* operation. Map/Reduce
> doesn't purge anything and has nothing to do with compaction. It is just
> a somewhat sane idea I thought of to let you iterate over a large amount
> of data stored in C*, and conveniently C* provides Input and Output
> formats for Hadoop so you can do fun things like iterate over 500w rows
> with 1k of data each.
>
> Honestly, the best thing you can do is benchmark Hadoop and see how it
> will work for your workload and specific project requirements.
>
> Best,
> Michael
>
> From: "dong.yajun" <dongt...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Tuesday, January 29, 2013 10:11 PM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Is there any way to fetch all data efficiently from a column family?
>
> Thanks Michael.
>
> > How many rows in your column families?
> About 500w (~5 million) rows, and each row has about 1 KB of data.
>
> > How often do you need to do this?
> Once a day.
>
> > example Hadoop map/reduce jobs in the examples folder
> Thanks, I have read the source code; it uses the Thrift API in the
> RecordReader to iterate over the rows, and I don't think that is a
> high-performance method.
>
> > you could look into Pig
> Could you please describe Pig in more detail?
>
> > So avoid that unless you really know what you're doing which is what ...
> The step is to purge the tombstones; another option is to use a
> map/reduce job to do the purging without running major compactions.
>
> Best,
> Rick.
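(For context on the iteration I mentioned above: whether it goes through the Hadoop RecordReader or a plain driver, fetching every row over Thrift bottoms out in a get_range_slices paging loop, roughly like the untested sketch below. Host, port, keyspace, column family, and page sizes are placeholders.)

import java.nio.ByteBuffer;
import java.util.List;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.KeyRange;
import org.apache.cassandra.thrift.KeySlice;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class PageAllRows {
    public static void main(String[] args) throws Exception {
        final int pageSize = 100; // rows per round trip; tune for your row width

        TFramedTransport transport = new TFramedTransport(new TSocket("127.0.0.1", 9160));
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        transport.open();
        client.set_keyspace("MyKeyspace"); // placeholder keyspace

        ColumnParent parent = new ColumnParent("MyCf"); // placeholder column family
        SlicePredicate predicate = new SlicePredicate().setSlice_range(
                new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER,
                               ByteBufferUtil.EMPTY_BYTE_BUFFER, false, 1000));

        ByteBuffer startKey = ByteBufferUtil.EMPTY_BYTE_BUFFER;
        boolean first = true;
        while (true) {
            // Walk the ring in token order, starting each page from the last key seen.
            KeyRange range = new KeyRange(pageSize)
                    .setStart_key(startKey)
                    .setEnd_key(ByteBufferUtil.EMPTY_BYTE_BUFFER);
            List<KeySlice> page = client.get_range_slices(
                    parent, predicate, range, ConsistencyLevel.ONE);
            for (int i = 0; i < page.size(); i++) {
                // After the first page, row 0 repeats the previous page's last key.
                if (!first && i == 0) continue;
                KeySlice row = page.get(i);
                // ... process row.key and row.getColumns() here ...
            }
            if (page.size() < pageSize) break; // ran off the end of the scan
            startKey = page.get(page.size() - 1).key;
            first = false;
        }
        transport.close();
    }
}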
> On Wed, Jan 30, 2013 at 1:15 PM, Michael Kjellman <mkjell...@barracuda.com> wrote:
>
>> How often do you need to do this? How many rows in your column families?
>>
>> If it's not a frequent operation, you can just page the data n rows at a
>> time using nothing special but C* and a driver.
>>
>> Another option is to write a map/reduce job if you need an entire cf to
>> be your input. There are example Hadoop map/reduce jobs in the examples
>> folder included with Cassandra. Or, if you don't want to write a M/R
>> job, you could look into Pig.
>>
>> Your method sounds a bit crazy IMHO and I'd definitely recommend against
>> it. Better to let the database (C*) do its thing. If you're super
>> worried about having more than 1 sstable, you can do major compactions,
>> but that's not recommended, as it will take a while to accumulate a new
>> sstable big enough to merge with the other big sstable. So avoid that
>> unless you really know what you're doing, which is what it sounds like
>> you're proposing in point 3 ;)
>>
>> From: "dong.yajun" <dongt...@gmail.com>
>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Date: Tuesday, January 29, 2013 9:02 PM
>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Subject: Is there any way to fetch all data efficiently from a column family?
>>
>> Hey List,
>>
>> I am considering a way to read all data from a column family; the
>> following is my thinking:
>>
>> 1. Make a snapshot of a particular column family on all nodes of the
>> cluster at the same time.
>>
>> 2. Copy those sstables from the Cassandra nodes to local disk.
>>
>> 3. Compact those sstables into a single one.
>>
>> 4. Parse the sstable into individual rows.
>>
>> My problem is step 2: assuming the replication factor is 3, the amount
>> of data I need to copy is (3 * the number of bytes of all rows in this
>> column family). Are there any proposals on this?
>>
>> --
>> *Rick Dong*
>>
>
>
> --
> *Ric Dong*
> Newegg Ecommerce, MIS department

--
*Ric Dong*
Newegg Ecommerce, MIS department
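P.S. A back-of-the-envelope check on step 2 with the numbers above: 500w rows at ~1 KB each is about 5 GB of logical data, so copying the sstables from every replica at RF 3 moves roughly 3 * 5 GB = 15 GB per run. And for step 4, the parsing already exists in the Cassandra tree as the sstable2json tool (class org.apache.cassandra.tools.SSTableExport). A minimal, untested sketch that drives it directly; the sstable path is a placeholder, and the class needs the node's cassandra.yaml visible so the column family's comparators resolve:

import org.apache.cassandra.tools.SSTableExport;

public class DumpRows {
    public static void main(String[] args) throws Exception {
        // Equivalent to running bin/sstable2json on one copied sstable:
        // prints every row (key plus its columns) as JSON on stdout.
        SSTableExport.main(new String[] {
                "/path/to/snapshot/MyCf-hd-1-Data.db" // placeholder path
        });
    }
}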