And finally, to make wide rows with C* and Hadoop even better, these problems
have already been addressed by tickets such as the following (not an exhaustive list):

https://issues.apache.org/jira/browse/CASSANDRA-3264
https://issues.apache.org/jira/browse/CASSANDRA-2878

And a nicer, more up-to-date doc for the 1.1 branch from DataStax:
http://www.datastax.com/docs/1.1/cluster_architecture/hadoop_integration

From: Michael Kjellman <mkjell...@barracuda.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, January 29, 2013 10:36 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
<user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Subject: Re: Is there any way to fetch all data efficiently from a column 
family?

Yes, wide rows, but that doesn't seem horrible by any means. People have gotten by
with Thrift for many, many years in the community. If you are running this once
a day, latency doesn't sound like it should be a major concern, and I doubt the
protocol is going to be your primary bottleneck.

To answer your question about describing pig:
http://pig.apache.org -- "Apache Pig is a platform for analyzing large data 
sets that consists of a high-level language for expressing data analysis 
programs, coupled with infrastructure for evaluating these programs. The 
salient property of Pig programs is that their structure is amenable to 
substantial parallelization, which in turns enables them to handle very large 
data sets."

Pretty much, Pig lets you write Pig Latin to create MapReduce programs
without writing an actual Java MapReduce program.
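
For example, you can even drive Pig Latin from a small Java program. This is only
a sketch: it assumes the CassandraStorage loader from Cassandra's contrib/pig is
on the classpath, and "Keyspace1"/"Standard1" and the output path are placeholders,
not anything from your cluster:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

// Sketch: count all rows of a column family with Pig Latin, embedded via PigServer.
public class PigRowCount {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        // LOAD pulls rows out of Cassandra through its Hadoop input format.
        pig.registerQuery("rows = LOAD 'cassandra://Keyspace1/Standard1' "
                + "USING org.apache.cassandra.hadoop.pig.CassandraStorage();");
        pig.registerQuery("grouped = GROUP rows ALL;");
        pig.registerQuery("counted = FOREACH grouped GENERATE COUNT(rows);");
        pig.store("counted", "/tmp/rowcount");  // materializes the count on HDFS
    }
}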

Here is a rather old wiki article (which really needs to be updated) about the
various Hadoop support built into C*:
http://wiki.apache.org/cassandra/HadoopSupport

On your last point, compaction deals with tombstones yes but generally you only 
run minor compactions. A major compaction says, take every sstable for this cf 
and make one MASSIVE sstable from all the little sstables. This is different 
than standard C* operations. Map/Reduce doesn't purge anything and has nothing 
to do with compactions. It is just a somewhat sane idea I thought of to let you 
iterate over a large amount of data stored in C*, and conveniently C* provides 
Input and Output formats to Hadoop so you can do fun things like iterate over 
500w rows with 1k columns each.
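
Just to make that concrete, a mapper fed by ColumnFamilyInputFormat looks roughly
like the sketch below. It's untested and loosely modeled on the word_count example
that ships with Cassandra; the output types and the row-size logic are just for
illustration:

import java.nio.ByteBuffer;
import java.util.SortedMap;
import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each map() call receives one row: its key plus a sorted map of its columns,
// so iterating the whole column family falls out of the framework for free.
public class RowSizeMapper
        extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable> {
    @Override
    protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns,
                       Context context) throws java.io.IOException, InterruptedException {
        long bytes = 0;
        for (IColumn column : columns.values())
            bytes += column.value().remaining();  // total column data in this row
        context.write(new Text(ByteBufferUtil.bytesToHex(key)), new LongWritable(bytes));
    }
}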

Honestly, the best thing you can do is benchmark Hadoop and see how it will
work for your workload and specific project requirements.

Best,
Michael

From: "dong.yajun" <dongt...@gmail.com<mailto:dongt...@gmail.com>>
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
<user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Date: Tuesday, January 29, 2013 10:11 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
<user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Subject: Re: Is there any way to fetch all data efficiently from a column 
family?

Thanks Michael.

> How many rows in your column families?
about 5 million rows; each row has about 1 KB of data.

> How often do you need to do this?
once a day.

> example Hadoop map/reduce jobs in the examples folder
thanks, I have seen the source code; it uses the Thrift API as the RecordReader
to iterate the rows. I don't think that's a high-performance method.

> you could look into Pig
could you please describe more details in Pig?

> So avoid that unless you really know what you're doing which is what ...
the step is to purge the tombstones; another option is to use a map/reduce job
to do the purging without major compactions.


Best

Rick.

On Wed, Jan 30, 2013 at 1:15 PM, Michael Kjellman
<mkjell...@barracuda.com> wrote:
How often do you need to do this? How many rows in your column families?

If it's not a frequent operation, you can just page through the data n rows at
a time using nothing special but C* and a driver.
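
A rough sketch of that paging loop against the raw Thrift client follows; the
host, keyspace, and cf names ("Keyspace1"/"Standard1") are placeholders, and a
higher-level driver such as Hector wraps this same get_range_slices call:

import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;
import org.apache.cassandra.thrift.*;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

// Sketch: page through every row of a column family, 100 rows per call.
public class RangeScan {
    public static void main(String[] args) throws Exception {
        TTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
        transport.open();
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        client.set_keyspace("Keyspace1");  // placeholder keyspace

        // Ask for up to 1000 columns per row, unfiltered.
        SlicePredicate predicate = new SlicePredicate().setSlice_range(
                new SliceRange(ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, 1000));

        KeyRange range = new KeyRange(100);  // page size in rows
        range.setStart_key(new byte[0]);
        range.setEnd_key(new byte[0]);
        byte[] lastKey = null;

        while (true) {
            List<KeySlice> page = client.get_range_slices(
                    new ColumnParent("Standard1"), predicate, range, ConsistencyLevel.ONE);
            for (KeySlice row : page) {
                if (lastKey != null && Arrays.equals(lastKey, row.getKey()))
                    continue;  // each page's first row repeats the previous last row
                // process row.getKey() / row.getColumns() here ...
            }
            if (page.size() < 100)
                break;  // a short page means we reached the end of the ring
            lastKey = page.get(page.size() - 1).getKey();
            range.setStart_key(lastKey);  // resume from the last key seen
        }
        transport.close();
    }
}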

Another option is to write a map/reduce job, if you only need one cf to be your
input. There are example Hadoop map/reduce jobs in the examples folder included
with Cassandra. Or, if you don't want to write an M/R job, you could look into Pig.
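
For a sense of the wiring involved, job setup against a column family is mostly
ConfigHelper calls, something like the sketch below (untested, modeled on those
examples; the address and keyspace/cf names are placeholders):

import java.nio.ByteBuffer;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: point a Hadoop job at a Cassandra column family via ConfigHelper.
public class ScanJob {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "cf-scan");
        job.setJarByClass(ScanJob.class);
        job.setInputFormatClass(ColumnFamilyInputFormat.class);

        Configuration conf = job.getConfiguration();
        ConfigHelper.setInputInitialAddress(conf, "localhost");  // any live node
        ConfigHelper.setInputRpcPort(conf, "9160");
        ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(conf, "Keyspace1", "Standard1");

        // Which columns each map task sees; here, the first 1000 of every row.
        SlicePredicate predicate = new SlicePredicate().setSlice_range(
                new SliceRange(ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, 1000));
        ConfigHelper.setInputSlicePredicate(conf, predicate);

        // set mapper/reducer/output classes here, then:
        job.waitForCompletion(true);
    }
}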

Your method sounds a bit crazy IMHO and I'd definitely recommend against it.
Better to let the database (C*) do its thing. If you're super worried about
having more than one sstable, you can do a major compaction, but that's not
recommended, as it will then take a while to accumulate a new sstable big enough
to merge with the one big sstable. So avoid that unless you really know what
you're doing, which is what it sounds like you're proposing in point 3 ;)

From: "dong.yajun" <dongt...@gmail.com<mailto:dongt...@gmail.com>>
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
<user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Date: Tuesday, January 29, 2013 9:02 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
<user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Subject: Is there any way to fetch all data efficiently from a column family?

hey List,

I'm considering a way to read all data from a column family; here are my
thoughts:

1. Make a snapshot of the column family on all nodes in the cluster at the same
time.

2. Copy these sstables from the Cassandra nodes to local disk.

3. Compact these sstables into a single one.

4. Parse that sstable into individual rows.

My problem is with step 2: assuming the replication factor is 3, the amount of
data I need to copy is 3 * (the number of bytes of all rows in this column
family), which for 5 million rows of about 1 KB each is roughly 15 GB. Are
there any proposals on this?

--
Rick Dong




--
Ric Dong
Newegg Ecommerce, MIS department
