Thanks, Michael.

> How many rows in your column families?
About 5 million rows, and each row holds about 1 KB of data.

> How often do you need to do this?
once a day.
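
At that frequency, the paging approach you suggested could work for us.
Below is a rough sketch of what I have in mind using the Hector client
(just one driver option; the cluster/keyspace/CF names and the assumption
of string keys, column names, and values are placeholders):

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.OrderedRows;
import me.prettyprint.hector.api.beans.Row;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.RangeSlicesQuery;

public class PageAllRows {
    public static void main(String[] args) {
        StringSerializer se = StringSerializer.get();
        Cluster cluster = HFactory.getOrCreateCluster("TestCluster", "localhost:9160");
        Keyspace ks = HFactory.createKeyspace("Keyspace1", cluster);

        int pageSize = 1000;
        String startKey = "";        // empty key = start of the token range
        boolean firstPage = true;

        while (true) {
            RangeSlicesQuery<String, String, String> query =
                    HFactory.createRangeSlicesQuery(ks, se, se, se)
                            .setColumnFamily("MyCF")
                            .setKeys(startKey, "")                      // from startKey to the end
                            .setRange("", "", false, Integer.MAX_VALUE) // all columns of each row
                            .setRowCount(pageSize);
            OrderedRows<String, String, String> rows = query.execute().get();

            for (Row<String, String, String> row : rows) {
                // Every page after the first repeats the previous page's last
                // row as its first row, so skip that duplicate.
                if (!firstPage && row.getKey().equals(startKey)) {
                    continue;
                }
                // Columns are available via row.getColumnSlice().
                System.out.println(row.getKey());
            }

            if (rows.getCount() < pageSize) {
                break;                            // final (partial) page
            }
            startKey = rows.peekLast().getKey();  // resume from the last key seen
            firstPage = false;
        }
    }
}

The last key of each page becomes the start key of the next, and the
duplicated boundary row is skipped, so every row is processed exactly once.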

> example Hadoop map/reduce jobs in the examples folder
Thanks, I have looked at the source code. It uses the *Thrift API* as the
RecordReader to iterate over the rows, and I don't think that's a
high-performance approach.
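
For reference, here is roughly how the bundled word_count example wires up
its input; as far as I can tell, each input split still streams its rows
over Thrift, which is why I'm worried about throughput. This is only a
sketch based on the Cassandra 1.x Hadoop helpers; the keyspace/CF names,
addresses, and output path are placeholders:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ScanAllRows {
    // One map() call per row: the row key plus a sorted map of its columns.
    public static class RowMapper
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable> {
        @Override
        protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns,
                           Context context) throws IOException, InterruptedException {
            // Emit row key -> column count; a real job would parse the columns.
            context.write(new Text(ByteBufferUtil.string(key)),
                          new LongWritable(columns.size()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "scan-all-rows");
        job.setJarByClass(ScanAllRows.class);
        job.setMapperClass(RowMapper.class);
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/scan-all-rows"));

        Configuration conf = job.getConfiguration();
        ConfigHelper.setInputInitialAddress(conf, "localhost");
        ConfigHelper.setInputRpcPort(conf, "9160");
        ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(conf, "Keyspace1", "MyCF");

        // Ask for every column of every row.
        SlicePredicate predicate = new SlicePredicate().setSlice_range(
                new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER,
                               ByteBufferUtil.EMPTY_BYTE_BUFFER, false, Integer.MAX_VALUE));
        ConfigHelper.setInputSlicePredicate(conf, predicate);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}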

> you could look into Pig
Could you please give some more details about using Pig for this?

> So avoid that unless you really know what you're doing which is what ...
The point of that step is to purge the tombstones; another option would be
to use a map/reduce job to do the purging without major compactions.


Best

Rick.


On Wed, Jan 30, 2013 at 1:15 PM, Michael Kjellman
<mkjell...@barracuda.com> wrote:

> How often do you need to do this? How many rows in your column families?
>
> If it's not a frequent operation you can just page the data n rows at a
> time using nothing special but C* and a driver.
>
> Or another option is you can write a map/reduce job if you only need one
> cf to be your input. There are
> example Hadoop map/reduce jobs in the examples folder included with
> Cassandra. Or if you don't want to write a M/R job you could look into Pig.
>
> Your method sounds a bit crazy IMHO and I'd definitely recommend against
> it. Better to let the database (C*) do its thing. If you're super worried
> about more than 1 sstable you can do major compactions but that's not
> recommended as it will take a while to get a new sstable big enough to
> merge with the other big sstable. So avoid that unless you really know what
> you're doing, which is what it sounds like you're proposing in point 3 ;)
>
> From: "dong.yajun" <dongt...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Tuesday, January 29, 2013 9:02 PM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Is there any way to fetch all data efficiently from a column
> family?
>
> hey List,
>
> I'm considering a way to read all the data from a column family; here are
> my thoughts:
>
> 1. make a snapshot of the specific column family on all nodes in the
> cluster at the same time,
>
> 2. copy those sstables from the Cassandra nodes to local disk,
>
> 3. compact those sstables into a single one,
>
> 4. parse the sstable into individual rows.
>
> My problem is step 2: assuming the replication factor is 3, the amount of
> data I need to copy is (3 * the total bytes of all rows in this column
> family). Are there any proposals on this?
>
> --
> *Rick Dong*
>
>


-- 
*Rick Dong*
Newegg Ecommerce, MIS department
