Hi Yang,

You could also use Hadoop (i.e. Brisk) and run a MapReduce job or Hive
query to extract and summarize/renormalize the data into whatever
format you like.
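
For example, the extraction can be a small map-only Hadoop job that
reads the column family through ColumnFamilyInputFormat and writes one
delimited line per column. This is a rough sketch against the 0.8-era
Hadoop support; the keyspace, column family, and output path are made
up, so check the word_count example that ships with Cassandra for the
exact ConfigHelper calls in your version:

import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ColumnFamilyDump
{
    public static class DumpMapper
        extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, Text>
    {
        @Override
        protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
            throws java.io.IOException, InterruptedException
        {
            // one output line per (row key, column name, column value);
            // assumes UTF-8 keys/values -- substitute your own decoding otherwise
            Text rowKey = new Text(ByteBufferUtil.string(key));
            for (IColumn column : columns.values())
                context.write(rowKey, new Text(ByteBufferUtil.string(column.name())
                                               + "\t" + ByteBufferUtil.string(column.value())));
        }
    }

    public static void main(String[] args) throws Exception
    {
        Job job = new Job();
        job.setJarByClass(ColumnFamilyDump.class);
        job.setMapperClass(DumpMapper.class);
        job.setNumReduceTasks(0); // map-only: just extract, no aggregation
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path("/dumps/MyKeyspace")); // hypothetical path

        ConfigHelper.setRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setInitialAddress(job.getConfiguration(), "localhost");
        ConfigHelper.setPartitioner(job.getConfiguration(),
                                    "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "MyKeyspace", "MyColumnFamily");
        // ask for every column of every row; bound the count if your rows are very wide
        SlicePredicate predicate = new SlicePredicate().setSlice_range(
            new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER, ByteBufferUtil.EMPTY_BYTE_BUFFER,
                           false, Integer.MAX_VALUE));
        ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

From there the BI team can pull the files straight off HDFS.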

If you use sstable2json, you have to run it on every SSTable file on
every node and then deduplicate/merge all the output across machines,
which is what MapReduce does for you anyway.

Our data flow is to take backups of the production cluster, restore a
backup to a separate cluster running Hadoop, and then run our
point-in-time data extraction there for ETL processing by the BI team.
The backup/restore gives a frozen-in-time (consistent to within a
second or so) cluster to extract from; running the extraction live
against Brisk means you are running it over a moving target.
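
For completeness, Jonathan's suggestion below (modifying
SSTableExport.serializeRow, the sstable2json class) boils down to
opening each data file, walking it with a scanner, and writing each row
in your target format instead of JSON. Very rough sketch from memory of
the 0.8 tree; treat the scanner/iterator names as assumptions and
verify them against the SSTableExport source before using this:

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.io.sstable.Descriptor;
import org.apache.cassandra.io.sstable.SSTableIdentityIterator;
import org.apache.cassandra.io.sstable.SSTableReader;
import org.apache.cassandra.io.sstable.SSTableScanner;
import org.apache.cassandra.utils.ByteBufferUtil;

public class SSTableToTsv
{
    public static void main(String[] args) throws Exception
    {
        // args[0] is the path to a -Data.db file. Like sstable2json, this
        // needs cassandra.yaml on the classpath so the schema/partitioner
        // can be loaded.
        SSTableReader reader = SSTableReader.open(Descriptor.fromFilename(args[0]));
        SSTableScanner scanner = reader.getDirectScanner(); // getScanner(bufferSize) on older trees
        while (scanner.hasNext())
        {
            SSTableIdentityIterator row = (SSTableIdentityIterator) scanner.next();
            String key = ByteBufferUtil.string(row.getKey().key);
            while (row.hasNext())
            {
                IColumn column = (IColumn) row.next();
                // emit key\tname\tvalue instead of sstable2json's JSON;
                // assumes UTF-8 names/values
                System.out.println(key + "\t" + ByteBufferUtil.string(column.name())
                                   + "\t" + ByteBufferUtil.string(column.value()));
            }
        }
        scanner.close();
    }
}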

Adrian

On Sun, May 22, 2011 at 11:14 PM, Yang <teddyyyy...@gmail.com> wrote:
> Thanks Jonathan.
>
> On Sun, May 22, 2011 at 9:56 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>> I'd modify SSTableExport.serializeRow (the sstable2json class) to
>> output to whatever system you are targeting.
>>
>> On Sun, May 22, 2011 at 11:19 PM, Yang <teddyyyy...@gmail.com> wrote:
>>> Let's say that periodically (daily) I need to dump out the contents
>>> of my Cassandra DB and import them into Oracle or some other custom
>>> data store. Is there a way to do that?
>>>
>>> I checked that you can do multiget(), but you probably can't pass
>>> the entire key domain into the API: the entire DB would be returned
>>> on a single Thrift call and would probably overflow it. Plus,
>>> multiget underneath just sends out per-key lookups one by one, while
>>> I really don't care which key corresponds to which result; a simple
>>> scrape of the underlying SSTables would be perfect, because I could
>>> take advantage of the OS file cache as I read sequentially down each
>>> file.
>>>
>>>
>>> Thanks
>>> Yang
>>>
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>>
>
