Hi Yang,

You could also use Hadoop (e.g. Brisk) and run a MapReduce job or Hive query to extract and summarize/renormalize the data into whatever format you like.
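For illustration, here is a rough sketch of that kind of job, written against the ColumnFamilyInputFormat in Cassandra's Hadoop support. The keyspace, column family, contact host, and output path below are placeholders, the "summarization" is just a column count per row, and the ConfigHelper method names are the 0.8-era ones, so adjust for your version; treat it as a sketch rather than a drop-in job.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class DailyExtract {
    // One map() call per row: key plus the sliced columns. Replace the body
    // with whatever extraction/renormalization you actually need.
    public static class ExtractMapper
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable> {
        @Override
        protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text(ByteBufferUtil.string(key)), new LongWritable(columns.size()));
        }
    }

    // Sums the per-row counts; the reduce step is where the dedup/merge happens.
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> vals, Context ctx)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable v : vals)
                total += v.get();
            ctx.write(key, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "daily-extract");
        job.setJarByClass(DailyExtract.class);
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        job.setMapperClass(ExtractMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/extracts/daily"));

        Configuration conf = job.getConfiguration();
        // Placeholder keyspace/CF and contact details; method names vary by version.
        ConfigHelper.setInputColumnFamily(conf, "MyKeyspace", "MyColumnFamily");
        ConfigHelper.setRpcPort(conf, "9160");
        ConfigHelper.setInitialAddress(conf, "cassandra-host");
        ConfigHelper.setPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
        // Slice the whole row, 1000 columns at a time.
        SlicePredicate predicate = new SlicePredicate().setSlice_range(
            new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER, ByteBufferUtil.EMPTY_BYTE_BUFFER, false, 1000));
        ConfigHelper.setInputSlicePredicate(conf, predicate);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}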
If you use sstable2json you have to run it on every data file on every node, then deduplicate/merge all the output across machines, which is what a MapReduce job does for you anyway.

Our data flow is to take backups of a production cluster, restore a backup to a different cluster running Hadoop, then run our point-in-time data extraction for ETL processing by the BI team. The backup/restore gives a frozen-in-time (consistent to within a second or so) cluster for extraction. Running live with Brisk means you are running your extraction over a moving target.

Adrian

On Sun, May 22, 2011 at 11:14 PM, Yang <teddyyyy...@gmail.com> wrote:
> Thanks Jonathan.
>
> On Sun, May 22, 2011 at 9:56 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>> I'd modify SSTableExport.serializeRow (the sstable2json class) to
>> output to whatever system you are targeting.
>>
>> On Sun, May 22, 2011 at 11:19 PM, Yang <teddyyyy...@gmail.com> wrote:
>>> Let's say I periodically (daily) need to dump out the contents of my
>>> Cassandra DB and do an import into Oracle, or some other custom data
>>> store. Is there a way to do it?
>>>
>>> I checked that you can do multi-get(), but you probably can't pass the
>>> entire key domain into the API, because the entire DB would be returned
>>> on a single Thrift call and would probably overflow the API. Plus,
>>> multi-get underneath just sends out per-key lookups one by one, while I
>>> really don't care which key corresponds to which result; a simple
>>> scraping of the underlying SSTables would be perfect, because I could
>>> take advantage of file cache coherency as I read down the file.
>>>
>>> Thanks
>>> Yang
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
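For completeness, a minimal sketch of the output half of Jonathan's suggestion, i.e. the kind of sink a modified SSTableExport.serializeRow could write to instead of printing JSON. It is plain JDBC; the class name, target table, and column layout are made up for illustration, and the Cassandra-side hookup is left to the modified exporter.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical sink for a modified sstable2json export; swap the INSERT for
// whatever your target store (Oracle or otherwise) expects.
public class JdbcRowSink {
    private final Connection conn;
    private final PreparedStatement insert;

    public JdbcRowSink(String jdbcUrl, String user, String password) throws SQLException {
        conn = DriverManager.getConnection(jdbcUrl, user, password);
        conn.setAutoCommit(false);
        insert = conn.prepareStatement(
            "INSERT INTO cassandra_dump (row_key, column_name, column_value, col_timestamp) " +
            "VALUES (?, ?, ?, ?)");
    }

    // Call once per column as the exporter walks a row, instead of emitting JSON.
    public void write(String rowKey, String columnName, String value, long timestamp)
            throws SQLException {
        insert.setString(1, rowKey);
        insert.setString(2, columnName);
        insert.setString(3, value);
        insert.setLong(4, timestamp);
        insert.addBatch();
    }

    // Flush periodically (e.g. every few thousand columns) and once at the end.
    public void flush() throws SQLException {
        insert.executeBatch();
        conn.commit();
    }

    public void close() throws SQLException {
        flush();
        insert.close();
        conn.close();
    }
}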