Three ways to do this:

1. The client app does a get for every key on every row: lots of small
   network operations.

2. Brisk/Hive does a select(*), which is sent to each node to map; the
   Hadoop network shuffle then merges the results.

3. Write your own code to merge all the SSTables across the cluster.
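If you want to skip Hive and write option 2 as a MapReduce job
directly, a rough, untested sketch using Cassandra's
ColumnFamilyInputFormat would look like the following (class names are
from the 0.8 tree; the keyspace "MyKeyspace", column family "MyCF",
and the output path are placeholders, so substitute your own):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.SortedMap;

    import org.apache.cassandra.db.IColumn;
    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.thrift.SliceRange;
    import org.apache.cassandra.utils.ByteBufferUtil;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class CassandraDump
    {
        public static class DumpMapper
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, Text>
        {
            @Override
            public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns,
                            Context context) throws IOException, InterruptedException
            {
                // One output line per column: rowkey <TAB> colname <TAB> value.
                // Assumes UTF-8 keys/names/values; hex-encode binary data instead.
                for (IColumn col : columns.values())
                    context.write(new Text(ByteBufferUtil.string(key)),
                                  new Text(ByteBufferUtil.string(col.name()) + "\t"
                                           + ByteBufferUtil.string(col.value())));
            }
        }

        public static void main(String[] args) throws Exception
        {
            Job job = new Job(new Configuration(), "cassandradump");
            job.setJarByClass(CassandraDump.class);
            job.setMapperClass(DumpMapper.class);
            job.setNumReduceTasks(0);  // map-only: no aggregation, just a dump
            job.setInputFormatClass(ColumnFamilyInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            TextOutputFormat.setOutputPath(job, new Path("/export/MyCF"));

            Configuration conf = job.getConfiguration();
            ConfigHelper.setRpcPort(conf, "9160");
            ConfigHelper.setInitialAddress(conf, "localhost");
            ConfigHelper.setPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
            ConfigHelper.setInputColumnFamily(conf, "MyKeyspace", "MyCF");
            // Grab every column of every row.
            ConfigHelper.setInputSlicePredicate(conf, new SlicePredicate().setSlice_range(
                new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER,
                               ByteBufferUtil.EMPTY_BYTE_BUFFER, false, Integer.MAX_VALUE)));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The input format splits the token range across the nodes, so each
mapper reads locally and the framework does the cluster-wide merge
for you.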
So I think that Brisk is going to be easier to implement, but also
closer in efficiency to the way you want to do it.

Adrian

On Monday, May 23, 2011, Yang <teddyyyy...@gmail.com> wrote:
> Thanks Sri.
>
> I am trying to make sure that Brisk underneath does a simple scraping
> of the rows, instead of doing foreach key ( keys ) { lookup (key) }.
> After that, I can feel comfortable using Brisk for the import/export
> jobs.
>
> Yang
>
> On Mon, May 23, 2011 at 8:50 AM, SriSatish Ambati
> <srisat...@datastax.com> wrote:
>> Adrian,
>> +1
>> Using Hive and Hadoop for the export/import of data from and to
>> Cassandra is one of the original use cases we had in mind for Brisk.
>> It also has the ability to parallelize the workload and finish
>> rapidly.
>> Thanks,
>> Sri
>> On Sun, May 22, 2011 at 11:31 PM, Adrian Cockcroft
>> <adrian.cockcr...@gmail.com> wrote:
>>>
>>> Hi Yang,
>>>
>>> You could also use Hadoop (i.e. Brisk) and run a MapReduce job or a
>>> Hive query to extract and summarize/renormalize the data into
>>> whatever format you like.
>>>
>>> If you use sstable2json, you have to run it on every file on every
>>> node, then deduplicate/merge all the output across machines, which
>>> is what MapReduce does anyway.
>>>
>>> Our data flow is to take backups of a production cluster, restore a
>>> backup to a different cluster running Hadoop, and then run our
>>> point-in-time data extraction for ETL processing by the BI team.
>>> The backup/restore gives us a frozen-in-time (consistent to within
>>> a second or so) cluster for extraction. Running live with Brisk
>>> means you are running your extraction over a moving target.
>>>
>>> Adrian
>>>
>>> On Sun, May 22, 2011 at 11:14 PM, Yang <teddyyyy...@gmail.com> wrote:
>>> > Thanks Jonathan.
>>> >
>>> > On Sun, May 22, 2011 at 9:56 PM, Jonathan Ellis <jbel...@gmail.com>
>>> > wrote:
>>> >> I'd modify SSTableExport.serializeRow (the sstable2json class) to
>>> >> output to whatever system you are targeting.
>>> >>
>>> >> On Sun, May 22, 2011 at 11:19 PM, Yang <teddyyyy...@gmail.com> wrote:
>>> >>> Let's say that periodically (daily) I need to dump out the
>>> >>> contents of my Cassandra DB and import them into Oracle or some
>>> >>> other custom data store. Is there a way to do that?
>>> >>>
>>> >>> I checked that you can do a multi-get(), but you probably can't
>>> >>> pass the entire key domain into the API, because then the entire
>>> >>> db would be returned on a single Thrift call and would probably
>>> >>> overflow the API. Besides, multi-get underneath just sends out
>>> >>> per-key lookups one by one, while I really do not care about
>>> >>> which key corresponds to which result; a simple scraping of the
>>> >>> underlying SSTables would be perfect, because I could take
>>> >>> advantage of the file cache as I read down the file.
>>> >>>
>>> >>> Thanks,
>>> >>> Yang
>>> >>
>>> >> --
>>> >> Jonathan Ellis
>>> >> Project Chair, Apache Cassandra
>>> >> co-founder of DataStax, the source for professional Cassandra support
>>> >> http://www.datastax.com
>>> >
>>
>> --
>> SriSatish Ambati
>> Director of Engineering, DataStax
>> @srisatish
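P.S. If you do go the sstable2json route Jonathan mentions above, the
change is small. As a rough, untested sketch of his suggestion (the
serializeRow signature is from memory of the 0.8
org.apache.cassandra.tools.SSTableExport, and the three-column target
table row_key/col_name/col_value is made up), you would swap the JSON
PrintStream for a JDBC batch insert:

    // Drop-in replacement inside SSTableExport; also needs
    // java.sql.PreparedStatement and java.sql.SQLException imported.
    private static void serializeRow(SSTableIdentityIterator row, DecoratedKey key,
                                     PreparedStatement insert)
        throws IOException, SQLException
    {
        String rowKey = ByteBufferUtil.string(key.key);  // assumes UTF-8 row keys
        while (row.hasNext())
        {
            IColumn column = row.next();
            insert.setString(1, rowKey);
            insert.setBytes(2, ByteBufferUtil.getArray(column.name()));
            insert.setBytes(3, ByteBufferUtil.getArray(column.value()));
            insert.addBatch();
        }
        insert.executeBatch();  // one batch per row; tune for your target DB
    }

You would still have to run it over every sstable on every node and
deduplicate afterwards, since the same row can show up in sstables on
several replicas.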