Thanks, Sri. I am trying to make sure that underneath, Brisk does a simple sequential scan of the rows instead of doing foreach key (keys) { lookup(key) }. Once that's confirmed, I can feel comfortable using Brisk for the import/export jobs.
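For reference, the access pattern I'm hoping Brisk uses is the one Cassandra's own Hadoop input format already provides: each mapper walks the contiguous range of rows in its split rather than issuing per-key lookups. Below is a minimal sketch of a map-only export job, assuming the 0.8-era ColumnFamilyInputFormat/ConfigHelper API; the keyspace, column family, host, and output path are placeholders I made up:

    import java.nio.ByteBuffer;
    import java.util.SortedMap;

    import org.apache.cassandra.db.IColumn;
    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.thrift.SliceRange;
    import org.apache.cassandra.utils.ByteBufferUtil;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CfDump {
        // Each map() call receives one whole row (key plus sliced columns);
        // the input format walks the rows of its split in token order
        // instead of doing per-key lookups.
        public static class RowMapper
                extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, Text> {
            @Override
            protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns,
                               Context ctx) throws java.io.IOException, InterruptedException {
                // Real code would serialize the row for the target store;
                // this just emits the key and its column count.
                ctx.write(new Text(ByteBufferUtil.bytesToHex(key)),
                          new Text(Integer.toString(columns.size())));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job();
            job.setJarByClass(CfDump.class);
            job.setMapperClass(RowMapper.class);
            job.setNumReduceTasks(0);                      // map-only export
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setInputFormatClass(ColumnFamilyInputFormat.class);
            FileOutputFormat.setOutputPath(job, new Path("/tmp/cf_dump"));

            ConfigHelper.setRpcPort(job.getConfiguration(), "9160");
            ConfigHelper.setInitialAddress(job.getConfiguration(), "localhost");
            ConfigHelper.setPartitioner(job.getConfiguration(),
                    "org.apache.cassandra.dht.RandomPartitioner");
            ConfigHelper.setInputColumnFamily(job.getConfiguration(),
                    "Keyspace1", "Standard1");
            // Empty start/finish in the slice = all columns of every row.
            ConfigHelper.setInputSlicePredicate(job.getConfiguration(),
                    new SlicePredicate().setSlice_range(new SliceRange(
                            ByteBufferUtil.EMPTY_BYTE_BUFFER,
                            ByteBufferUtil.EMPTY_BYTE_BUFFER, false, Integer.MAX_VALUE)));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Whether Brisk wires its Hive tables through this same input format is exactly what I want to verify.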
yang

On Mon, May 23, 2011 at 8:50 AM, SriSatish Ambati <srisat...@datastax.com> wrote:
> Adrian,
> +1
> Using Hive & Hadoop for the export-import of data from & to Cassandra is
> one of the original use cases we had in mind for Brisk. It also has the
> ability to parallelize the workload and finish rapidly.
> thanks,
> Sri
>
> On Sun, May 22, 2011 at 11:31 PM, Adrian Cockcroft
> <adrian.cockcr...@gmail.com> wrote:
>>
>> Hi Yang,
>>
>> You could also use Hadoop (i.e., Brisk) and run a MapReduce job or a
>> Hive query to extract and summarize/renormalize the data into whatever
>> format you like.
>>
>> If you use sstable2json, you have to run it on every file on every node
>> and then deduplicate/merge all the output across machines, which is what
>> MR does anyway.
>>
>> Our data flow is to take backups of a production cluster, restore a
>> backup to a different cluster running Hadoop, and then run our
>> point-in-time data extraction for ETL processing by the BI team. The
>> backup/restore gives a frozen-in-time cluster (consistent to within a
>> second or so) for extraction. Running live against Brisk means you are
>> running your extraction over a moving target.
>>
>> Adrian
>>
>> On Sun, May 22, 2011 at 11:14 PM, Yang <teddyyyy...@gmail.com> wrote:
>> > Thanks Jonathan.
>> >
>> > On Sun, May 22, 2011 at 9:56 PM, Jonathan Ellis <jbel...@gmail.com>
>> > wrote:
>> >> I'd modify SSTableExport.serializeRow (the sstable2json class) to
>> >> output to whatever system you are targeting.
>> >>
>> >> On Sun, May 22, 2011 at 11:19 PM, Yang <teddyyyy...@gmail.com> wrote:
>> >>> Let's say that periodically (daily) I need to dump out the contents
>> >>> of my Cassandra DB and import them into Oracle or some other custom
>> >>> data store. Is there a way to do that?
>> >>>
>> >>> I see that you can do a multiget(), but you probably can't pass the
>> >>> entire key domain into the API, because the entire DB would be
>> >>> returned on a single Thrift call and would probably overflow it.
>> >>> Besides, multiget underneath just sends out per-key lookups one by
>> >>> one, while I really do not care about which key corresponds to which
>> >>> result; a simple sequential scan of the underlying SSTable would be
>> >>> perfect, because I could exploit file-cache locality as I read down
>> >>> the file.
>> >>>
>> >>> Thanks
>> >>> Yang
>> >>
>> >> --
>> >> Jonathan Ellis
>> >> Project Chair, Apache Cassandra
>> >> co-founder of DataStax, the source for professional Cassandra support
>> >> http://www.datastax.com
>
> --
> SriSatish Ambati
> Director of Engineering, DataStax
> @srisatish
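As a footnote to the multiget question in the original post above: even over plain Thrift you can scan the whole key domain in pages with get_range_slices instead of multiget, which avoids both the one-giant-call problem and the per-key lookups. A minimal sketch, assuming the 0.7/0.8-era Thrift API; the host, keyspace, and column family names are placeholders I made up:

    import java.nio.ByteBuffer;
    import java.util.List;

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.ColumnParent;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.cassandra.thrift.KeyRange;
    import org.apache.cassandra.thrift.KeySlice;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.thrift.SliceRange;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;

    public class FullScan {
        private static final int PAGE = 1000;
        private static final ByteBuffer EMPTY = ByteBuffer.wrap(new byte[0]);

        public static void main(String[] args) throws Exception {
            TFramedTransport transport =
                    new TFramedTransport(new TSocket("localhost", 9160));
            Cassandra.Client client =
                    new Cassandra.Client(new TBinaryProtocol(transport));
            transport.open();
            client.set_keyspace("Keyspace1");

            ColumnParent parent = new ColumnParent("Standard1");
            // Empty start/finish in the slice = all columns of each row.
            SlicePredicate predicate = new SlicePredicate().setSlice_range(
                    new SliceRange(EMPTY, EMPTY, false, Integer.MAX_VALUE));

            ByteBuffer start = EMPTY;
            boolean firstPage = true;
            while (true) {
                // [start, ""] covers the remainder of the ring.
                KeyRange range = new KeyRange(PAGE).setStart_key(start).setEnd_key(EMPTY);
                List<KeySlice> page = client.get_range_slices(
                        parent, predicate, range, ConsistencyLevel.ONE);
                for (int i = 0; i < page.size(); i++) {
                    // start_key is inclusive, so after the first page the
                    // first row repeats the previous page's last key: skip it.
                    if (!firstPage && i == 0) continue;
                    KeySlice row = page.get(i);
                    // Export row.key and row.getColumns() to the target store here.
                }
                if (page.size() < PAGE) break;           // short page = end of the ring
                start = page.get(page.size() - 1).key;   // resume from the last key seen
                firstPage = false;
            }
            transport.close();
        }
    }

With RandomPartitioner the rows come back in token order rather than key order, which is fine for a full dump where, as noted above, the key-to-result correspondence doesn't matter.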