The best way to generate dumps from Cassandra is via Hadoop integration (or Spark). You can find more info here:
http://www.datastax.com/documentation/cassandra/2.1/cassandra/configuration/configHadoop.html
http://wiki.apache.org/cassandra/HadoopSupport

On Thu, Oct 9, 2014 at 4:19 AM, Gaurav Bhatnagar <gbhatna...@gmail.com> wrote:

> Hi,
> We have a Cassandra database column family containing 320 millions rows
> and each row contains about 15 columns. We want to take monthly dump of
> this single column family contained in this database in text format.
>
> We are planning to take following approach to implement this functionality
> 1. Take a snapshot of Cassandra database using nodetool utility. We
>    specify -cf flag to specify column family name so that snapshot
>    contains data corresponding to a single column family.
> 2. We take backup of this snapshot and move this backup to a separate
>    physical machine.
> 3. We using "SStable to json conversion" utility to json convert all the
>    data files into json format.
>
> We have following questions/doubts regarding the above approach
> a) Generated json records contains "d" (IS_MARKED_FOR_DELETE) flag in json
>    record and can I safely ignore all such json records?
> b) If I ignore all records marked by "d" flag, than can generated json
>    files in step 3, contain duplicate records? I mean do multiple entries
>    for same key.
>
> Do there can be any other better approach to generate data dumps in text
> format.
>
> Regards,
> Gaurav

--
*Paulo Motta*
Chaordic | *Platform*
*www.chaordic.com.br <http://www.chaordic.com.br/>*
+55 48 3232.3200
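
On questions (a) and (b) in the quoted message: the same row key can appear in several SSTables, and a tombstone in one file can shadow a live value in another, so dropping "d" records file by file is probably not safe. Below is a rough Python sketch of merging the per-SSTable JSON dumps by timestamp before filtering. It assumes each dump is a JSON array of {"key": ..., "cells": [[name, value, timestamp, optional flag], ...]} objects, which should be checked against the actual output of your Cassandra version. The function name merge_dumps and the sample data are made up for illustration, and row-level deletions, range tombstones and TTL expiry are not handled.

```python
import json


def merge_dumps(dumps):
    """Merge rows from several SSTable JSON dumps (layout is an assumption).

    For every (row key, column name) pair, keep only the cell with the
    highest timestamp. If that winning cell carries the "d" flag, the
    column is treated as deleted and left out of the result.
    """
    latest = {}  # (row key, column name) -> (timestamp, is_tombstone, value)
    for rows in dumps:
        for row in rows:
            for cell in row.get("cells", []):
                name, value, timestamp = cell[0], cell[1], cell[2]
                is_tombstone = len(cell) > 3 and cell[3] == "d"
                candidate = (timestamp, is_tombstone, value)
                slot = (row["key"], name)
                # Compare (timestamp, is_tombstone) so that on an equal
                # timestamp the tombstone wins over the live value.
                if slot not in latest or candidate[:2] > latest[slot][:2]:
                    latest[slot] = candidate

    result = {}
    for (key, name), (_, is_tombstone, value) in latest.items():
        if not is_tombstone:
            result.setdefault(key, {})[name] = value
    return result


# Hypothetical sample: the older file holds live values, the newer file
# deletes "name" and overwrites "city" for the same row key.
older = json.loads(
    '[{"key": "k1", "cells": [["name", "alice", 100], ["city", "paris", 100]]}]'
)
newer = json.loads(
    '[{"key": "k1", "cells": [["name", "54321", 200, "d"], ["city", "rome", 150]]}]'
)

merged = merge_dumps([older, newer])
print(merged)
```

In this sample, simply skipping the "d" record would leave the stale value "alice" from the older file in the dump, and the key k1 would show up once per file. Merging first avoids both problems, at the cost of holding the merged cells in memory, which is one more reason a Hadoop or Spark job is likely the better fit for 320 million rows.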