You might also want to consider tools like https://github.com/Netflix/aegisthus for the last step, which can help you deal with tombstones and de-duplicate data.
Thanks,
Daniel

On Thu, Oct 9, 2014 at 12:19 AM, Gaurav Bhatnagar <gbhatna...@gmail.com> wrote:
> Hi,
> We have a Cassandra database column family containing 320 million rows,
> and each row contains about 15 columns. We want to take a monthly dump of
> this single column family in text format.
>
> We are planning the following approach to implement this functionality:
> 1. Take a snapshot of the Cassandra database using the nodetool utility,
>    passing the -cf flag to specify the column family name so that the
>    snapshot contains data for a single column family.
> 2. Take a backup of this snapshot and move the backup to a separate
>    physical machine.
> 3. Use the "SSTable to JSON conversion" utility to convert all the data
>    files into JSON format.
>
> We have the following questions/doubts regarding the above approach:
> a) Generated JSON records contain a "d" (IS_MARKED_FOR_DELETE) flag; can
>    I safely ignore all such records?
> b) If I ignore all records marked with the "d" flag, can the JSON files
>    generated in step 3 still contain duplicate records, i.e. multiple
>    entries for the same key?
>
> Is there any other, better approach to generating data dumps in text
> format?
>
> Regards,
> Gaurav
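
Regarding (b): the same row key can appear in more than one SSTable, so a plain per-file dump can contain multiple entries for one key, and a merge pass is needed before the "d"-flagged data can be safely discarded (a newer tombstone must shadow an older live value). Here is a minimal post-processing sketch; it assumes the pre-3.0 sstable2json output shape, where each file is a JSON array of rows like {"key": ..., "columns": [[name, value, timestamp, ...], ...]} and a fourth element "d" marks a deleted column. The function name and file paths are illustrative, not part of any Cassandra tool.

```python
import json

def merge_rows(files):
    """Merge sstable2json output files: de-duplicate rows by key,
    keep the newest version of each column, and drop tombstones."""
    rows = {}  # row key -> {column name -> (timestamp, value or None)}
    for path in files:
        with open(path) as f:
            for row in json.load(f):
                cols = rows.setdefault(row["key"], {})
                for col in row["columns"]:
                    name, value, ts = col[0], col[1], col[2]
                    deleted = len(col) > 3 and col[3] == "d"
                    prev = cols.get(name)
                    if prev is None or ts > prev[0]:
                        # Keep only the newest version; a newer tombstone
                        # must shadow older live values, so record it as None
                        cols[name] = (ts, None if deleted else value)
    # Discard tombstoned columns only after the newest version is known
    return {
        key: {n: v for n, (ts, v) in cols.items() if v is not None}
        for key, cols in rows.items()
    }
```

The key point is that tombstones are filtered last: dropping "d"-flagged entries per file, before merging, could resurrect an older value of the same column from another SSTable.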