Hey Jens, thank you so much for the advice and for reading through. So are you saying that I should query Cassandra directly? If yes, as I mentioned, I have to run this during traffic hours. Isn't there a possibility that my production traffic to the DB could be impacted? Also, is it okay to use Hector for this?
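One common way to keep a full-table export from hammering a live cluster is to read it in many small token-range slices instead of one huge scan, pacing (or parallelizing) the slices as load allows. Below is a minimal sketch of the range arithmetic only, assuming the cluster uses Murmur3Partitioner (a RandomPartitioner ring would use the bounds 0..2^127-1 instead); the actual `SELECT ... WHERE token(key) > ? AND token(key) <= ?` queries would be issued by whichever client is in use, and the function name `token_subranges` is just for illustration.

```python
# Hypothetical sketch: split the full Murmur3 token ring into equal,
# contiguous sub-ranges so a full-table export can be run one slice at
# a time (throttled) or handed out to a bounded pool of workers.

MIN_TOKEN = -2**63       # Murmur3Partitioner minimum token
MAX_TOKEN = 2**63 - 1    # Murmur3Partitioner maximum token


def token_subranges(n):
    """Split (MIN_TOKEN, MAX_TOKEN] into n contiguous (start, end] ranges."""
    total = MAX_TOKEN - MIN_TOKEN
    step = total // n
    ranges = []
    start = MIN_TOKEN
    for i in range(n):
        # Last range absorbs the rounding remainder so the ring is covered.
        end = MAX_TOKEN if i == n - 1 else start + step
        ranges.append((start, end))
        start = end
    return ranges
```

Each returned pair would become one bounded query against the cluster, with a sleep or rate limiter between slices during traffic hours.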
Best,
Parth

On Mon, Jan 26, 2015 at 2:19 PM, Jens Rantil <jens.ran...@tink.se> wrote:

> Hi Parth,
>
> I'll take your questions in order:
>
> 1. Have a look at the compaction subproperties for STCS:
> http://datastax.com/documentation/cql/3.1/cql/cql_reference/compactSubprop.html
>
> 2. Why not talk to Cassandra directly when generating the report? It will
> be way faster (and easier!): Cassandra will use bloom filters, handle
> shadowed (overwritten) columns, and handle tombstones for you, not to
> mention the fact that its sstables are hot in the OS file cache.
>
> 3. See 2) above. Also, your approach requires you to implement handling
> of shadowed columns as well as tombstones yourself, which could get
> pretty messy.
>
> Cheers,
> Jens
>
> ———
> Jens Rantil
> Backend engineer, Tink AB
> Email: jens.ran...@tink.se
> Phone: +46 708 84 18 32
> Web: www.tink.se
> Facebook Linkedin Twitter
>
> On Mon, Jan 26, 2015 at 7:40 AM, Parth Setya <setya.pa...@gmail.com> wrote:
>
>> Hi,
>>
>> *Setup*
>> 3-node cluster
>> API: Hector
>> CL: QUORUM
>> RF: 3
>> Compaction strategy: Size-Tiered Compaction (STCS)
>>
>> *Use Case*
>> I have about 320 million rows (~12 to 15 columns each) stored in
>> Cassandra. In order to generate a report containing ALL that data, I do
>> the following:
>> 1. Run compaction
>> 2. Take a snapshot of the db
>> 3. Run sstable2json on all the *Data.db files
>> 4. Read those JSON files and write them to a CSV
>>
>> *Problem*
>> The sstable2json step takes about 350-400 hours (~85% of the total
>> time), lengthening the whole process. I am running sstable2json
>> sequentially on all the *Data.db files, but their sizes are so
>> inconsistent that running the conversions concurrently doesn't help
>> either (e.g. one file is 25 GB while another is 500 MB).
>>
>> *My Thought Process*
>> Is there a way to cap the maximum size of the sstables generated by
>> compaction, so that I end up with multiple sstables of uniform size?
>> Then I could run the sstable2json utility on them concurrently.
>>
>> *Questions*
>> 1. Is there a way to configure the size of the sstables created by
>> compaction?
>> 2. Is there a better approach to generating the report?
>> 3. What are the flaws in this approach?
>>
>> Best,
>> Parth
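Even without uniform sstable sizes, concurrent sstable2json runs can be balanced by scheduling files onto a fixed number of workers largest-first (the classic longest-processing-time heuristic), so a 25 GB file does not end up queued behind several small ones. This is a sketch under the assumption that conversion time scales roughly with file size; `assign_files` is a hypothetical helper, and each returned bucket would be one sequential stream of sstable2json invocations.

```python
# Hypothetical sketch: greedy LPT scheduling of (name, size_bytes)
# sstable files onto `workers` buckets. Files are taken largest-first
# and always assigned to the currently least-loaded bucket, which keeps
# the per-worker total sizes (and thus wall-clock times) close together.
import heapq


def assign_files(files, workers):
    """Return `workers` lists of file names with roughly balanced total size."""
    heap = [(0, i) for i in range(workers)]   # (total_bytes, worker_id)
    heapq.heapify(heap)
    buckets = [[] for _ in range(workers)]
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        load, i = heapq.heappop(heap)         # least-loaded worker so far
        buckets[i].append(name)
        heapq.heappush(heap, (load + size, i))
    return buckets
```

With the 25 GB / 500 MB skew described above, this tends to give the huge file a worker to itself while the small files share the others, which helps concurrency even before any compaction tuning.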