Parth,

> So are you saying that I should query Cassandra right away?

Well, don't take my word for it, but it definitely sounds like a simpler approach.

> If yes, like I mentioned, I have to run this during traffic hours. Isn't
> there a possibility then that my traffic to the db may get impacted?

Absolutely, it could. But so will converting your sstables to JSON. But a database is also made to be read from ;) I suggest you set up a test cluster and measure the load impact there before you try other approaches (such as dumping the database). If the load turns out to be too high, you could also add some kind of rate limiting and/or concurrency limit to your report generation, as in the sketch below.
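To give you an idea, here is a minimal, untested sketch of what I mean. I'm using the DataStax Java driver plus Guava's RateLimiter here rather than Hector, simply because I know them better; the contact point, keyspace, table, column names, fetch size and rate cap are all made-up placeholders you'd adapt:

    import java.io.FileWriter;
    import java.io.PrintWriter;

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.google.common.util.concurrent.RateLimiter;

    public class ReportExporter {
        public static void main(String[] args) throws Exception {
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1") // placeholder contact point
                    .build();
            Session session = cluster.connect("my_keyspace"); // placeholder keyspace
            try (PrintWriter csv = new PrintWriter(new FileWriter("report.csv"))) {
                // Small pages keep each individual read cheap; the driver
                // fetches the next page transparently while we iterate.
                SimpleStatement stmt =
                        new SimpleStatement("SELECT id, value FROM my_table");
                stmt.setFetchSize(1000);

                // Cap the export at ~2000 rows/s so live traffic keeps headroom.
                RateLimiter limiter = RateLimiter.create(2000.0);

                ResultSet rows = session.execute(stmt);
                for (Row row : rows) {
                    limiter.acquire(); // blocks until we are under the rate cap
                    csv.println(row.getString("id") + "," + row.getString("value"));
                }
            } finally {
                cluster.close();
            }
        }
    }

The driver pages through the rows as you iterate, so memory stays bounded, and the limiter decides how hard the export hits the cluster. The same pattern should be doable with Hector's range slice queries.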
I also know that people have successfully used Spark or similar infrastructure for batch processing of Cassandra data. I'm not sure it fits your case, but it could be useful for you to look into.
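I haven't tried it myself, so again treat this as a rough, untested sketch, but with the open source spark-cassandra-connector it would look something like this (keyspace, table, column names and the output path are placeholders):

    import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkReport {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("cassandra-report")
                    .set("spark.cassandra.connection.host", "127.0.0.1");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // The connector splits the table by token range, so the full scan
            // is spread across Spark workers instead of one sequential client.
            javaFunctions(sc)
                    .cassandraTable("my_keyspace", "my_table")
                    .map(row -> row.getString("id") + "," + row.getString("value"))
                    .saveAsTextFile("hdfs:///reports/my_table");

            sc.stop();
        }
    }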
> Also, is it okay to use Hector for this?

I have no personal experience with Hector, but I suppose so.

Cheers,
Jens

———
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se

On Mon, Jan 26, 2015 at 9:57 AM, Parth Setya <setya.pa...@gmail.com> wrote:

> Hey Jens,
>
> Thank you so much for the advice and for reading through.
>
> So are you saying that I should query Cassandra right away?
> If yes, like I mentioned, I have to run this during traffic hours. Isn't
> there a possibility then that my traffic to the db may get impacted?
> Also, is it okay to use Hector for this?
>
> Best
>
> On Mon, Jan 26, 2015 at 2:19 PM, Jens Rantil <jens.ran...@tink.se> wrote:
>
>> Hi Parth,
>>
>> I'll take your questions in order:
>>
>> 1. Have a look at the compaction subproperties for STCS:
>> http://datastax.com/documentation/cql/3.1/cql/cql_reference/compactSubprop.html
>>
>> 2. Why not talk to Cassandra when generating the report? It will be waaay
>> faster (and easier!); Cassandra will use bloom filters, handle shadowed
>> (overwritten) columns and handle tombstones for you, not to mention the
>> fact that it uses sstables that are hot in the OS file cache.
>>
>> 3. See 2) above. Also, your approach requires you to implement handling
>> of shadowed columns as well as tombstones yourself, which could get
>> pretty messy.
>>
>> Cheers,
>> Jens
>>
>> On Mon, Jan 26, 2015 at 7:40 AM, Parth Setya <setya.pa...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> *Setup*
>>> 3 Node Cluster
>>> API: *Hector*
>>> CL: *QUORUM*
>>> RF: *3*
>>> Compaction Strategy: *Size Tiered Compaction*
>>>
>>> *Use Case*
>>> I have about *320 million rows* (~12 to 15 columns each) worth of data
>>> stored in Cassandra. In order to generate a report containing ALL that
>>> data, I do the following:
>>> 1. Run compaction
>>> 2. Take a snapshot of the db
>>> 3. Run sstable2json on all the *Data.db files
>>> 4. Read those JSONs and write them to a CSV
>>>
>>> *Problem*
>>> The *sstable2json* utility takes about 350-400 hours (~85% of the total
>>> time), thereby lengthening the process. (I am running sstable2json
>>> sequentially on all the *Data.db files, but their sizes are so
>>> inconsistent that running it concurrently doesn't help either; e.g. one
>>> file is 25 GB while another is 500 MB.)
>>>
>>> *My Thought Process*
>>> Is there a way to put a cap on the maximum size of the sstables that
>>> are generated after compaction, so that I have multiple sstables of
>>> uniform size? Then I could run the sstable2json utility on them
>>> concurrently.
>>>
>>> *Questions:*
>>> 1. Is there a way to configure the size of sstables created after
>>> compaction?
>>> 2. Is there a better approach to generate the report?
>>> 3. What are the flaws with this approach?
>>>
>>> Best,
>>> Parth