Hi

*Setup*

*3 Node Cluster*
API - *Hector*
CL - *QUORUM*
RF - *3*
Compaction Strategy - *Size Tiered Compaction*

*Use Case*
I have about *320 million rows* (~12 to 15 columns each) of data stored in
Cassandra. In order to generate a report containing ALL of that data, I do
the following:
1. Run compaction
2. Take a snapshot of the DB
3. Run sstable2json on all the *Data.db files
4. Read those JSON files and write them to a CSV (a sketch of this step
follows the list)
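
For step 4, this is roughly what I do today, as a minimal sketch. I am
assuming the sstable2json output format from Cassandra 1.x/2.x here (a JSON
array of rows, each with a "key" and a "columns" list of
[name, value, timestamp, ...] entries); the file paths are placeholders:

import csv
import json
import sys

# Sketch of step 4. Assumes the sstable2json output format from
# Cassandra 1.x/2.x: a JSON array of rows, each with a "key" and a
# "columns" list of [name, value, timestamp, ...] entries.
def dump_to_csv(json_path, writer):
    with open(json_path) as f:
        rows = json.load(f)  # loads the whole dump; fine for smaller files
    for row in rows:
        for col in row.get("columns", []):
            writer.writerow([row["key"], col[0], col[1]])

if __name__ == "__main__":
    # usage: python dumps_to_csv.py report.csv dump1.json dump2.json ...
    with open(sys.argv[1], "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["row_key", "column_name", "column_value"])
        for path in sys.argv[2:]:
            dump_to_csv(path, writer)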

*Problem*:
The *sstable2json* utility takes about 350-400 hours (~85% of the total
time), which is what drags the whole process out. I am running sstable2json
sequentially on all the *Data.db files, but their sizes are very uneven, so
simply running it concurrently doesn't help either (e.g. one file is 25 GB
while another is 500 MB).
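
For reference, this is the concurrent variant I tried, as a minimal sketch:
dump the files largest-first across a small worker pool, so the 25 GB file
starts immediately instead of running alone at the tail. It assumes
sstable2json is on the PATH; the data path and worker count are
placeholders.

import glob
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Sketch: dump every *Data.db file with a small worker pool, largest
# files first. The data path and worker count are placeholders.
def dump_sstable(db_file):
    out_path = db_file + ".json"
    with open(out_path, "w") as out:
        subprocess.check_call(["sstable2json", db_file], stdout=out)
    return out_path

if __name__ == "__main__":
    files = glob.glob("/var/lib/cassandra/data/my_ks/my_cf/*Data.db")
    files.sort(key=lambda p: Path(p).stat().st_size, reverse=True)
    with ThreadPoolExecutor(max_workers=4) as pool:
        for out_path in pool.map(dump_sstable, files):
            print("wrote", out_path)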

*My Thought Process:*
Is there a way to put a cap on the maximum size of the SSTables generated
by compaction, so that I end up with multiple SSTables of roughly uniform
size? Then I could run the sstable2json utility on them concurrently.
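
From what I have read, Size Tiered Compaction has no such cap, but Leveled
Compaction does: its sstable_size_in_mb property keeps each SSTable at a
roughly fixed size, at the cost of rewriting all existing data when
switching. A minimal sketch of the switch via CQL with the DataStax Python
driver; the host, keyspace, table name, and 256 MB target are placeholders:

from cassandra.cluster import Cluster

# Sketch only: switch the table to Leveled Compaction with a fixed
# target SSTable size. Host, keyspace, and table names are placeholders.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

# LCS keeps SSTables at roughly sstable_size_in_mb each; note that the
# switch triggers a rewrite of all existing data.
session.execute("""
    ALTER TABLE my_table
    WITH compaction = {
        'class': 'LeveledCompactionStrategy',
        'sstable_size_in_mb': 256
    }
""")
cluster.shutdown()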

*Questions:*
1. Is there a way to configure the size of SSTables created by compaction?
2. Is there a better approach to generating the report? (one idea is
sketched after this list)
3. What are the flaws with this approach?
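
On question 2, one alternative I am weighing is to skip the
snapshot/sstable2json pipeline entirely and page through the table with the
DataStax Python driver, streaming rows straight to CSV. A minimal sketch,
assuming the cluster speaks the CQL native protocol; host, keyspace, and
table names are placeholders:

import csv

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Sketch: page through the whole table and stream rows to CSV, instead
# of dumping SSTables to JSON first. Names below are placeholders.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

statement = SimpleStatement("SELECT * FROM my_table", fetch_size=5000)
with open("report.csv", "w", newline="") as out:
    writer = csv.writer(out)
    header_written = False
    for row in session.execute(statement):  # pages are fetched lazily
        if not header_written:
            writer.writerow(row._fields)  # rows are namedtuples
            header_written = True
        writer.writerow(list(row))
cluster.shutdown()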

Best
Parth
