Hi Parth,



I’ll take your questions in order:




1. Have a look at the compaction subproperties for STCS: 
http://datastax.com/documentation/cql/3.1/cql/cql_reference/compactSubprop.html


2. Why not query Cassandra directly when generating the report? It will be way
faster (and easier!): Cassandra will use bloom filters and handle shadowed
(overwritten) columns and tombstones for you, not to mention the fact that
it reads sstables that are already hot in the OS file cache.
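
To make 2) a bit more concrete, here is a rough sketch of one common way to read a whole table through Cassandra: split the token range into slices and query each slice separately, so several workers can export in parallel. It only builds the CQL strings. I am assuming Murmur3Partitioner (an older cluster may well be on RandomPartitioner, which has a different token range) and that your data is reachable over CQL. The names my_keyspace.my_table and key are made up, so replace them with your own.

```python
# Sketch: split the full token range into contiguous slices so a table can be
# exported in parallel through Cassandra rather than via sstable2json.
# ASSUMPTIONS: Murmur3Partitioner; "my_keyspace.my_table" and "key" are
# hypothetical placeholder names.

MIN_TOKEN = -2**63      # lowest Murmur3Partitioner token
MAX_TOKEN = 2**63 - 1   # highest Murmur3Partitioner token


def split_token_range(n_slices):
    """Return n_slices (start, end) pairs covering MIN_TOKEN..MAX_TOKEN."""
    if n_slices < 1:
        raise ValueError("n_slices must be >= 1")
    total = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + (total * i) // n_slices for i in range(n_slices + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def slice_query(start, end, first, table="my_keyspace.my_table", key="key"):
    """Build the CQL for one slice: (start, end], or [start, end] if first."""
    lower = ">=" if first else ">"
    return ("SELECT * FROM %s WHERE token(%s) %s %d AND token(%s) <= %d"
            % (table, key, lower, start, key, end))


def build_queries(n_slices):
    """One query per slice; together they cover every token exactly once."""
    slices = split_token_range(n_slices)
    return [slice_query(s, e, i == 0) for i, (s, e) in enumerate(slices)]


if __name__ == "__main__":
    for query in build_queries(4):
        print(query)
```

Each worker would run its own queries with whatever client you use and append the rows to its own CSV file. The number of slices is a knob you would have to tune against what the cluster can take.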




3. See 2) above. Also, your approach requires you to implement the handling of
shadowed columns and tombstones yourself, which could get pretty messy.
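
To give an idea of what that handling involves, here is a deliberately simplified sketch of merging cells that come from several sstables. The Cell record is something I made up for illustration. It is not the sstable2json output format, and the rule that a tombstone wins on equal timestamps is an assumption of this sketch. It also ignores row-level and range tombstones and TTLs, all of which you would have to deal with too.

```python
from collections import namedtuple

# Hypothetical cell record for illustration; NOT the sstable2json format.
Cell = namedtuple("Cell", ["row_key", "column", "value", "timestamp", "deleted"])


def reconcile(cells):
    """Keep the newest version of each (row_key, column); drop deleted ones."""
    newest = {}
    for cell in cells:
        ident = (cell.row_key, cell.column)
        current = newest.get(ident)
        # Assumption of this sketch: on equal timestamps a tombstone wins.
        if (current is None
                or cell.timestamp > current.timestamp
                or (cell.timestamp == current.timestamp and cell.deleted)):
            newest[ident] = cell
    # A cell whose newest version is a tombstone must not reach the report.
    return {ident: c.value for ident, c in newest.items() if not c.deleted}


if __name__ == "__main__":
    cells = [
        Cell("r1", "a", "old", 1, False),
        Cell("r1", "a", "new", 2, False),   # shadows the older value
        Cell("r1", "b", "x", 1, False),
        Cell("r1", "b", None, 3, True),     # tombstone deletes column b
    ]
    print(reconcile(cells))
```

Note that this keeps every cell in memory, which will not work as-is for 320 million rows. That is part of why I would let Cassandra do the merging.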




Cheers,

Jens


———
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se


On Mon, Jan 26, 2015 at 7:40 AM, Parth Setya <setya.pa...@gmail.com>
wrote:

> Hi
> *Setup*
> *3 Node Cluster*
> API: *Hector*
> CL: *QUORUM*
> RF: *3*
> Compaction Strategy: *Size Tiered Compaction*
> *Use Case*
> I have about *320 million rows*(~12 to 15 columns each) worth of data
> stored in Cassandra. In order to generate a report containing ALL that
> data, I do the following:
> 1. Run Compaction
> 2. Take a snapshot of the db
> 3. Run sstable2json on all the *Data.db files
> 4. Read those jsons and write to a csv.
> *Problem*:
> The *sstable2json* utility takes about 350-400 hours (~85% of the total
> time), which lengthens the whole process. (I am running sstable2json
> sequentially on all the *Data.db files. Their sizes are inconsistent, so
> running it concurrently doesn't help either, e.g. one file is 25 GB
> while another is 500 MB.)
> *My Thought Process:*
> Is there a way to put a cap on the maximum size of the sstables that are
> generated after compaction, so that I have multiple sstables of uniform
> size? Then I could run the sstable2json utility on them concurrently.
> *Questions:*
> 1. Is there a way to configure the size of sstables created after
> compaction?
> 2. Is there a better approach to generate the report?
> 3. What are the flaws with this approach?
> Best
> Parth
