Hey Jens, thank you so much for the advice and for reading through. So are you saying that I should query Cassandra directly? If yes, as I mentioned, I have to run this during traffic hours. Isn't there a possibility that my production traffic to the DB could be impacted? Also, is it okay to use Hector for this?
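One common way to keep a full-table export from hammering a live cluster is to read it in many small token-range slices instead of one huge scan, pacing (or parallelizing) the slices as load allows. Below is a minimal sketch of the range arithmetic only, assuming the cluster uses Murmur3Partitioner (a RandomPartitioner ring would use the bounds 0..2^127-1 instead); the actual `SELECT ... WHERE token(key) > ? AND token(key) <= ?` queries would be issued by whichever client is in use, and the function name `token_subranges` is just for illustration.

```python
# Hypothetical sketch: split the full Murmur3 token ring into equal,
# contiguous sub-ranges so a full-table export can be run one slice at
# a time (throttled) or handed out to a bounded pool of workers.

MIN_TOKEN = -2**63       # Murmur3Partitioner minimum token
MAX_TOKEN = 2**63 - 1    # Murmur3Partitioner maximum token


def token_subranges(n):
    """Split (MIN_TOKEN, MAX_TOKEN] into n contiguous (start, end] ranges."""
    total = MAX_TOKEN - MIN_TOKEN
    step = total // n
    ranges = []
    start = MIN_TOKEN
    for i in range(n):
        # Last range absorbs the rounding remainder so the ring is covered.
        end = MAX_TOKEN if i == n - 1 else start + step
        ranges.append((start, end))
        start = end
    return ranges
```

Each returned pair would become one bounded query against the cluster, with a sleep or rate limiter between slices during traffic hours.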
Best,
Parth

On Mon, Jan 26, 2015 at 2:19 PM, Jens Rantil <jens.ran...@tink.se> wrote:

> Hi Parth,
>
> I'll take your questions in order:
>
> 1. Have a look at the compaction subproperties for STCS:
> http://datastax.com/documentation/cql/3.1/cql/cql_reference/compactSubprop.html
>
> 2. Why not talk to Cassandra directly when generating the report? It will
> be way faster (and easier!): Cassandra will use bloom filters, handle
> shadowed (overwritten) columns, and handle tombstones for you, not to
> mention the fact that its sstables are hot in the OS file cache.
>
> 3. See 2) above. Also, your approach requires you to implement handling
> of shadowed columns as well as tombstones yourself, which could get
> pretty messy.
>
> Cheers,
> Jens
>
> ———
> Jens Rantil
> Backend engineer, Tink AB
> Email: jens.ran...@tink.se
> Phone: +46 708 84 18 32
> Web: www.tink.se
> Facebook Linkedin Twitter
>
> On Mon, Jan 26, 2015 at 7:40 AM, Parth Setya <setya.pa...@gmail.com> wrote:
>
>> Hi,
>>
>> *Setup*
>> 3-node cluster
>> API: Hector
>> CL: QUORUM
>> RF: 3
>> Compaction strategy: Size-Tiered Compaction (STCS)
>>
>> *Use Case*
>> I have about 320 million rows (~12 to 15 columns each) stored in
>> Cassandra. In order to generate a report containing ALL that data, I do
>> the following:
>> 1. Run compaction
>> 2. Take a snapshot of the db
>> 3. Run sstable2json on all the *Data.db files
>> 4. Read those JSON files and write them to a CSV
>>
>> *Problem*
>> The sstable2json step takes about 350-400 hours (~85% of the total
>> time), lengthening the whole process. I am running sstable2json
>> sequentially on all the *Data.db files, but their sizes are so
>> inconsistent that running the conversions concurrently doesn't help
>> either (e.g. one file is 25 GB while another is 500 MB).
>>
>> *My Thought Process*
>> Is there a way to cap the maximum size of the sstables generated by
>> compaction, so that I end up with multiple sstables of uniform size?
>> Then I could run the sstable2json utility on them concurrently.
>>
>> *Questions*
>> 1. Is there a way to configure the size of the sstables created by
>> compaction?
>> 2. Is there a better approach to generating the report?
>> 3. What are the flaws in this approach?
>>
>> Best,
>> Parth
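Even without uniform sstable sizes, concurrent sstable2json runs can be balanced by scheduling files onto a fixed number of workers largest-first (the classic longest-processing-time heuristic), so a 25 GB file does not end up queued behind several small ones. This is a sketch under the assumption that conversion time scales roughly with file size; `assign_files` is a hypothetical helper, and each returned bucket would be one sequential stream of sstable2json invocations.

```python
# Hypothetical sketch: greedy LPT scheduling of (name, size_bytes)
# sstable files onto `workers` buckets. Files are taken largest-first
# and always assigned to the currently least-loaded bucket, which keeps
# the per-worker total sizes (and thus wall-clock times) close together.
import heapq


def assign_files(files, workers):
    """Return `workers` lists of file names with roughly balanced total size."""
    heap = [(0, i) for i in range(workers)]   # (total_bytes, worker_id)
    heapq.heapify(heap)
    buckets = [[] for _ in range(workers)]
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        load, i = heapq.heappop(heap)         # least-loaded worker so far
        buckets[i].append(name)
        heapq.heappush(heap, (load + size, i))
    return buckets
```

With the 25 GB / 500 MB skew described above, this tends to give the huge file a worker to itself while the small files share the others, which helps concurrency even before any compaction tuning.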