Thanks Nate for your ideas.

> This could be as simple as adding year and month to the primary key (in
> the form 'yyyymm'). Alternatively, you could add this in the partition in
> the definition. Either way, it then becomes pretty easy to re-generate
> these based on the query parameters.

 The thing is that it's not that simple. My customer has a very BAD idea:
using Cassandra as a queue (the perfect anti-pattern).

 Before trying to tell them to redesign their entire architecture and put
in a proper queueing system like ActiveMQ, I would like to see how I can
use wide rows to meet the requirements.

 The functional need is quite simple:

 1) A process A loads users into Cassandra and sets each user's status to
'TODO'. Using the bucketing technique, we can limit the row width to, let's
say, 100 000 columns. So at the end of the current row, process A knows it
should move on to the next bucket. The bucket is encoded in a *composite
partition key*; in our example the partition keys would be 'TODO:1',
'TODO:2', etc.
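
 To make it concrete, here is roughly what I have in mind for process A
(keyspace, table and column names are made up, and I assume the DataStax
Java driver 2.x):

// Hypothetical schema, the composite partition key (status, bucket)
// models the 'TODO:1', 'TODO:2', ... rows:
//   CREATE TABLE user_queue (
//       status  text,
//       bucket  int,
//       user_id timeuuid,
//       payload text,
//       PRIMARY KEY ((status, bucket), user_id)
//   );
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.utils.UUIDs;

public class ProcessA {
    static final int BUCKET_WIDTH = 100_000;   // max columns per physical row

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo_ks");   // hypothetical keyspace

        PreparedStatement insert = session.prepare(
            "INSERT INTO user_queue (status, bucket, user_id, payload)"
          + " VALUES ('TODO', ?, ?, ?)");

        int bucket = 1;
        long written = 0;
        // stand-in for the real user feed
        for (String user : java.util.Arrays.asList("alice", "bob", "carol")) {
            session.execute(insert.bind(bucket, UUIDs.timeBased(), user));
            if (++written % BUCKET_WIDTH == 0) {
                bucket++;   // current row is full, move on to the next bucket
            }
        }
        cluster.close();
    }
}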


 2) A process B reads the wide row for the 'TODO' status. It starts at
bucket 1, so it reads the row with partition key 'TODO:1'. The users are
processed and inserted into a new row, 'PROCESSED:1' for example, to keep
track of their status. After retrieving 100 000 columns, it switches
automatically to the next bucket. Simple, fair enough.
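
 Again as a rough sketch (same hypothetical table as above; I rely on the
driver 2.x automatic paging to stream through the 100 000 columns):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class ProcessB {
    static final int BUCKET_WIDTH = 100_000;

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo_ks");

        PreparedStatement read = session.prepare(
            "SELECT user_id, payload FROM user_queue"
          + " WHERE status = 'TODO' AND bucket = ?");
        PreparedStatement markDone = session.prepare(
            "INSERT INTO user_queue (status, bucket, user_id, payload)"
          + " VALUES ('PROCESSED', ?, ?, ?)");

        int bucket = 1;
        while (true) {
            int seen = 0;
            for (Row row : session.execute(read.bind(bucket))) {
                // ... functional logic on the user goes here ...
                session.execute(markDone.bind(bucket, row.getUUID("user_id"),
                                              row.getString("payload")));
                seen++;
            }
            if (seen < BUCKET_WIDTH) {
                break;      // partial bucket: caught up with process A, stop or wait
            }
            bucket++;       // full bucket consumed, switch to the next one
        }
        cluster.close();
    }
}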


 3) Now what sucks is that sometimes process B does not have enough data to
apply its functional logic to the users it fetched from the wide row, so it
has to RE-PUT some users back into the 'TODO' status rather than
transitioning them to the 'PROCESSED' status. That's exactly queue behavior.

 A simplistic idea would be to re-insert those *m* users with 'TODO:*n*',
where *n* is higher than the current bucket number, so they can be processed
later. *But then it screws up the whole counting system*. Process A, which
inserts data, will not know that there are already *m* users in row *n*, so
it will happily add 100 000 columns, making the row grow to *100 000 + m*
columns. When process B reads this row back, it will stop after the first
100 000 columns and skip the trailing *m* elements.
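
 In code the re-put is just another write into a later 'TODO' bucket, with
nothing telling process A that this row is no longer empty (same
hypothetical table; reput() would be called from process B):

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.List;

public class Reput {
    // Re-put the m users that could not be processed into a later 'TODO'
    // bucket. Process A keeps its own column counter, so it will still add
    // 100 000 columns to that same row (-> 100 000 + m), and process B will
    // stop after the first 100 000 columns and never see the trailing m users.
    static void reput(Session session, int currentBucket, List<Row> leftovers) {
        PreparedStatement reput = session.prepare(
            "INSERT INTO user_queue (status, bucket, user_id, payload)"
          + " VALUES ('TODO', ?, ?, ?)");
        int laterBucket = currentBucket + 1;   // some n above the bucket being read
        for (Row user : leftovers) {           // the m users lacking data
            session.execute(reput.bind(laterBucket, user.getUUID("user_id"),
                                       user.getString("payload")));
        }
    }
}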

  That's the main reason I dropped the idea of bucketing (which is quite
smart in the normal case) in favor of an ultra wide row.

 Anyway, I'll follow your advice and play around with the parameters of
SizeTieredCompactionStrategy.
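
 If I read your suggestion correctly, a first attempt would look something
like this (on the hypothetical table from above, with values to be refined
by testing):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class TuneCompaction {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo_ks");
        // Aim for a small number of large SSTables, per Nate's advice.
        session.execute(
            "ALTER TABLE user_queue WITH compaction = {"
          + " 'class': 'SizeTieredCompactionStrategy',"
          + " 'min_threshold': '2',"            // keep the thresholds really low
          + " 'max_threshold': '4',"
          + " 'bucket_low': '1.0',"             // bucket_low and bucket_high together
          + " 'bucket_high': '1.0',"
          + " 'min_sstable_size': '209715200' }");   // larger min size, ~200 MB in bytes
        cluster.close();
    }
}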

 Regards

 Duy Hai DOAN


On Fri, Jan 31, 2014 at 9:23 PM, Nate McCall <n...@thelastpickle.com> wrote:

>
>>  The only drawback for ultra wide row I can see is point 1). But if I use
>> leveled compaction with a sufficiently large value for "sstable_size_in_mb"
>> (let's say 200Mb), will my read performance be impacted as the row grows ?
>>
>
> For this use case, you would want to use SizeTieredCompaction and play
> around with the configuration a bit to keep a small number of large
> SSTables. Specifically: keep min|max_threshold really low, set bucket_low
> and bucket_high closer together maybe even both to 1.0, and maybe a larger
> min_sstable_size.
>
> YMMV though - per Rob's suggestion, take the time to run some tests
> tweaking these options.
>
>
>>
>>  Of course, splitting wide row into several rows using bucketing
>> technique is one solution but it forces us to keep track of the bucket
>> number and it's not convenient. We have one process (jvm) that insert data
>> and another process (jvm) that read data. Using bucketing, we need to
>> synchronize the bucket number between the 2 processes.
>>
>>
> This could be as simple as adding year and month to the primary key (in
> the form 'yyyymm'). Alternatively, you could add this in the partition in
> the definition. Either way, it then becomes pretty easy to re-generate
> these based on the query parameters.
>
>
>
> --
> -----------------
> Nate McCall
> Austin, TX
> @zznate
>
> Co-Founder & Sr. Technical Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
