What's 'source'? You mean like the URL? If source is too random, it's going to yield too many buckets.
Ingestion rates are fairly high but not insane. About 4M inserts per hour, from 5-10GB.

On Sat, Jun 7, 2014 at 7:13 PM, Colin Clark <co...@clark.ws> wrote:

> Not if you add another column to the partition key; source, for example.
>
> I would really try to stay away from the ordered partitioner if at all
> possible.
>
> What ingestion rates are you expecting, in size and speed?
>
> --
> Colin
> 320-221-9531
>
> On Jun 7, 2014, at 9:05 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>
> Thanks for the feedback on this btw... it's helpful. My notes below.
>
> On Sat, Jun 7, 2014 at 5:14 PM, Colin Clark <co...@clark.ws> wrote:
>
>> No, you're not; the partition key will get distributed across the cluster
>> if you're using random or murmur.
>
> Yes, I'm aware. But in practice this is how it will work:
>
> If we create bucket b0, that will get hashed to h0.
>
> So say I have 50 machines performing writes. They are all on the same time
> thanks to ntpd, so they all compute b0 for the current bucket based on the
> time.
>
> That gets hashed to h0.
>
> If h0 is hosted on node0, then all writes go to node zero for that 1
> second interval.
>
> So all my writes are bottlenecking on one node. That node is *changing*
> over time, but the writes are not being dispatched in parallel over N
> nodes. At most, writes will only ever reach 1 node at a time.
>
>> You could also ensure that by adding another column, like source, to
>> ensure distribution. (Add the seconds to the partition key, not the
>> clustering columns.)
>>
>> I can almost guarantee that if you put too much thought into working
>> against what Cassandra offers out of the box, it will bite you later.
>
> Sure... I'm trying to avoid the 'bite you later' issues. More so because
> I'm sure there are Cassandra gotchas to worry about. Everything has them.
> Just trying to avoid the land mines :-P
>
>> In fact, the use case that you're describing may best be served by a
>> queuing mechanism, and using Cassandra only for the underlying store.
>
> Yes, that's what I'm doing. We're using Apollo to fan out the queue, but
> the writes go back into Cassandra and need to be read out sequentially.
>
>> I used this exact same approach in a use case that involved writing over
>> a million events/second to a cluster with no problems. Initially, I
>> thought ordered partitioner was the way to go too. And I used separate
>> processes to aggregate, conflate, and handle distribution to clients.
>
> Yes. I think using 100 buckets will work for now. Plus I don't have to
> change the partitioner on our existing cluster, and I'm lazy :)
>
>> Just my two cents, but I also spend the majority of my days helping
>> people utilize Cassandra correctly, and rescuing those that haven't.
>
> Definitely appreciate the feedback! Thanks!

--
Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
... or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.
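For what it's worth, the 100-bucket scheme discussed in this thread can be sketched roughly as below. This is only an illustration of the idea, not code from either poster: the function names, the md5-based bucketing, and the event-id parameter are all my own assumptions. The point is that the partition key carries a bucket number 0-99 alongside the time bucket, so writes for a given second hash to ~100 tokens instead of one, while readers can still reassemble a second by scanning all 100 buckets.

```python
# Sketch of a time-bucket + hash-bucket compound partition key (hypothetical
# names; in CQL this would correspond to PRIMARY KEY ((time_bucket,
# hash_bucket), ...)).
import hashlib
import time

NUM_BUCKETS = 100  # the "100 buckets" figure from the thread


def partition_key(event_id, now=None):
    """Return (time_bucket, hash_bucket) for an event.

    time_bucket: the wall-clock second (all writers agree via ntpd).
    hash_bucket: a stable hash of the event id, mod 100, so the same event
    always maps to the same partition while distinct events fan out across
    100 partitions per second instead of hot-spotting one node.
    """
    time_bucket = int(now if now is not None else time.time())
    # md5 rather than the builtin hash(): stable across processes/machines.
    digest = hashlib.md5(event_id.encode("utf-8")).digest()
    hash_bucket = int.from_bytes(digest[:8], "big") % NUM_BUCKETS
    return (time_bucket, hash_bucket)


def buckets_for_second(second):
    """Partition keys a sequential reader must query to cover one second."""
    return [(second, b) for b in range(NUM_BUCKETS)]
```

Reads cost 100 small queries per second of data, but writes within any one second are spread across up to 100 token ranges on the default murmur partitioner, which is the trade-off being accepted above.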