Right now I'm just putting everything together as a proof of concept… so just two cheap replicas for now. And it's at 1/10000th of the load.
If we lose data it's ok :) I think our config will be 2-3x 400GB SSDs in RAID0, 3 replicas, 16 cores, and probably 48-64GB of RAM per box. Just one datacenter for now… We're probably going to migrate to Linux containers at some point. That way we can have 16GB of RAM, one 400GB SSD, and 4 cores per image, and we can ditch the RAID, which is nice. :)

On Sat, Jun 7, 2014 at 7:51 PM, Colin <colpcl...@gmail.com> wrote:

> To have any redundancy in the system, start with at least 3 nodes and a replication factor of 3.
>
> Try to have at least 8 cores, 32 gigs of RAM, and separate disks for log and data.
>
> Will you be replicating data across data centers?
>
> On Jun 7, 2014, at 9:40 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>
> Oh, to start with we're going to use 2-10 nodes.
>
> I think we're going to take the original strategy and just use 100 buckets, 0-99, with the timestamp under that. I think it should be fine and won't require an ordered partitioner. :)
>
> Thanks!
>
> On Sat, Jun 7, 2014 at 7:38 PM, Colin Clark <co...@clark.ws> wrote:
>
>> With 100 nodes, that ingestion rate is actually quite low and I don't think you'd need another column in the partition key.
>>
>> You seem to be set in your current direction. Let us know how it works out.
>>
>> On Jun 7, 2014, at 9:18 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>>
>> What's 'source'? You mean like the URL?
>>
>> If source is too random, it's going to yield too many buckets.
>>
>> Ingestion rates are fairly high but not insane. About 4M inserts per hour, from 5-10GB.
>>
>> On Sat, Jun 7, 2014 at 7:13 PM, Colin Clark <co...@clark.ws> wrote:
>>
>>> Not if you add another column to the partition key; source, for example.
>>>
>>> I would really try to stay away from the ordered partitioner if at all possible.
>>>
>>> What ingestion rates are you expecting, in size and speed?
>>>
>>> On Jun 7, 2014, at 9:05 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>>>
>>> Thanks for the feedback on this, btw; it's helpful. My notes below.
>>>
>>> On Sat, Jun 7, 2014 at 5:14 PM, Colin Clark <co...@clark.ws> wrote:
>>>
>>>> No, you're not; the partition key will get distributed across the cluster if you're using random or murmur.
>>>
>>> Yes, I'm aware. But in practice this is how it will work:
>>>
>>> If we create bucket b0, it will get hashed to h0.
>>>
>>> Say I have 50 machines performing writes. They're all on the same time thanks to ntpd, so they all compute b0 for the current bucket based on the time.
>>>
>>> That gets hashed to h0.
>>>
>>> If h0 is hosted on node0, then all writes go to node0 for that 1-second interval.
>>>
>>> So all my writes are bottlenecking on one node. That node is *changing* over time, but the writes are not being dispatched in parallel over N nodes; at any moment they only ever reach one node.
>>>
>>>> You could also ensure distribution by adding another column, like source. (Add the seconds to the partition key, not the clustering columns.)
>>>>
>>>> I can almost guarantee that if you put too much thought into working against what Cassandra offers out of the box, it will bite you later.
>>>
>>> Sure, I'm trying to avoid the 'bite you later' issues, mostly because I'm sure there are Cassandra gotchas to worry about. Everything has them. Just trying to avoid the land mines. :-P
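For concreteness, a minimal sketch of the bucketed layout being discussed, using the DataStax Python driver. Everything here is an assumption for illustration: the keyspace and table names, the payload column, the local contact point, and the reading that the 0-99 bucket and the epoch second together form the composite partition key. With only the second in the partition key, every writer computes the same partition for a given second and the write load lands on one replica set, which is the bottleneck described above; spreading across 100 buckets gives roughly 100 partitions per second instead of one.

    # Sketch only -- keyspace, table, and column names are hypothetical.
    import random
    import time
    import uuid

    from cassandra.cluster import Cluster

    NUM_BUCKETS = 100  # the 0-99 buckets mentioned in the thread

    cluster = Cluster(["127.0.0.1"])        # assumption: a local test node
    session = cluster.connect("blogindex")  # hypothetical keyspace

    # Composite partition key (bucket, epoch_seconds): for any given second,
    # writes spread over ~100 token ranges instead of all hashing to the one
    # token range owned by a single replica set.
    session.execute("""
        CREATE TABLE IF NOT EXISTS events_bucketed (
            bucket        int,
            epoch_seconds bigint,
            event_id      timeuuid,
            payload       text,
            PRIMARY KEY ((bucket, epoch_seconds), event_id)
        )
    """)

    insert = session.prepare("""
        INSERT INTO events_bucketed (bucket, epoch_seconds, event_id, payload)
        VALUES (?, ?, ?, ?)
    """)

    def write_event(payload):
        """Each writer picks a bucket; hash(source) % NUM_BUCKETS works too."""
        bucket = random.randrange(NUM_BUCKETS)
        session.execute(insert, (bucket, int(time.time()), uuid.uuid1(), payload))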
>>>> In fact, the use case that you're describing may best be served by a queuing mechanism, using Cassandra only for the underlying store.
>>>
>>> Yes, that's what I'm doing. We're using Apollo to fan out the queue, but the writes go back into Cassandra and need to be read out sequentially.
>>>
>>>> I used this exact same approach in a use case that involved writing over a million events/second to a cluster with no problems. Initially, I thought the ordered partitioner was the way to go too. And I used separate processes to aggregate, conflate, and handle distribution to clients.
>>>
>>> Yes, I think using 100 buckets will work for now. Plus, I don't have to change the partitioner on our existing cluster, and I'm lazy. :)
>>>
>>>> Just my two cents, but I also spend the majority of my days helping people utilize Cassandra correctly, and rescuing those that haven't.
>>>
>>> Definitely appreciate the feedback! Thanks!

--

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile <https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>
War is peace. Freedom is slavery. Ignorance is strength. Corporations are people.
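And the read side, under the same assumptions: because one second's events are spread over 100 partitions, reading them back out sequentially means querying every bucket for that second and merging client-side on the timestamp embedded in the timeuuid. A sketch (synchronous for brevity; execute_async would normally be used to issue the 100 queries in parallel):

    # Sketch only -- continues the hypothetical events_bucketed table above.
    import heapq

    from cassandra.cluster import Cluster

    NUM_BUCKETS = 100

    session = Cluster(["127.0.0.1"]).connect("blogindex")

    select = session.prepare("""
        SELECT event_id, payload FROM events_bucketed
        WHERE bucket = ? AND epoch_seconds = ?
    """)

    def read_second(epoch_seconds):
        """Return one second's events across all buckets, in timestamp order."""
        per_bucket = [session.execute(select, (b, epoch_seconds))
                      for b in range(NUM_BUCKETS)]
        # Each partition already comes back ordered by its timeuuid clustering
        # column, so a k-way merge on the embedded timestamp yields one stream.
        return heapq.merge(*per_bucket, key=lambda row: row.event_id.time)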