Re: Data model for streaming a large table in real time.

Kevin Burton Sat, 07 Jun 2014 21:17:16 -0700

We're using containers for other reasons, not just Cassandra.

Tightly constraining resources means we don't have to worry about Cassandra,
the JVM, or Linux doing something silly, using too many resources, and
taking down the whole box.



On Sat, Jun 7, 2014 at 8:25 PM, Colin Clark <co...@clark.ws> wrote:

> You won't need containers - running one instance of Cassandra in that
> configuration will hum along quite nicely and will make use of the cores
> and memory.
>
> I'd forget the raid anyway and just mount the disks separately (jbod)
>
> --
> Colin
> 320-221-9531
>
>
> On Jun 7, 2014, at 10:02 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>
> Right now I'm just putting everything together as a proof of concept… so
> just two cheap replicas for now.  And it's at 1/10000th of the load.
>
> If we lose data it's ok :)
>
> I think our config will be 2-3x 400GB SSDs in RAID0, 3 replicas, 16
> cores, probably 48-64GB of RAM each box.
>
> Just one datacenter for now…
>
> We're probably going to be migrating to using Linux containers at some
> point.  This way we can have something like 16GB, one 400GB SSD, and 4
> cores for each image.  And we can ditch the RAID, which is nice. :)
>
>
> On Sat, Jun 7, 2014 at 7:51 PM, Colin <colpcl...@gmail.com> wrote:
>
>> To have any redundancy in the system, start with at least 3 nodes and a
>> replication factor of 3.
>>
>> Try to have at least 8 cores, 32 gig ram, and separate disks for log and
>> data.
>>
>> Will you be replicating data across data centers?
>>
>> --
>> Colin
>> 320-221-9531
>>
>>
>> On Jun 7, 2014, at 9:40 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>>
>> Oh. To start with we're going to use 2-10 nodes.
>>
>> I think we're going to take the original strategy and just use 100
>> buckets, 0-99, then the timestamp under that.  I think it should be fine
>> and won't require an ordered partitioner. :)
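
A minimal sketch of how such a bucketed key could be built and read back.
The thread does not say how a bucket is picked per write or whether the
timestamp sits in the partition key; this sketch assumes a random bucket
per write and a composite (bucket, second) key. All names are hypothetical.

```python
import random
import time
from typing import List, Optional, Tuple

NUM_BUCKETS = 100  # buckets 0-99, as described in the message


def partition_key(ts_seconds: int, bucket: int) -> Tuple[int, int]:
    """Build the composite key: bucket number first, then the second."""
    if not 0 <= bucket < NUM_BUCKETS:
        raise ValueError("bucket must be in 0..99")
    return (bucket, ts_seconds)


def key_for_write(now: Optional[float] = None) -> Tuple[int, int]:
    """Pick a random bucket (an assumption) so writers in the same second
    land on different partitions."""
    ts = int(time.time() if now is None else now)
    return partition_key(ts, random.randrange(NUM_BUCKETS))


def keys_for_read(ts_seconds: int) -> List[Tuple[int, int]]:
    """A reader has to visit every bucket to see all rows for one second."""
    return [partition_key(ts_seconds, b) for b in range(NUM_BUCKETS)]
```

The trade-off is on the read side: recovering one second of data in order
means querying all 100 buckets and merging the results by timestamp.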
>>
>> Thanks!
>>
>>
>> On Sat, Jun 7, 2014 at 7:38 PM, Colin Clark <co...@clark.ws> wrote:
>>
>>> With 100 nodes, that ingestion rate is actually quite low and I don't
>>> think you'd need another column in the partition key.
>>>
>>> You seem to be set in your current direction.  Let us know how it works
>>> out.
>>>
>>> --
>>> Colin
>>> 320-221-9531
>>>
>>>
>>> On Jun 7, 2014, at 9:18 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>>>
>>> What's 'source' ? You mean like the URL?
>>>
>>> If source is too random it's going to yield too many buckets.
>>>
>>> Ingestion rates are fairly high but not insane.  About 4M inserts per
>>> hour.. from 5-10GB…
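
For scale, the figures above can be turned into per-second rates. This is
a rough sketch that assumes the 5-10GB figure is also per hour (the message
does not say) and uses decimal gigabytes; the function names are
hypothetical.

```python
SECONDS_PER_HOUR = 3600
INSERTS_PER_HOUR = 4_000_000   # "About 4M inserts per hour"
BYTES_PER_GB = 10 ** 9         # decimal gigabytes (assumption)


def inserts_per_second(inserts_per_hour: int = INSERTS_PER_HOUR) -> float:
    """Average insert rate across the whole cluster."""
    return inserts_per_hour / SECONDS_PER_HOUR


def megabytes_per_second(gb_per_hour: float) -> float:
    """Average write bandwidth, assuming the volume is per hour."""
    return gb_per_hour * BYTES_PER_GB / SECONDS_PER_HOUR / 10 ** 6


def bytes_per_insert(gb_per_hour: float,
                     inserts_per_hour: int = INSERTS_PER_HOUR) -> float:
    """Average payload size implied by the two figures."""
    return gb_per_hour * BYTES_PER_GB / inserts_per_hour


if __name__ == "__main__":
    print(f"{inserts_per_second():.0f} inserts/s")
    for gb in (5, 10):
        print(f"{gb} GB/h: {megabytes_per_second(gb):.2f} MB/s, "
              f"{bytes_per_insert(gb):.0f} bytes/insert")
```

Under those assumptions the load is roughly 1,100 inserts per second for
the whole cluster, at about 1.25-2.5 KB per insert.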
>>>
>>>
>>> On Sat, Jun 7, 2014 at 7:13 PM, Colin Clark <co...@clark.ws> wrote:
>>>
>>>> Not if you add another column to the partition key; source for example.
>>>>
>>>>
>>>> I would really try to stay away from the ordered partitioner if at all
>>>> possible.
>>>>
>>>> What ingestion rates are you expecting, in size and speed?
>>>>
>>>> --
>>>> Colin
>>>> 320-221-9531
>>>>
>>>>
>>>> On Jun 7, 2014, at 9:05 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>>>>
>>>>
>>>> Thanks for the feedback on this btw... it's helpful.  My notes below.
>>>>
>>>> On Sat, Jun 7, 2014 at 5:14 PM, Colin Clark <co...@clark.ws> wrote:
>>>>
>>>>> No, you're not. The partition key will get distributed across the
>>>>> cluster if you're using the random or murmur partitioner.
>>>>>
>>>>
>>>> Yes… I'm aware.  But in practice this is how it will work…
>>>>
>>>> If we create bucket b0, that will get hashed to h0…
>>>>
>>>> So say I have 50 machines performing writes, they are all on the same
>>>> time thanks to ntpd, so they all compute b0 for the current bucket based on
>>>> the time.
>>>>
>>>> That gets hashed to h0…
>>>>
>>>> If h0 is hosted on node0 … then all writes go to node zero for that 1
>>>> second interval.
>>>>
>>>> So all my writes are bottlenecking on one node.  That node is
>>>> *changing* over time… but they're not being dispatched in parallel over N
>>>> nodes.  At most, writes will only ever reach 1 node at a time.
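
The hot-spot effect described above can be reproduced with a toy model.
This is a minimal sketch, not Cassandra's actual token ring: an MD5 hash
taken modulo the node count stands in for the partitioner, and the cluster
size and the way each writer picks a bucket are assumptions.

```python
import hashlib
from collections import Counter

NUM_NODES = 10     # hypothetical cluster size
NUM_WRITERS = 50   # "50 machines performing writes"
NUM_BUCKETS = 100  # bucket prefix range


def node_for(key: str) -> int:
    """Stand-in for the partitioner: hash the key and map it to a node."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES


def nodes_hit(second: int, use_buckets: bool) -> Counter:
    """Count writes per node for one second of traffic from all writers."""
    hits = Counter()
    for writer in range(NUM_WRITERS):
        if use_buckets:
            # Each writer uses its own bucket prefix (an assumption).
            key = f"{writer % NUM_BUCKETS}:{second}"
        else:
            # Every writer computes the same time-based key.
            key = str(second)
        hits[node_for(key)] += 1
    return hits


if __name__ == "__main__":
    time_only = nodes_hit(1402200000, use_buckets=False)
    bucketed = nodes_hit(1402200000, use_buckets=True)
    print(len(time_only), "node(s) receive writes with a time-only key")
    print(len(bucketed), "node(s) receive writes with a bucket prefix")
```

With a time-only key every writer produces the same key, so all 50 writes
in a given second hash to a single node; adding a prefix gives 50 distinct
keys that scatter across the nodes.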
>>>>
>>>>
>>>>
>>>>> You could also ensure that by adding another column, like source to
>>>>> ensure distribution. (Add the seconds to the partition key, not the
>>>>> clustering columns)
>>>>>
>>>>> I can almost guarantee that if you put too much thought into working
>>>>> against what Cassandra offers out of the box, that it will bite you later.
>>>>>
>>>>>
>>>> Sure. I'm trying to avoid the 'bite you later' issues, more so because
>>>> I'm sure there are Cassandra gotchas to worry about.  Everything has them.
>>>> Just trying to avoid the land mines :-P
>>>>
>>>>
>>>>> In fact, the use case that you're describing may best be served by a
>>>>> queuing mechanism, and using Cassandra only for the underlying store.
>>>>>
>>>>
>>>> Yes… that's what I'm doing.  We're using apollo to fan out the queue,
>>>> but the writes go back into Cassandra and need to be read out
>>>> sequentially.
>>>>
>>>>
>>>>>
>>>>> I used this exact same approach in a use case that involved writing
>>>>> over a million events/second to a cluster with no problems.  Initially, I
>>>>> thought ordered partitioner was the way to go too.  And I used separate
>>>>> processes to aggregate, conflate, and handle distribution to clients.
>>>>>
>>>>
>>>>
>>>> Yes. I think using 100 buckets will work for now.  Plus I don't have to
>>>> change the partitioner on our existing cluster and I'm lazy :)
>>>>
>>>>
>>>>>
>>>>> Just my two cents, but I also spend the majority of my days helping
>>>>> people utilize Cassandra correctly, and rescuing those that haven't.
>>>>>
>>>>>
>>>> Definitely appreciate the feedback!  Thanks!
>>>>
>>>> --
>>>>
>>>> Founder/CEO Spinn3r.com
>>>> Location: *San Francisco, CA*
>>>> Skype: *burtonator*
>>>> blog: http://burtonator.wordpress.com
>>>> … or check out my Google+ profile
>>>> <https://plus.google.com/102718274791889610666/posts>
>>>> <http://spinn3r.com>
>>>> War is peace. Freedom is slavery. Ignorance is strength. Corporations
>>>> are people.
>>>>
>>>>


-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.
