What's 'source'? You mean like the URL? If source is too random, it's going to yield too many buckets.
Ingestion rates are fairly high but not insane. About 4M inserts per hour, from 5-10GB.

On Sat, Jun 7, 2014 at 7:13 PM, Colin Clark <co...@clark.ws> wrote:

> Not if you add another column to the partition key; source, for example.
>
> I would really try to stay away from the ordered partitioner if at all
> possible.
>
> What ingestion rates are you expecting, in size and speed?
>
> --
> Colin
> 320-221-9531
>
> On Jun 7, 2014, at 9:05 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>
> Thanks for the feedback on this btw... it's helpful. My notes below.
>
> On Sat, Jun 7, 2014 at 5:14 PM, Colin Clark <co...@clark.ws> wrote:
>
>> No, you're not; the partition key will get distributed across the cluster
>> if you're using random or murmur.
>
> Yes, I'm aware. But in practice this is how it will work:
>
> If we create bucket b0, that will get hashed to h0.
>
> So say I have 50 machines performing writes. They are all on the same time
> thanks to ntpd, so they all compute b0 for the current bucket based on the
> time.
>
> That gets hashed to h0.
>
> If h0 is hosted on node0, then all writes go to node zero for that 1
> second interval.
>
> So all my writes are bottlenecking on one node. That node is *changing*
> over time, but the writes are not being dispatched in parallel over N
> nodes. At most, writes will only ever reach 1 node at a time.
>
>> You could also ensure that by adding another column, like source, to
>> ensure distribution. (Add the seconds to the partition key, not the
>> clustering columns.)
>>
>> I can almost guarantee that if you put too much thought into working
>> against what Cassandra offers out of the box, it will bite you later.
>
> Sure... I'm trying to avoid the 'bite you later' issues. More so because
> I'm sure there are Cassandra gotchas to worry about. Everything has them.
> Just trying to avoid the land mines :-P
>
>> In fact, the use case that you're describing may best be served by a
>> queuing mechanism, and using Cassandra only for the underlying store.
>
> Yes, that's what I'm doing. We're using Apollo to fan out the queue, but
> the writes go back into Cassandra and need to be read out sequentially.
>
>> I used this exact same approach in a use case that involved writing over
>> a million events/second to a cluster with no problems. Initially, I
>> thought ordered partitioner was the way to go too. And I used separate
>> processes to aggregate, conflate, and handle distribution to clients.
>
> Yes. I think using 100 buckets will work for now. Plus I don't have to
> change the partitioner on our existing cluster, and I'm lazy :)
>
>> Just my two cents, but I also spend the majority of my days helping
>> people utilize Cassandra correctly, and rescuing those that haven't.
>
> Definitely appreciate the feedback! Thanks!

--
Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
... or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.
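For what it's worth, the 100-bucket scheme discussed in this thread can be sketched roughly as below. This is only an illustration of the idea, not code from either poster: the function names, the md5-based bucketing, and the event-id parameter are all my own assumptions. The point is that the partition key carries a bucket number 0-99 alongside the time bucket, so writes for a given second hash to ~100 tokens instead of one, while readers can still reassemble a second by scanning all 100 buckets.

```python
# Sketch of a time-bucket + hash-bucket compound partition key (hypothetical
# names; in CQL this would correspond to PRIMARY KEY ((time_bucket,
# hash_bucket), ...)).
import hashlib
import time

NUM_BUCKETS = 100  # the "100 buckets" figure from the thread


def partition_key(event_id, now=None):
    """Return (time_bucket, hash_bucket) for an event.

    time_bucket: the wall-clock second (all writers agree via ntpd).
    hash_bucket: a stable hash of the event id, mod 100, so the same event
    always maps to the same partition while distinct events fan out across
    100 partitions per second instead of hot-spotting one node.
    """
    time_bucket = int(now if now is not None else time.time())
    # md5 rather than the builtin hash(): stable across processes/machines.
    digest = hashlib.md5(event_id.encode("utf-8")).digest()
    hash_bucket = int.from_bytes(digest[:8], "big") % NUM_BUCKETS
    return (time_bucket, hash_bucket)


def buckets_for_second(second):
    """Partition keys a sequential reader must query to cover one second."""
    return [(second, b) for b in range(NUM_BUCKETS)]
```

Reads cost 100 small queries per second of data, but writes within any one second are spread across up to 100 token ranges on the default murmur partitioner, which is the trade-off being accepted above.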