Re: Data model for streaming a large table in real time.

2014-06-07 Thread Colin
I believe ByteOrderedPartitioner is being deprecated, and for good reason. I would look at what you could achieve by using wide rows and Murmur3Partitioner. -- Colin 320-221-9531 > On Jun 6, 2014, at 5:27 PM, Kevin Burton wrote: > > We have the requirement to have clients read from our tabl

Re: Data model for streaming a large table in real time.

2014-06-07 Thread DuyHai Doan
"One node would take all the load, followed by the next node" --> with this design, you are not exploiting the full power of the cluster. If only one node takes all the load at a time, what is the point of having 10 or 20 nodes? You'd be better off using limited wide rows with bucketing to achieve this
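DuyHai's suggestion of limited wide rows with bucketing can be sketched as follows. This is a minimal illustration, not a real schema: the bucket count and the hashing scheme are assumptions, and in practice the bucket number would be part of the table's partition key.

```python
import hashlib

# Minimal sketch of limited-wide-row bucketing (bucket count and hash function
# are illustrative assumptions). Spreading rows over a fixed set of buckets
# lets many nodes take write load at once instead of one "current" node.

NUM_BUCKETS = 100  # assumed; would be part of the partition key in CQL

def bucket_for(item_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Stable hash of the item id, reduced to a bucket number."""
    digest = hashlib.md5(item_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets

# 1000 distinct ids fan out across (nearly) all buckets:
used = {bucket_for(f"item-{i}") for i in range(1000)}
```

Because the bucket is derived from a stable hash, the same id always lands in the same bucket, so reads know where to look.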

Re: Data model for streaming a large table in real time.

2014-06-07 Thread Kevin Burton
I just checked the source and in 2.1.0 it's not deprecated. So it *might* be *being* deprecated but I haven't seen anything stating that. On Sat, Jun 7, 2014 at 8:03 AM, Colin wrote: > I believe ByteOrderedPartitioner is being deprecated and for good reason. > I would look at what you could a

Re: Data model for streaming a large table in real time.

2014-06-07 Thread Colin Clark
It's an anti-pattern and there are better ways to do this. I have implemented the paging algorithm you've described using wide rows and bucketing. This approach is a more efficient utilization of Cassandra's built-in wholesome goodness. Also, I wouldn't let any number of clients (huge) connect d

Re: Data model for streaming a large table in real time.

2014-06-07 Thread Kevin Burton
On Sat, Jun 7, 2014 at 10:41 AM, Colin Clark wrote: > It's an anti-pattern and there are better ways to do this. > > Entirely possible :) It would be nice to have a document with a bunch of common Cassandra design patterns. I've been trying to track down a pattern for this and a lot of this is

A list of all potential problems when using byte ordered partitioner?

2014-06-07 Thread Kevin Burton
I believe I'm aware of the problems that can arise due to the byte ordered partitioner. Is there a full list of ALL the problems? I want to make sure I'm not missing anything. The main problems I'm aware of are: ... "natural" inserts where the key is something like a username will tend to have
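The hotspot problem with "natural" keys can be shown with a toy model. Everything here is invented for illustration (four nodes, an a-z token range split evenly); the point is only that a byte-ordered partitioner places lexically similar keys on the same node, while a hashed partitioner ignores spelling.

```python
import hashlib

NODES = 4

def bop_node(key: str) -> int:
    # Byte-ordered: the token *is* the key, so placement follows spelling.
    # Toy split of the a-z keyspace into four contiguous node ranges.
    first = key[0].lower()
    return min((ord(first) - ord("a")) * NODES // 26, NODES - 1)

def hashed_node(key: str) -> int:
    # Murmur3-like: the token is a hash, so placement ignores spelling.
    h = int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")
    return h % NODES

usernames = ["sam", "sara", "scott", "sean", "stella", "steve",
             "stan", "sue", "sally", "seth", "sonia", "sven"]
bop_placement = {bop_node(u) for u in usernames}        # all on one hot node
hashed_placement = {hashed_node(u) for u in usernames}  # spread around
```

All twelve usernames start with "s", so under the byte-ordered scheme they map to a single node's range; the hashed scheme scatters them.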

Re: Data model for streaming a large table in real time.

2014-06-07 Thread Kevin Burton
Another way around this is to have a separate table storing the number of buckets. This way if you have too few buckets, you can just increase them in the future. Of course, the older data will still have too few buckets :-( On Sat, Jun 7, 2014 at 11:09 AM, Kevin Burton wrote: > > On Sat, Jun

Re: Data model for streaming a large table in real time.

2014-06-07 Thread Colin
Maybe it makes sense to describe what you're trying to accomplish in more detail. A common bucketing approach is along the lines of year, month, day, hour, minute, etc and then use a timeuuid as a cluster column. Depending upon the semantics of the transport protocol you plan on utilizing, e
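Colin's year/month/day/hour/minute bucketing with a timeuuid clustering column might look like the sketch below. The table and column names are assumptions; only the shape (coarse time bucket as partition key, timeuuid as clustering column) comes from the thread.

```python
from datetime import datetime, timezone

# Illustrative schema (names assumed): partition on a coarse time bucket,
# cluster on a timeuuid so rows within a bucket stay time-ordered.
SCHEMA = """
CREATE TABLE crawl_events (
    bucket   text,        -- e.g. '2014-06-07-21' (year-month-day-hour)
    event_id timeuuid,
    payload  blob,
    PRIMARY KEY (bucket, event_id)
);
"""

def hour_bucket(ts: datetime) -> str:
    """Partition bucket at hour granularity; coarser or finer also works."""
    return ts.strftime("%Y-%m-%d-%H")

b = hour_bucket(datetime(2014, 6, 7, 21, 30, tzinfo=timezone.utc))
```

The trade-off discussed later in the thread applies here: at any given moment all writers target the current bucket, so a pure time bucket concentrates load on one partition at a time.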

Re: Data model for streaming a large table in real time.

2014-06-07 Thread Kevin Burton
On Sat, Jun 7, 2014 at 1:34 PM, Colin wrote: > Maybe it makes sense to describe what you're trying to accomplish in more > detail. > > Essentially, I'm appending writes of recent data from our crawler and sending that data to our customers. They need to sync up to the latest writes…we need to get th

Re: Data model for streaming a large table in real time.

2014-06-07 Thread Colin
Then add seconds to the bucket. Also, the data will get cached; it's not going to hit disk on every read. Look at the key cache settings on the table. Also, in 2.1 you have even more control over caching. -- Colin 320-221-9531 > On Jun 7, 2014, at 4:30 PM, Kevin Burton wrote: > > >> On Sat

Re: Data model for streaming a large table in real time.

2014-06-07 Thread Kevin Burton
well you could add milliseconds, but at best you're still bottlenecking most of your writes on one box.. maybe 2-3 if there are ones that are lagging. Anyway.. I think using 100 buckets is probably fine.. Kevin On Sat, Jun 7, 2014 at 2:45 PM, Colin wrote: > Then add seconds to the bucket. Also,

Re: Data model for streaming a large table in real time.

2014-06-07 Thread Colin Clark
No, you're not; the partition key will get distributed across the cluster if you're using random or murmur. You could also ensure that by adding another column, like source, to ensure distribution. (Add the seconds to the partition key, not the clustering columns.) I can almost guarantee that if you
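Colin's point, moving the seconds (and optionally a source column) into the partition key rather than the clustering columns, can be sketched like this. The helper and column names are assumptions for illustration.

```python
from datetime import datetime, timezone

def partition_key(source: str, ts: datetime) -> tuple:
    # Composite partition key (source, second). Two writers active in the
    # same second still land on different partitions; clustering columns
    # would then order events *within* each partition.
    return (source, ts.strftime("%Y-%m-%d %H:%M:%S"))

t = datetime(2014, 6, 7, 21, 30, 15, tzinfo=timezone.utc)
k1 = partition_key("crawler-a", t)
k2 = partition_key("crawler-b", t)
```

With seconds in the partition key, even a single source rotates to a new partition every second instead of appending to one ever-growing row.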

Re: problem removing dead node from ring

2014-06-07 Thread Curious Patient
Hey all, OK I gave removing the downed node from the cassandra ring another try. To recap what's going on, this is what my ring looks like with nodetool status: [root@beta-new:~] #nodetool status Datacenter: datacenter1 === Status=Up/Down |/ State=Normal/Leaving/Joining/

Re: Data model for streaming a large table in real time.

2014-06-07 Thread Kevin Burton
Thanks for the feedback on this btw… it's helpful. My notes below. On Sat, Jun 7, 2014 at 5:14 PM, Colin Clark wrote: > No, you're not; the partition key will get distributed across the cluster > if you're using random or murmur. > Yes… I'm aware. But in practice this is how it will work… I

Re: Data model for streaming a large table in real time.

2014-06-07 Thread Colin Clark
Not if you add another column to the partition key; source, for example. I would really try to stay away from the ordered partitioner if at all possible. What ingestion rates are you expecting, in size and speed? -- Colin 320-221-9531 On Jun 7, 2014, at 9:05 PM, Kevin Burton wrote: Thanks fo

Re: Data model for streaming a large table in real time.

2014-06-07 Thread Kevin Burton
What's 'source'? You mean like the URL? If source is too random it's going to yield too many buckets. Ingestion rates are fairly high but not insane. About 4M inserts per hour.. from 5-10GB… On Sat, Jun 7, 2014 at 7:13 PM, Colin Clark wrote: > Not if you add another column to the partition key

Object mapper for CQL

2014-06-07 Thread Kevin Burton
Looks like the java-driver is working on an object mapper: "More modules including a simple object mapper will come shortly." But of course I need one now … I'm curious what others are doing here. I don't want to pass around Row objects in my code if I can avoid it.. Ideally I would just run a qu
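While waiting for an official mapper, a thin mapping layer is easy to roll by hand. The sketch below is in Python for brevity (the thread is about the Java driver); it assumes rows can be read as dicts and simply matches column names to declared fields — names here are hypothetical.

```python
from dataclasses import dataclass, fields

@dataclass
class User:
    username: str
    email: str

def map_row(row: dict, cls):
    """Build an instance of cls from a row dict, keeping only declared fields."""
    wanted = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in row.items() if k in wanted})

# Extra columns in the row are ignored rather than raising:
u = map_row({"username": "kevin", "email": "k@example.com", "token": 42}, User)
```

The same shape works in Java with reflection or annotations, which is roughly what the driver's announced mapper module later provided.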

Re: Data model for streaming a large table in real time.

2014-06-07 Thread Colin Clark
With 100 nodes, that ingestion rate is actually quite low and I don't think you'd need another column in the partition key. You seem to be set in your current direction. Let us know how it works out. -- Colin 320-221-9531 On Jun 7, 2014, at 9:18 PM, Kevin Burton wrote: What's 'source' ? You

Re: Data model for streaming a large table in real time.

2014-06-07 Thread Kevin Burton
Oh.. To start with we're going to use from 2-10 nodes.. I think we're going to take the original strategy and just use 100 buckets .. 0-99… then the timestamp under that.. I think it should be fine and won't require an ordered partitioner. :) Thanks! On Sat, Jun 7, 2014 at 7:38 PM, Colin Cl
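On the read side, the 100-bucket design means a consumer queries each bucket (rows are already time-ordered within a bucket) and merges the streams by timestamp. A minimal sketch, with in-memory lists standing in for per-bucket query results:

```python
import heapq

def merge_buckets(per_bucket_rows):
    """Merge per-bucket (timestamp, row) lists, each already sorted by
    timestamp, into one globally time-ordered stream."""
    return list(heapq.merge(*per_bucket_rows, key=lambda pair: pair[0]))

bucket_0 = [(1, "a"), (4, "d")]
bucket_1 = [(2, "b"), (3, "c")]
merged = merge_buckets([bucket_0, bucket_1])
# merged -> [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
```

heapq.merge streams the inputs lazily, so the same pattern works when each bucket is a paged query iterator rather than a list.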

Re: Data model for streaming a large table in real time.

2014-06-07 Thread Colin
To have any redundancy in the system, start with at least 3 nodes and a replication factor of 3. Try to have at least 8 cores, 32 gig ram, and separate disks for log and data. Will you be replicating data across data centers? -- Colin 320-221-9531 > On Jun 7, 2014, at 9:40 PM, Kevin Burton w

Re: Data model for streaming a large table in real time.

2014-06-07 Thread Kevin Burton
Right now I'm just putting everything together as a proof of concept… so just two cheap replicas for now. And it's at 1/10th of the load. If we lose data it's ok :) I think our config will be 2-3x 400GB SSDs in RAID0, 3 replicas, 16 cores, probably 48-64GB of RAM each box. Just one datacent

Re: Data model for streaming a large table in real time.

2014-06-07 Thread James Campbell
This is a basic question, but having heard that advice before, I'm curious why the minimum recommended replication factor is three? Certainly additional redundancy, and, I believe, a minimum threshold for Paxos. Are there other reasons? On Jun 7, 2014 10:52 PM, Colin wrote: To have any r

Re: Data model for streaming a large table in real time.

2014-06-07 Thread Colin Clark
Write Consistency Level + Read Consistency Level > Replication Factor ensures your reads are consistent, and having 3 nodes lets you achieve redundancy in the event of node failure. So writing with CL of local quorum and reading with CL of local quorum (2+2>3) with a replication factor of 3 ensure
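The arithmetic behind this rule is simple: a quorum is floor(RF/2)+1 replicas, and as long as write CL + read CL exceeds RF, every read quorum must overlap every write quorum. A quick check:

```python
def quorum(rf: int) -> int:
    """Smallest majority of rf replicas."""
    return rf // 2 + 1

rf = 3
w = quorum(rf)  # 2 (LOCAL_QUORUM writes)
r = quorum(rf)  # 2 (LOCAL_QUORUM reads)
overlap_guaranteed = (w + r) > rf  # 2 + 2 > 3: read and write sets must share a replica
```

With RF=3, any two quorums of 2 share at least one node, so a quorum read always sees the latest quorum write.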

Re: Data model for streaming a large table in real time.

2014-06-07 Thread Colin Clark
You won't need containers - running one instance of Cassandra in that configuration will hum along quite nicely and will make use of the cores and memory. I'd forget the RAID anyway and just mount the disks separately (JBOD). -- Colin 320-221-9531 On Jun 7, 2014, at 10:02 PM, Kevin Burton wrote

Re: Data model for streaming a large table in real time.

2014-06-07 Thread Kevin Burton
We're using containers for other reasons, not just Cassandra. Tightly constraining resources means we don't have to worry about Cassandra, the JVM, or Linux doing something silly and using too many resources and taking down the whole box. On Sat, Jun 7, 2014 at 8:25 PM, Colin Clark wrote: >

Re: Object mapper for CQL

2014-06-07 Thread Kuldeep Mishra
Kundera is a high-level Java client for Cassandra that supports CQL. You can find it here: https://github.com/impetus-opensource/Kundera. Other useful links are https://github.com/impetus-opensource/Kundera/wiki/Getting-Started-in-5-minutes https://github.com/impetus-opensource/Kundera/