Thanks. I was under the impression that you can't process the data as each record comes into the topic? That Kafka doesn't support pushing messages to a consumer the way a traditional message queue does?
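(For what it's worth: Kafka consumers do pull rather than being pushed to, but with the high-level consumer the stream iterator blocks until the next message arrives, so in practice you still handle each record as it comes in. A minimal sketch against the 0.8.x Java API that was current for this thread; the "events" topic name, ZooKeeper address, and handleRecord method are placeholders:)

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;

    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;
    import kafka.message.MessageAndMetadata;

    public class PerRecordConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "localhost:2181"); // placeholder
            props.put("group.id", "per-record-example");      // placeholder

            ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

            // Ask for one stream on the topic (name is hypothetical).
            Map<String, Integer> topicCounts = new HashMap<String, Integer>();
            topicCounts.put("events", 1);
            KafkaStream<byte[], byte[]> stream =
                connector.createMessageStreams(topicCounts).get("events").get(0);

            // The iterator blocks until a message is available, so each
            // record is handled as soon as it reaches our partition(s).
            for (MessageAndMetadata<byte[], byte[]> record : stream) {
                handleRecord(record.message()); // per-record, no batching
            }
        }

        private static void handleRecord(byte[] payload) {
            System.out.println("got " + payload.length + " bytes");
        }
    }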
On 12 February 2015 at 11:06, David McNelis <dmcne...@emergingthreats.net> wrote:

> I'm going to go a bit in reverse for your questions. We built a RESTful
> API to push data to, so that we could submit things from multiple sources
> that aren't necessarily things our team would maintain, as well as
> validate that data before we send it off to a topic.
>
> As for consumers... we expect that a consumer will receive everything in
> a particular partition, but definitely not everything in a topic. For
> example, for DNS information, a consumer expects to see everything for a
> domain (our partition key there), but not for all domains... it would be
> too much data to handle with a low enough latency.
>
> For us, latencies vary by what we're processing at a given moment, or by
> other large jobs that might be running outside of the stream processing.
> Generally speaking, we keep latency within our limits by adjusting the
> number of partitions for a topic, and subsequently the number of
> consumers... which works until you hit a wall with your data store, in
> which case tweaking must happen there too.
>
> Your flow of data will have the biggest impact on latency (along with
> batch size). If you see 100 events a minute and have a batch size of 1k,
> you'll have 10 minutes of latency... if you process your data as each
> record comes in, then as long as your consumer can keep up with that
> load, you'll have very low latency.
>
> Kafka's latency has never been an issue for us; we've always found it's
> something we're doing that is affecting the overall throughput of the
> systems, be it needing to play with the number of partitions, adjusting
> batch size, or resizing the Hadoop cluster to meet increased need there.
>
> On Thu, Feb 12, 2015 at 9:54 AM, Gary Ogden <gog...@gmail.com> wrote:
>
> > Thanks again David. So what kind of latencies are you experiencing
> > with this? If I wanted to act upon certain events and send out alarms
> > (email, SMS, etc.), what kind of delays are you seeing by the time
> > you're able to process them?
> >
> > It seems if you were to create an alarm topic, dump alerts on there to
> > be processed, and have a consumer then process those alerts (save to
> > Cassandra and send out the notification), you could see some sort of
> > delay.
> >
> > But none of your consumers are expecting that they will be getting all
> > the data for that topic, are they? That's the hitch for me. I think I
> > could rework the design so that no consumer would assume it's getting
> > all the messages in the topic.
> >
> > When you say central producer stack, this is something you built
> > outside of Kafka, I assume.
> >
> > On 12 February 2015 at 09:40, David McNelis
> > <dmcne...@emergingthreats.net> wrote:
> >
> > > In our setup we deal with a similar situation (lots of time-series
> > > data that we have to aggregate in a number of different ways).
> > >
> > > Our approach is to push all of our data to a central producer stack;
> > > that stack then submits data to different topics, depending on a set
> > > of predetermined rules.
> > >
> > > For argument's sake, let's say we see 100 events / second... Once
> > > the data is pushed to topics we have several consumer applications.
> > > The first pushes the lowest-level raw data into our cluster (in our
> > > case HBase, but it could just as easily be C*). For this first
> > > consumer we use a small batch size... say 100 records.
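(A side note on the batch sizes described above: the batching lives in the consumer application, which buffers records and flushes every N, so batch size directly sets your worst-case latency; at 100 events/minute, a 1,000-record batch waits roughly 10 minutes to fill, while a 100-record batch waits roughly 1 minute. A rough sketch, assuming the 0.8.x stream from the earlier example, with flushToStore as a hypothetical stand-in for the HBase/C* write:)

    import java.util.ArrayList;
    import java.util.List;

    import kafka.consumer.KafkaStream;
    import kafka.message.MessageAndMetadata;

    public class BatchingConsumer {
        private static final int BATCH_SIZE = 100; // smaller batch => lower latency

        // Buffer records off the stream and flush every BATCH_SIZE.
        static void consume(KafkaStream<byte[], byte[]> stream) {
            List<byte[]> batch = new ArrayList<byte[]>(BATCH_SIZE);
            for (MessageAndMetadata<byte[], byte[]> record : stream) {
                batch.add(record.message());
                if (batch.size() >= BATCH_SIZE) {
                    flushToStore(batch);
                    batch.clear();
                }
            }
        }

        private static void flushToStore(List<byte[]> batch) {
            // hypothetical bulk write to HBase / Cassandra
            System.out.println("flushed " + batch.size() + " records");
        }
    }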
> > > Our second consumer does some aggregation and summarization, and
> > > needs to reference additional resources. This batch size is
> > > considerably larger, say 1000 records in size.
> > >
> > > Our third consumer does a much larger aggregation, where we're
> > > reducing the data down to a much smaller dataset, by orders of
> > > magnitude, and we'll run that batch size at 10k records.
> > >
> > > Our aggregations are oftentimes time-based, like rolling events /
> > > counters up to the hour or day level... so while we may occasionally
> > > re-process some of the same information throughout the day, it lets
> > > us maintain a system where we have near-real-time access to most of
> > > the data we're ingesting.
> > >
> > > This is certainly something we've had to tweak, in terms of the
> > > numbers of consumers / partitions and batch sizes, to get working
> > > optimally, but we're pretty happy with it at the moment.
> > >
> > > On Thu, Feb 12, 2015 at 8:22 AM, Gary Ogden <gog...@gmail.com> wrote:
> > >
> > > > Thanks David. Whether Kafka is the right choice is exactly what
> > > > I'm trying to determine. Everything I want to do with these events
> > > > is time-based. Store them in the topic for 24 hours. Read from the
> > > > topics and get data for a time period (last hour, last 8 hours,
> > > > etc.). This reading from the topics could happen numerous times
> > > > for the same dataset. At the end of the 24 hours, we would then
> > > > want to store the events in long-term storage in case we need them
> > > > in the future.
> > > >
> > > > I'm thinking Kafka isn't the right fit for this use case, though.
> > > > However, we have Cassandra already installed and running. Maybe we
> > > > use Kafka as the in-between... so dump the events onto Kafka, have
> > > > consumers that process the events and dump them into Cassandra,
> > > > and then read the time-series data we need from Cassandra and not
> > > > Kafka.
> > > >
> > > > My concern with this method would be two things:
> > > > 1 - the time delay of dumping them on the queue and then
> > > > processing them into Cassandra before the events are available.
> > > > 2 - is there any need for Kafka now? Why can't the code that puts
> > > > the messages on the topic just insert directly into Cassandra and
> > > > avoid this extra hop?
> > > >
> > > > On 12 February 2015 at 09:01, David McNelis
> > > > <dmcne...@emergingthreats.net> wrote:
> > > >
> > > > > Gary,
> > > > >
> > > > > That is certainly a valid use case. What Zijing was saying is
> > > > > that you can only have 1 consumer per consumer application per
> > > > > partition.
> > > > >
> > > > > I think what it boils down to is how you want your information
> > > > > grouped inside your timeframes. For example, if you want to have
> > > > > everything for a specific user, then you could use that as your
> > > > > partition key, ensuring that any data for that user is processed
> > > > > by the same consumer (in however many consumer applications you
> > > > > opt to run).
> > > > >
> > > > > The second point that I think Zijing was getting at was whether
> > > > > or not your proposed use case makes sense for Kafka. If your
> > > > > goal is to do time-interval batch processing (versus N-record
> > > > > batches), then why use Kafka for it? Why not use something more
> > > > > adept at batch processing?
> > > > > For example, if you're using HBase you can use Pig jobs that
> > > > > would read only the records created between specific timestamps.
> > > > >
> > > > > David
> > > > >
> > > > > On Thu, Feb 12, 2015 at 7:44 AM, Gary Ogden <gog...@gmail.com> wrote:
> > > > >
> > > > > > So it's not possible to have 1 topic with 1 partition and many
> > > > > > consumers of that topic?
> > > > > >
> > > > > > My intention is to have a topic with many consumers, but each
> > > > > > consumer needs to be able to have access to all the messages
> > > > > > in that topic.
> > > > > >
> > > > > > On 11 February 2015 at 20:42, Zijing Guo
> > > > > > <alter...@yahoo.com.invalid> wrote:
> > > > > >
> > > > > > > The partition key works at the producer level: if you have
> > > > > > > multiple partitions for a single topic, you can pass a key
> > > > > > > in with the KeyedMessage object, and based on the
> > > > > > > partitioner.class you configure, it will return a partition
> > > > > > > number, and the producer will find the leader for that
> > > > > > > partition.
> > > > > > >
> > > > > > > I don't know how Kafka could handle the time-series case,
> > > > > > > but it depends on how many partitions that topic has. If you
> > > > > > > only have 1 partition, then you don't need to worry about
> > > > > > > order at all, since each consumer group can only allow 1
> > > > > > > consumer instance to consume that data. If you have multiple
> > > > > > > partitions (say 3, for example), then you can fire up 3
> > > > > > > consumer instances under the same consumer group, and each
> > > > > > > will only consume 1 partition's data. If order within each
> > > > > > > partition matters, then you need to do some work on the
> > > > > > > producer side.
> > > > > > >
> > > > > > > Hope this helps,
> > > > > > > Edwin
> > > > > > >
> > > > > > > On Wednesday, February 11, 2015 3:14 PM, Gary Ogden
> > > > > > > <gog...@gmail.com> wrote:
> > > > > > >
> > > > > > > I'm trying to understand how the partition key works and
> > > > > > > whether I need to specify a partition key for my topics or
> > > > > > > not. What happens if I don't specify a PK and I have more
> > > > > > > than one consumer that wants all messages in a topic for a
> > > > > > > certain period of time? Will those consumers get all the
> > > > > > > messages, but they just may not be ordered correctly?
> > > > > > >
> > > > > > > The current scenario is that we will have events going into
> > > > > > > a topic based on customer, and the data will remain in the
> > > > > > > topic for 24 hours. We will then have multiple consumers
> > > > > > > reading messages from that topic. They will want to be able
> > > > > > > to get them out over a time range (could be the last hour,
> > > > > > > last 8 hours, etc.).
> > > > > > >
> > > > > > > So if I specify the same PK for each subscriber, then each
> > > > > > > consumer will get all messages in the correct order? If I
> > > > > > > don't specify the PK or use a random one, will each consumer
> > > > > > > still get all the messages, but they just won't be ordered
> > > > > > > correctly?
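(Tying Zijing's explanation to a sketch: with the 0.8.x producer you attach the key via KeyedMessage, and the configured partitioner, by default one that hashes the key, maps it to a partition number, so every message for one customer lands in the same partition and is seen in order by exactly one consumer instance in each group. The topic name, key, and broker address below are placeholders:)

    import java.util.Properties;

    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class KeyedProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "localhost:9092"); // placeholder
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            // With a non-null key, the default partitioner picks the
            // partition as hash(key) % numPartitions; set partitioner.class
            // to override that mapping.

            Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));

            // Every message keyed "customer-42" hashes to the same partition,
            // so one consumer per group sees all of them, in order.
            producer.send(new KeyedMessage<String, String>(
                "events", "customer-42", "{\"metric\": 1}"));

            producer.close();
        }
    }

(If the key is omitted, the 0.8 producer spreads messages across partitions over time, so consumers in a group each see an interleaved subset and ordering is only guaranteed within a partition, not across the topic.)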