Thanks. I was under the impression that you can't process the data as each record comes into the topic? That Kafka doesn't support pushing messages to a consumer the way a traditional message queue does?
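(For what it's worth: Kafka consumers do pull rather than being pushed to, but with the high-level consumer the stream iterator blocks until the next message arrives, so in practice you still handle each record as it comes in. A minimal sketch against the 0.8.x Java API that was current for this thread; the "events" topic name, ZooKeeper address, and handleRecord method are placeholders:)

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;

    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;
    import kafka.message.MessageAndMetadata;

    public class PerRecordConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "localhost:2181"); // placeholder
            props.put("group.id", "per-record-example");      // placeholder

            ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

            // Ask for one stream on the topic (name is hypothetical).
            Map<String, Integer> topicCounts = new HashMap<String, Integer>();
            topicCounts.put("events", 1);
            KafkaStream<byte[], byte[]> stream =
                connector.createMessageStreams(topicCounts).get("events").get(0);

            // The iterator blocks until a message is available, so each
            // record is handled as soon as it reaches our partition(s).
            for (MessageAndMetadata<byte[], byte[]> record : stream) {
                handleRecord(record.message()); // per-record, no batching
            }
        }

        private static void handleRecord(byte[] payload) {
            System.out.println("got " + payload.length + " bytes");
        }
    }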
On 12 February 2015 at 11:06, David McNelis <dmcne...@emergingthreats.net> wrote:

> I'm going to go a bit in reverse for your questions. We built a RESTful
> API to push data to, so that we could submit things from multiple sources
> that aren't necessarily things our team would maintain, as well as
> validate that data before we send it off to a topic.
>
> As for consumers... we expect that a consumer will receive everything in
> a particular partition, but definitely not everything in a topic. For
> example, for DNS information, a consumer expects to see everything for a
> domain (our partition key there), but not for all domains... it would be
> too much data to handle with a low enough latency.
>
> For us, latencies vary by what we're processing at a given moment, or by
> other large jobs that might be running outside of the stream processing.
> Generally speaking, we keep latency within our limits by adjusting the
> number of partitions for a topic, and subsequently the number of
> consumers... which works until you hit a wall with your data store, in
> which case tweaking must happen there too.
>
> Your flow of data will have the biggest impact on latency (along with
> batch size). If you see 100 events a minute and have a batch size of 1k,
> you'll have 10 minutes of latency... if you process your data as each
> record comes in, then as long as your consumer can keep up with that
> load, you'll have very low latency.
>
> Kafka's latency has never been an issue for us; we've always found it's
> something we're doing that is affecting the overall throughput of the
> systems, be it needing to play with the number of partitions, adjusting
> batch size, or resizing the Hadoop cluster to meet increased need there.
>
> On Thu, Feb 12, 2015 at 9:54 AM, Gary Ogden <gog...@gmail.com> wrote:
>
> > Thanks again David. So what kind of latencies are you experiencing
> > with this? If I wanted to act upon certain events and send out alarms
> > (email, SMS, etc.), what kind of delays are you seeing by the time
> > you're able to process them?
> >
> > It seems if you were to create an alarm topic, dump alerts on there to
> > be processed, and have a consumer then process those alerts (save to
> > Cassandra and send out the notification), you could see some sort of
> > delay.
> >
> > But none of your consumers are expecting that they will be getting all
> > the data for that topic, are they? That's the hitch for me. I think I
> > could rework the design so that no consumer would assume it's getting
> > all the messages in the topic.
> >
> > When you say central producer stack, this is something you built
> > outside of Kafka, I assume.
> >
> > On 12 February 2015 at 09:40, David McNelis
> > <dmcne...@emergingthreats.net> wrote:
> >
> > > In our setup we deal with a similar situation (lots of time-series
> > > data that we have to aggregate in a number of different ways).
> > >
> > > Our approach is to push all of our data to a central producer stack;
> > > that stack then submits data to different topics, depending on a set
> > > of predetermined rules.
> > >
> > > For argument's sake, let's say we see 100 events / second... Once
> > > the data is pushed to topics we have several consumer applications.
> > > The first pushes the lowest-level raw data into our cluster (in our
> > > case HBase, but it could just as easily be C*). For this first
> > > consumer we use a small batch size... say 100 records.
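(A side note on the batch sizes described above: the batching lives in the consumer application, which buffers records and flushes every N, so batch size directly sets your worst-case latency; at 100 events/minute, a 1,000-record batch waits roughly 10 minutes to fill, while a 100-record batch waits roughly 1 minute. A rough sketch, assuming the 0.8.x stream from the earlier example, with flushToStore as a hypothetical stand-in for the HBase/C* write:)

    import java.util.ArrayList;
    import java.util.List;

    import kafka.consumer.KafkaStream;
    import kafka.message.MessageAndMetadata;

    public class BatchingConsumer {
        private static final int BATCH_SIZE = 100; // smaller batch => lower latency

        // Buffer records off the stream and flush every BATCH_SIZE.
        static void consume(KafkaStream<byte[], byte[]> stream) {
            List<byte[]> batch = new ArrayList<byte[]>(BATCH_SIZE);
            for (MessageAndMetadata<byte[], byte[]> record : stream) {
                batch.add(record.message());
                if (batch.size() >= BATCH_SIZE) {
                    flushToStore(batch);
                    batch.clear();
                }
            }
        }

        private static void flushToStore(List<byte[]> batch) {
            // hypothetical bulk write to HBase / Cassandra
            System.out.println("flushed " + batch.size() + " records");
        }
    }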
> > > Our second consumer does some aggregation and summarization, and
> > > needs to reference additional resources. This batch size is
> > > considerably larger, say 1000 records in size.
> > >
> > > Our third consumer does a much larger aggregation, where we're
> > > reducing the data down to a much smaller dataset, by orders of
> > > magnitude, and we'll run that batch size at 10k records.
> > >
> > > Our aggregations are oftentimes time-based, like rolling events /
> > > counters up to the hour or day level... so while we may occasionally
> > > re-process some of the same information throughout the day, it lets
> > > us maintain a system where we have near-real-time access to most of
> > > the data we're ingesting.
> > >
> > > This is certainly something we've had to tweak, in terms of the
> > > numbers of consumers / partitions and batch sizes, to get working
> > > optimally, but we're pretty happy with it at the moment.
> > >
> > > On Thu, Feb 12, 2015 at 8:22 AM, Gary Ogden <gog...@gmail.com> wrote:
> > >
> > > > Thanks David. Whether Kafka is the right choice is exactly what
> > > > I'm trying to determine. Everything I want to do with these events
> > > > is time-based. Store them in the topic for 24 hours. Read from the
> > > > topics and get data for a time period (last hour, last 8 hours,
> > > > etc.). This reading from the topics could happen numerous times
> > > > for the same dataset. At the end of the 24 hours, we would then
> > > > want to store the events in long-term storage in case we need them
> > > > in the future.
> > > >
> > > > I'm thinking Kafka isn't the right fit for this use case, though.
> > > > However, we have Cassandra already installed and running. Maybe we
> > > > use Kafka as the in-between... so dump the events onto Kafka, have
> > > > consumers that process the events and dump them into Cassandra,
> > > > and then read the time-series data we need from Cassandra and not
> > > > Kafka.
> > > >
> > > > My concern with this method would be two things:
> > > > 1 - the time delay of dumping them on the queue and then
> > > > processing them into Cassandra before the events are available.
> > > > 2 - is there any need for Kafka now? Why can't the code that puts
> > > > the messages on the topic just insert directly into Cassandra and
> > > > avoid this extra hop?
> > > >
> > > > On 12 February 2015 at 09:01, David McNelis
> > > > <dmcne...@emergingthreats.net> wrote:
> > > >
> > > > > Gary,
> > > > >
> > > > > That is certainly a valid use case. What Zijing was saying is
> > > > > that you can only have 1 consumer per consumer application per
> > > > > partition.
> > > > >
> > > > > I think what it boils down to is how you want your information
> > > > > grouped inside your timeframes. For example, if you want to have
> > > > > everything for a specific user, then you could use that as your
> > > > > partition key, ensuring that any data for that user is processed
> > > > > by the same consumer (in however many consumer applications you
> > > > > opt to run).
> > > > >
> > > > > The second point that I think Zijing was getting at was whether
> > > > > or not your proposed use case makes sense for Kafka. If your
> > > > > goal is to do time-interval batch processing (versus N-record
> > > > > batches), then why use Kafka for it? Why not use something more
> > > > > adept at batch processing?
> > > > > For example, if you're using HBase you can use Pig jobs that
> > > > > would read only the records created between specific timestamps.
> > > > >
> > > > > David
> > > > >
> > > > > On Thu, Feb 12, 2015 at 7:44 AM, Gary Ogden <gog...@gmail.com> wrote:
> > > > >
> > > > > > So it's not possible to have 1 topic with 1 partition and many
> > > > > > consumers of that topic?
> > > > > >
> > > > > > My intention is to have a topic with many consumers, but each
> > > > > > consumer needs to be able to have access to all the messages
> > > > > > in that topic.
> > > > > >
> > > > > > On 11 February 2015 at 20:42, Zijing Guo
> > > > > > <alter...@yahoo.com.invalid> wrote:
> > > > > >
> > > > > > > The partition key works at the producer level: if you have
> > > > > > > multiple partitions for a single topic, you can pass a key
> > > > > > > in with the KeyedMessage object, and based on the
> > > > > > > partitioner.class you configure, it will return a partition
> > > > > > > number, and the producer will find the leader for that
> > > > > > > partition.
> > > > > > >
> > > > > > > I don't know how Kafka could handle the time-series case,
> > > > > > > but it depends on how many partitions that topic has. If you
> > > > > > > only have 1 partition, then you don't need to worry about
> > > > > > > order at all, since each consumer group can only allow 1
> > > > > > > consumer instance to consume that data. If you have multiple
> > > > > > > partitions (say 3, for example), then you can fire up 3
> > > > > > > consumer instances under the same consumer group, and each
> > > > > > > will only consume 1 partition's data. If order within each
> > > > > > > partition matters, then you need to do some work on the
> > > > > > > producer side.
> > > > > > >
> > > > > > > Hope this helps,
> > > > > > > Edwin
> > > > > > >
> > > > > > > On Wednesday, February 11, 2015 3:14 PM, Gary Ogden
> > > > > > > <gog...@gmail.com> wrote:
> > > > > > >
> > > > > > > I'm trying to understand how the partition key works and
> > > > > > > whether I need to specify a partition key for my topics or
> > > > > > > not. What happens if I don't specify a PK and I have more
> > > > > > > than one consumer that wants all messages in a topic for a
> > > > > > > certain period of time? Will those consumers get all the
> > > > > > > messages, but they just may not be ordered correctly?
> > > > > > >
> > > > > > > The current scenario is that we will have events going into
> > > > > > > a topic based on customer, and the data will remain in the
> > > > > > > topic for 24 hours. We will then have multiple consumers
> > > > > > > reading messages from that topic. They will want to be able
> > > > > > > to get them out over a time range (could be the last hour,
> > > > > > > last 8 hours, etc.).
> > > > > > >
> > > > > > > So if I specify the same PK for each subscriber, then each
> > > > > > > consumer will get all messages in the correct order? If I
> > > > > > > don't specify the PK or use a random one, will each consumer
> > > > > > > still get all the messages, but they just won't be ordered
> > > > > > > correctly?
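(Tying Zijing's explanation to a sketch: with the 0.8.x producer you attach the key via KeyedMessage, and the configured partitioner, by default one that hashes the key, maps it to a partition number, so every message for one customer lands in the same partition and is seen in order by exactly one consumer instance in each group. The topic name, key, and broker address below are placeholders:)

    import java.util.Properties;

    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class KeyedProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "localhost:9092"); // placeholder
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            // With a non-null key, the default partitioner picks the
            // partition as hash(key) % numPartitions; set partitioner.class
            // to override that mapping.

            Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));

            // Every message keyed "customer-42" hashes to the same partition,
            // so one consumer per group sees all of them, in order.
            producer.send(new KeyedMessage<String, String>(
                "events", "customer-42", "{\"metric\": 1}"));

            producer.close();
        }
    }

(If the key is omitted, the 0.8 producer spreads messages across partitions over time, so consumers in a group each see an interleaved subset and ordering is only guaranteed within a partition, not across the topic.)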