Vipul,

The problem is that the producer does not know when it should set the window start and window end boundaries. The data does not arrive in order. I also think it is difficult to get the offset of each boundary and pull only the messages between those boundaries: I am already trying to avoid using the SimpleConsumer, but apparently what you describe cannot be achieved with the high-level consumer group.

Chen
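For reference, the offset-bounded pull being discussed here is only possible through the low-level consumer. The sketch below shows roughly what it might look like with the 0.8 SimpleConsumer, assuming the boundary offsets for one partition have already been recorded somewhere; the broker host, topic name, client id, and offsets are placeholders, and leader discovery and error handling are omitted.

import java.nio.ByteBuffer;

import kafka.api.FetchRequest;
import kafka.api.FetchRequestBuilder;
import kafka.javaapi.FetchResponse;
import kafka.javaapi.consumer.SimpleConsumer;
import kafka.message.MessageAndOffset;

public class WindowBatchReader {
    public static void main(String[] args) throws Exception {
        String topic = "campaigns";       // hypothetical topic name
        int partition = 0;
        long windowStart = 1000L;         // offset of the window-begin marker (recorded earlier)
        long windowEnd = 5000L;           // offset of the window-end marker (recorded earlier)

        // Must point at the current leader for this partition; leader lookup is omitted here.
        SimpleConsumer consumer =
                new SimpleConsumer("broker1", 9092, 100000, 64 * 1024, "window-reader");

        long offset = windowStart;
        boolean done = false;
        while (!done && offset < windowEnd) {
            FetchRequest req = new FetchRequestBuilder()
                    .clientId("window-reader")
                    .addFetch(topic, partition, offset, 1024 * 1024)
                    .build();
            FetchResponse resp = consumer.fetch(req);

            long before = offset;
            for (MessageAndOffset mao : resp.messageSet(topic, partition)) {
                if (mao.offset() >= windowEnd) {   // stop at the window-end boundary
                    done = true;
                    break;
                }
                ByteBuffer payload = mao.message().payload();
                byte[] bytes = new byte[payload.limit()];
                payload.get(bytes);
                // hand the decoded message to the actual send pipeline here
                offset = mao.nextOffset();
            }
            if (offset == before) {
                done = true;                        // nothing new returned; avoid spinning
            }
        }
        consumer.close();
    }
}

This is the bookkeeping Chen is trying to avoid: the high-level consumer group only reads forward from its committed position, so it cannot be pointed at an arbitrary boundary offset.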
On Mon, Aug 11, 2014 at 8:04 PM, vipul jhawar <vipul.jha...@gmail.com> wrote:

> Your use case requires messages to be pushed out when their time comes instead of the order in which they arrived, so Kafka may not be the best fit for this: within the queue you want some message batches to be sent out early and some later. There could be another way to solve this with offset management, as Kafka is mainly pull based and you can pull a message from the offset you desire, which can solve your case without dealing with new topics per day or a T time window.
>
> If you consider a fixed time window of T (say 6 minutes) on the producer side as one big chunk, then you could have your producer send a window-begin and a window-end boundary message to all partitions in the topic. Now, as you said that particular chunk will need to be sent when its time comes, you could simply have a message player on your consumer side which just looks for the window messages and keeps a note of their offsets. These offsets may be mapped to a trigger time, which is when you want to start sending that complete batch. Your consumers are triggered with the required offsets at different times, and they just go and pull out the big batch between the start and end boundary. As the same start and end boundary exists across all partitions, it doesn't matter which partition your messages landed on. The only issue I see with this is how you maintain message expiry: if you have a use case which ensures that you can only schedule n days ahead, then you set expiry to n days, you will not have to deal with deletion at all, and the same topic with multiple partitions would get you what you need.
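A minimal sketch of the producer side of this proposal, assuming the 0.8.1 producer API: each window-begin (and, later, window-end) marker is keyed with the target partition number and routed through a partitioner that honours that key, so one copy lands in every partition. The topic name, broker list, window label, and marker payload format are all made up for illustration.

import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.Partitioner;
import kafka.producer.ProducerConfig;
import kafka.utils.VerifiableProperties;

public class WindowBoundaryProducer {

    // Routes each message to the partition named by its key, so a boundary
    // marker can be addressed to a specific partition.
    public static class ExplicitPartitioner implements Partitioner {
        public ExplicitPartitioner(VerifiableProperties props) { }
        public int partition(Object key, int numPartitions) {
            return Integer.parseInt((String) key) % numPartitions;
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092");                  // placeholder broker
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("key.serializer.class", "kafka.serializer.StringEncoder");
        props.put("partitioner.class", ExplicitPartitioner.class.getName());

        Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));

        String topic = "campaigns";                 // hypothetical topic
        int numPartitions = 3;
        String windowId = "20140811-2000";          // hypothetical 6-minute window label

        // One window-begin marker per partition; a matching WINDOW_END marker
        // would be sent the same way once the producer closes the window.
        for (int p = 0; p < numPartitions; p++) {
            producer.send(new KeyedMessage<String, String>(
                    topic, Integer.toString(p), "WINDOW_BEGIN:" + windowId));
        }
        producer.close();
    }
}

The consumer-side "message player" would then record the offset at which each marker appears in each partition and, at trigger time, replay from that offset as in the earlier sketch.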
> On Tue, Aug 12, 2014 at 8:04 AM, Chen Wang <chen.apache.s...@gmail.com> wrote:
>
> > The data does have a timestamp: it is actually email campaigns with a scheduled send time. But since they can be scheduled ahead (e.g., two days ahead), I cannot read them when they arrive; they have to wait until their actual scheduled send time. As you can tell, the sequence within the 6 minutes does not matter, but it does matter to ensure messages in the first 6 minutes are sent out before those in the next 6 minutes, thus reading at the end of every 6 minutes.
> >
> > Therefore it's hard for me to put the messages into different partitions of a single topic, because 6 minutes of data might already be too big for a single partition, let alone the offset management chaos.
> >
> > It seems that a fast queue system is the right tool to use, but it involves more setup and cluster maintenance overhead. My thought is to use the existing kafka cluster, with the hope that the topic deletion api will be available soon. Meantime, just have a cron job cleaning up the outdated topics from zookeeper.
> >
> > Let me know what you think,
> > Thanks,
> > Chen
> >
> > On Mon, Aug 11, 2014 at 6:53 PM, Philip O'Toole <philip.oto...@yahoo.com.invalid> wrote:
> >
> > > Why do you need to read it every 6 minutes? Why not just read it as it arrives? If it naturally arrives in 6-minute bursts, you'll read it in 6-minute bursts, no?
> > >
> > > Perhaps the data does not have timestamps embedded in it, so that is why you are relying on time-based topic names? In that case I would have an intermediate stage that tags the data with the timestamp and then writes it to a single topic, and then processes it at your leisure in a third stage.
> > >
> > > Perhaps I am still missing a key difficulty with your system.
> > >
> > > Your original suggestion is going to be difficult to get working. You'll quickly run out of file descriptors, amongst other issues.
> > >
> > > Philip
> > >
> > > ---------------------------------
> > > http://www.philipotoole.com
> > >
> > > On Aug 11, 2014, at 6:42 PM, Chen Wang <chen.apache.s...@gmail.com> wrote:
> > >
> > > > "And if you can't consume it all within 6 minutes, partition the topic until you can run enough consumers such that you can keep up" -- this is what I intend to do for each 6-minute topic.
> > > >
> > > > What I really need is a partitioned queue: each 6 minutes of data can be put into a separate partition, so that I can read that specific partition at the end of each 6 minutes. Apparently redis naturally fits this case, but the only issue is the performance (well, also some tricks in ensuring reliable message delivery). As I said, we have kafka infrastructure in place; if I can make the design work with kafka without too much effort, I would rather go this path instead of setting up another queue system.
> > > >
> > > > Chen
> > > >
> > > > On Mon, Aug 11, 2014 at 6:07 PM, Philip O'Toole <philip.oto...@yahoo.com.invalid> wrote:
> > > >
> > > > > It's still not clear to me why you need to create so many topics.
> > > > >
> > > > > Write the data to a single topic and consume it when it arrives. It doesn't matter if it arrives in bursts, as long as you can process it all within 6 minutes, right?
> > > > >
> > > > > And if you can't consume it all within 6 minutes, partition the topic until you can run enough consumers such that you can keep up. The fact that you are thinking about so many topics is a sign your design is wrong, or Kafka is the wrong solution.
> > > > >
> > > > > Philip
> > > > >
> > > > > On Aug 11, 2014, at 5:18 PM, Chen Wang <chen.apache.s...@gmail.com> wrote:
> > > > >
> > > > > > Philip,
> > > > > > That is right. There is a huge amount of data flushed into the topic within each 6 minutes. Then at the end of each 6 minutes, I only want to read from that specific topic, and the data within that topic has to be processed as fast as possible. I was originally using a redis queue for this purpose, but it takes much longer to process a redis queue than a kafka queue (testing data is 2M messages). Since we already have the kafka infrastructure set up, instead of seeking other tools (activeMQ, rabbitMQ, etc.) I would rather make use of kafka, although it does not seem like a common kafka use case.
> > > > > >
> > > > > > Chen
> > > > > >
> > > > > > On Mon, Aug 11, 2014 at 5:01 PM, Philip O'Toole <philip.oto...@yahoo.com.invalid> wrote:
> > > > > >
> > > > > > > I'd love to know more about what you're trying to do here. It sounds like you're trying to create topics on a schedule, trying to make it easy to locate data for a given time range? I'm not sure it makes sense to use Kafka in this manner.
> > > > > > >
> > > > > > > Can you provide more detail?
> > > > > > >
> > > > > > > Philip
> > > > > > >
> > > > > > > -----------------------------------------
> > > > > > > http://www.philipotoole.com
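Philip's suggestion above of an intermediate stage that tags each record with its scheduled send time and writes everything to a single topic could look roughly like this, again assuming the 0.8 producer API; the topic name, broker list, and the "timestamp|payload" encoding are illustrative only.

import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class TimestampTaggingProducer {

    private final Producer<String, String> producer;
    private final String topic;

    public TimestampTaggingProducer(String brokerList, String topic) {
        Properties props = new Properties();
        props.put("metadata.broker.list", brokerList);
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        this.producer = new Producer<String, String>(new ProducerConfig(props));
        this.topic = topic;
    }

    // Prefix every campaign payload with its scheduled send time so a later
    // stage can decide when it is due, instead of encoding time in topic names.
    public void send(long scheduledSendTimeMillis, String campaignPayload) {
        String tagged = scheduledSendTimeMillis + "|" + campaignPayload;
        producer.send(new KeyedMessage<String, String>(topic, tagged));
    }

    public void close() {
        producer.close();
    }

    public static void main(String[] args) {
        TimestampTaggingProducer stage =
                new TimestampTaggingProducer("broker1:9092", "campaigns");   // placeholders
        // A campaign scheduled two days ahead, as in Chen's example.
        long twoDaysAhead = System.currentTimeMillis() + 2L * 24 * 60 * 60 * 1000;
        stage.send(twoDaysAhead, "campaign-42 body ...");
        stage.close();
    }
}

A downstream stage would then read continuously and hold each record until its embedded send time arrives, which is exactly the part Chen is hoping to offload onto Kafka itself.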
> > > > > > > On Monday, August 11, 2014 4:45 PM, Chen Wang <chen.apache.s...@gmail.com> wrote:
> > > > > > >
> > > > > > > Todd,
> > > > > > > I actually only intend to keep each topic valid for 3 days at most. Each of our topics has 3 partitions, so it's around 3*240*3 = 2160 partitions. Since there is no api for deleting a topic, I guess I could set up a cron job deleting the outdated topics (folders) from zookeeper. Do you know when the delete-topic api will be available in kafka?
> > > > > > > Chen
> > > > > > >
> > > > > > > On Mon, Aug 11, 2014 at 3:47 PM, Todd Palino <tpal...@linkedin.com.invalid> wrote:
> > > > > > >
> > > > > > > > You need to consider your total partition count as you do this. After 30 days, assuming 1 partition per topic, you have 7200 partitions. Depending on how many brokers you have, this can start to be a problem. We just found an issue on one of our clusters that has over 70k partitions: there's now a problem with doing actions like a preferred replica election for all topics, because the JSON object that gets written to the zookeeper node to trigger it is too large for Zookeeper's default 1 MB data size.
> > > > > > > >
> > > > > > > > You also need to think about the number of open file handles. Even with no data, there will be open files for each topic.
> > > > > > > >
> > > > > > > > -Todd
> > > > > > > >
> > > > > > > > On 8/11/14, 2:19 PM, "Chen Wang" <chen.apache.s...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Folks,
> > > > > > > > > Is there any potential issue with creating 240 topics every day? Although the retention of each topic is set to 2 days, I am a little concerned that, since right now there is no delete-topic api, the zookeepers might be overloaded.
> > > > > > > > > Thanks,
> > > > > > > > > Chen
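For the arithmetic being discussed: 240 topics per day corresponds to one topic per 6-minute window. A small helper like the following (the topic naming pattern is hypothetical) shows how such time-bucketed names would be derived, and why the partition count grows as quickly as Todd describes.

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class WindowTopics {

    private static final long WINDOW_MILLIS = 6 * 60 * 1000L;   // 6-minute windows => 240 per day

    // Hypothetical naming scheme: one topic per 6-minute window, e.g. "campaign_20140811_2000".
    public static String topicFor(long timestampMillis) {
        long bucketStart = (timestampMillis / WINDOW_MILLIS) * WINDOW_MILLIS;
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd_HHmm");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return "campaign_" + fmt.format(new Date(bucketStart));
    }

    public static void main(String[] args) {
        System.out.println(topicFor(System.currentTimeMillis()));

        // Chen's numbers: 240 topics/day * 3 partitions * 3 days retained = 2160 partitions.
        System.out.println(240 * 3 * 3);
        // Todd's numbers: 240 topics/day * 1 partition * 30 days = 7200 partitions.
        System.out.println(240 * 1 * 30);
    }
}

Even with aggressive retention, each of those topics keeps open file handles and zookeeper state on the brokers, which is the cost Todd is pointing at.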