Your use case requires messages to be pushed out when their scheduled
time arrives rather than in the order in which they were produced, and
Kafka may not be the best fit for that, since within the queue you want
some message batches sent out early and some later. There could be
another way to solve this with offset management: Kafka is mainly pull
based, and you can pull messages from whatever offset you like, which
could solve your case without dealing with new topics per day or per
T-minute window.
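
To make that concrete, here is a minimal sketch of pulling from a chosen
offset. This thread predates the Java KafkaConsumer, so treat it as
illustrative only; the broker address, topic name, and the stored offset
are placeholders, not anything from your setup:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class OffsetReplay {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker
            props.put("enable.auto.commit", "false");  // manage offsets ourselves
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // "campaigns", partition 0, and offset 42000 stand in for a
                // real topic-partition and an offset you recorded earlier.
                TopicPartition tp = new TopicPartition("campaigns", 0);
                consumer.assign(List.of(tp)); // manual assignment, no rebalancing
                consumer.seek(tp, 42_000L);   // jump straight to the desired offset
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
                }
            }
        }
    }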

If you treat a fixed time window of T (say 6 minutes) on the producer
side as one big chunk, you could have your producer send a window-begin
and a window-end boundary message to every partition in the topic. Since,
as you said, a given chunk must be sent when its time comes, you could
run a message player on your consumer side that simply watches for these
window messages and keeps a note of their offsets. Each pair of offsets
can then be mapped to a trigger time, which is when you want to start
sending that complete batch. Your consumers are triggered with the
required offsets at different times, and they just go and pull out the
big batch between the begin and end boundaries. Because the same begin
and end boundaries appear in every partition, it doesn't matter which
partition your messages landed on.

The only issue I see with this is message expiry. If your use case
guarantees that sends can only be scheduled n days ahead, then you can
set the retention to n days, never deal with deletion at all, and the
same topic with multiple partitions gets you what you need.
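
Roughly, the two halves of that scheme could look like the sketch below
(untested; the topic name, marker payloads, and the dispatch() hand-off
are made up for illustration). The expiry part is then just topic-level
retention, e.g. retention.ms set to n days:

    import java.time.Duration;
    import java.util.List;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.TopicPartition;

    public class WindowPlayer {

        // Producer side: broadcast a boundary marker to every partition so
        // the window edges are visible no matter where the real messages
        // landed.
        static void broadcastMarker(KafkaProducer<String, String> producer,
                                    String topic, String marker) {
            int partitions = producer.partitionsFor(topic).size();
            for (int p = 0; p < partitions; p++) {
                producer.send(new ProducerRecord<>(topic, p, "marker", marker));
            }
        }

        // Consumer side: at the trigger time mapped to this window, pull
        // every record between the recorded begin and end marker offsets
        // of one partition.
        static void replayWindow(KafkaConsumer<String, String> consumer,
                                 TopicPartition tp,
                                 long beginOffset, long endOffset) {
            consumer.assign(List.of(tp));
            consumer.seek(tp, beginOffset + 1); // skip the begin marker itself
            boolean done = false;
            while (!done) {
                for (ConsumerRecord<String, String> r :
                        consumer.poll(Duration.ofMillis(500)).records(tp)) {
                    if (r.offset() >= endOffset) { // reached the end marker
                        done = true;
                        break;
                    }
                    dispatch(r.value());
                }
            }
        }

        // Placeholder for whatever actually sends the campaign emails.
        static void dispatch(String campaign) {
            System.out.println("sending: " + campaign);
        }
    }

You would run replayWindow once per partition (the marker offsets differ
per partition), and the whole batch goes out without touching any other
window's data.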


On Tue, Aug 12, 2014 at 8:04 AM, Chen Wang <chen.apache.s...@gmail.com>
wrote:

> That data has a timestamp: it's actually email campaigns with scheduled
> send times. But since they can be scheduled ahead (e.g., two days
> ahead), I cannot read a message when it arrives; it has to wait until
> its actual scheduled send time. As you can tell, the ordering within a
> 6-minute window does not matter, but it does matter that messages in the
> first 6 minutes are sent out before those in the next 6, hence reading
> at the end of every 6 minutes.
>
> Therefore it's hard for me to put the messages into different partitions
> of a single topic, because 6 minutes of data might already be too big
> for a single partition, let alone the offset management chaos.
>
> It seems that a fast queue system is the right tool to use, but it
> involves more setup and cluster maintenance overhead. My thought is to
> use the existing Kafka cluster, in the hope that the topic deletion API
> will be available soon. Meanwhile, a cron job would just clean up the
> outdated topics from zookeeper.
>
> Let me know what you think,
> Thanks,
> Chen
>
>
> On Mon, Aug 11, 2014 at 6:53 PM, Philip O'Toole <
> philip.oto...@yahoo.com.invalid> wrote:
>
> > Why do you need to read it every 6 minutes? Why not just read it as
> > it arrives? If it naturally arrives in 6-minute bursts, you'll read
> > it in 6-minute bursts, no?
> >
> > Perhaps the data does not have timestamps embedded in it, and that is
> > why you are relying on time-based topic names? In that case I would
> > add an intermediate stage that tags the data with a timestamp and
> > writes it to a single topic, and then process it at your leisure in a
> > third stage.
> >
> > Perhaps I am still missing a key difficulty with your system.
> >
> > Your original suggestion is going to be difficult to get working. You'll
> > quickly run out of file descriptors, amongst other issues.
> >
> > Philip
> >
> >
> >
> >
> > ---------------------------------
> > http://www.philipotoole.com
> >
> > > On Aug 11, 2014, at 6:42 PM, Chen Wang <chen.apache.s...@gmail.com>
> > wrote:
> > >
> > > "And if you can't consume it all within 6 minutes, partition the topic
> > > until you can run enough consumers such that you can keep up.", this is
> > > what I intend to do for each 6min -topic.
> > >
> > > What I really need is a partitioned queue: each 6 minute of data can
> put
> > > into a separate partition, so that I can read that specific partition
> at
> > > the end of each 6 minutes. So apparently redis naturally fit this case,
> > but
> > > the only issue is the performance,(well also some trick in ensuring the
> > > reliable message delivery). As I said, we have kakfa infrastructure in
> > > place, if without too much work, i can make the design work with
> kafka, i
> > > would rather go this path instead of setting up another queue system.
> > >
> > > Chen
> > >
> > > Chen
> > >
> > >
> > > On Mon, Aug 11, 2014 at 6:07 PM, Philip O'Toole <
> > > philip.oto...@yahoo.com.invalid> wrote:
> > >
> > >> It's still not clear to me why you need to create so many topics.
> > >>
> > >> Write the data to a single topic and consume it when it arrives. It
> > >> doesn't matter if it arrives in bursts, as long as you can process
> > >> it all within 6 minutes, right?
> > >>
> > >> And if you can't consume it all within 6 minutes, partition the
> > >> topic until you can run enough consumers such that you can keep up.
> > >> The fact that you are thinking about so many topics is a sign your
> > >> design is wrong, or Kafka is the wrong solution.
> > >>
> > >> Philip
> > >>
> > >>>> On Aug 11, 2014, at 5:18 PM, Chen Wang <chen.apache.s...@gmail.com>
> > >>>> wrote:
> > >>>
> > >>> Philip,
> > >>> That is right. There is a huge amount of data flushed into the
> > >>> topic within each 6 minutes. Then at the end of each 6 minutes, I
> > >>> only want to read from that specific topic, and data within that
> > >>> topic has to be processed as fast as possible. I was originally
> > >>> using a Redis queue for this purpose, but it takes much longer to
> > >>> process a Redis queue than a Kafka queue (the test data was 2M
> > >>> messages). Since we already have Kafka infrastructure set up,
> > >>> instead of seeking other tools (ActiveMQ, RabbitMQ, etc.), I would
> > >>> rather make use of Kafka, although it does not seem like a common
> > >>> Kafka use case.
> > >>>
> > >>> Chen
> > >>>
> > >>>
> > >>>> On Mon, Aug 11, 2014 at 5:01 PM, Philip O'Toole
> > >>>> <philip.oto...@yahoo.com.invalid> wrote:
> > >>>> I'd love to know more about what you're trying to do here. It
> > >>>> sounds like you're trying to create topics on a schedule, trying
> > >>>> to make it easy to locate data for a given time range? I'm not
> > >>>> sure it makes sense to use Kafka in this manner.
> > >>>>
> > >>>> Can you provide more detail?
> > >>>>
> > >>>>
> > >>>> Philip
> > >>>>
> > >>>>
> > >>>> -----------------------------------------
> > >>>> http://www.philipotoole.com
> > >>>>
> > >>>>
> > >>>> On Monday, August 11, 2014 4:45 PM, Chen Wang
> > >>>> <chen.apache.s...@gmail.com> wrote:
> > >>>>
> > >>>>
> > >>>>
> > >>>> Todd,
> > >>>> I actually only intend to keep each topic valid for 3 days at
> > >>>> most. Each of our topics has 3 partitions, so that's around
> > >>>> 3*240*3 = 2160 partitions. Since there is no API for deleting
> > >>>> topics, I guess I could set up a cron job deleting the outdated
> > >>>> topics (folders) from zookeeper.
> > >>>> Do you know when the delete topic API will be available in Kafka?
> > >>>> Chen
> > >>>>
> > >>>>
> > >>>>
> > >>>> On Mon, Aug 11, 2014 at 3:47 PM, Todd Palino
> > >>>> <tpal...@linkedin.com.invalid> wrote:
> > >>>>
> > >>>>> You need to consider your total partition count as you do this.
> > >>>>> After 30 days, assuming 1 partition per topic, you have 7200
> > >>>>> partitions. Depending on how many brokers you have, this can
> > >>>>> start to be a problem. We just found an issue on one of our
> > >>>>> clusters that has over 70k partitions: there's now a problem
> > >>>>> with doing actions like a preferred replica election for all
> > >>>>> topics, because the JSON object that gets written to the
> > >>>>> zookeeper node to trigger it is too large for Zookeeper's
> > >>>>> default 1 MB data size.
> > >>>>>
> > >>>>> You also need to think about the number of open file handles.
> > >>>>> Even with no data, there will be open files for each topic.
> > >>>>>
> > >>>>> -Todd
> > >>>>>
> > >>>>>
> > >>>>>> On 8/11/14, 2:19 PM, "Chen Wang" <chen.apache.s...@gmail.com>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>> Folks,
> > >>>>>> Is there any potential issue with creating 240 topics every
> > >>>>>> day? Although the retention of each topic is set to 2 days, I
> > >>>>>> am a little concerned that since right now there is no delete
> > >>>>>> topic API, the zookeepers might be overloaded.
> > >>>>>> Thanks,
> > >>>>>> Chen
> > >>
> >
>
