Vipul,

The problem is that the producer does not know when it should set the window start and window end boundaries. The data does not arrive in order. I also think it is difficult to get the offset of each boundary and pull only the messages between those boundaries: I am already trying to avoid using the SimpleConsumer, but apparently what you describe cannot be achieved with the high-level consumer group.

Chen
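For reference, the offset-bounded pull being discussed here is only possible through the low-level consumer. The sketch below shows roughly what it might look like with the 0.8 SimpleConsumer, assuming the boundary offsets for one partition have already been recorded somewhere; the broker host, topic name, client id, and offsets are placeholders, and leader discovery and error handling are omitted.

import java.nio.ByteBuffer;

import kafka.api.FetchRequest;
import kafka.api.FetchRequestBuilder;
import kafka.javaapi.FetchResponse;
import kafka.javaapi.consumer.SimpleConsumer;
import kafka.message.MessageAndOffset;

public class WindowBatchReader {
    public static void main(String[] args) throws Exception {
        String topic = "campaigns";       // hypothetical topic name
        int partition = 0;
        long windowStart = 1000L;         // offset of the window-begin marker (recorded earlier)
        long windowEnd = 5000L;           // offset of the window-end marker (recorded earlier)

        // Must point at the current leader for this partition; leader lookup is omitted here.
        SimpleConsumer consumer =
                new SimpleConsumer("broker1", 9092, 100000, 64 * 1024, "window-reader");

        long offset = windowStart;
        boolean done = false;
        while (!done && offset < windowEnd) {
            FetchRequest req = new FetchRequestBuilder()
                    .clientId("window-reader")
                    .addFetch(topic, partition, offset, 1024 * 1024)
                    .build();
            FetchResponse resp = consumer.fetch(req);

            long before = offset;
            for (MessageAndOffset mao : resp.messageSet(topic, partition)) {
                if (mao.offset() >= windowEnd) {   // stop at the window-end boundary
                    done = true;
                    break;
                }
                ByteBuffer payload = mao.message().payload();
                byte[] bytes = new byte[payload.limit()];
                payload.get(bytes);
                // hand the decoded message to the actual send pipeline here
                offset = mao.nextOffset();
            }
            if (offset == before) {
                done = true;                        // nothing new returned; avoid spinning
            }
        }
        consumer.close();
    }
}

This is the bookkeeping Chen is trying to avoid: the high-level consumer group only reads forward from its committed position, so it cannot be pointed at an arbitrary boundary offset.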
On Mon, Aug 11, 2014 at 8:04 PM, vipul jhawar <vipul.jha...@gmail.com> wrote:

> Your use case requires messages to be pushed out when their time comes instead of the order in which they arrived, so Kafka may not be the best fit for this: within the queue you want some message batches to be sent out early and some later. There could be another way to solve this with offset management, as Kafka is mainly pull based and you can pull a message from the offset you desire, which can solve your case without dealing with new topics per day or a T time window.
>
> If you consider a fixed time window of T (say 6 minutes) on the producer side as one big chunk, then you could have your producer send a window-begin and a window-end boundary message to all partitions in the topic. Now, as you said that particular chunk will need to be sent when its time comes, you could simply have a message player on your consumer side which just looks for the window messages and keeps a note of their offsets. These offsets may be mapped to a trigger time, which is when you want to start sending that complete batch. Your consumers are triggered with the required offsets at different times, and they just go and pull out the big batch between the start and end boundary. As the same start and end boundary exists across all partitions, it doesn't matter which partition your messages landed on. The only issue I see with this is how you maintain message expiry: if you have a use case which ensures that you can only schedule n days ahead, then you set expiry to n days, you will not have to deal with deletion at all, and the same topic with multiple partitions would get you what you need.
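A minimal sketch of the producer side of this proposal, assuming the 0.8.1 producer API: each window-begin (and, later, window-end) marker is keyed with the target partition number and routed through a partitioner that honours that key, so one copy lands in every partition. The topic name, broker list, window label, and marker payload format are all made up for illustration.

import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.Partitioner;
import kafka.producer.ProducerConfig;
import kafka.utils.VerifiableProperties;

public class WindowBoundaryProducer {

    // Routes each message to the partition named by its key, so a boundary
    // marker can be addressed to a specific partition.
    public static class ExplicitPartitioner implements Partitioner {
        public ExplicitPartitioner(VerifiableProperties props) { }
        public int partition(Object key, int numPartitions) {
            return Integer.parseInt((String) key) % numPartitions;
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092");                  // placeholder broker
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("key.serializer.class", "kafka.serializer.StringEncoder");
        props.put("partitioner.class", ExplicitPartitioner.class.getName());

        Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));

        String topic = "campaigns";                 // hypothetical topic
        int numPartitions = 3;
        String windowId = "20140811-2000";          // hypothetical 6-minute window label

        // One window-begin marker per partition; a matching WINDOW_END marker
        // would be sent the same way once the producer closes the window.
        for (int p = 0; p < numPartitions; p++) {
            producer.send(new KeyedMessage<String, String>(
                    topic, Integer.toString(p), "WINDOW_BEGIN:" + windowId));
        }
        producer.close();
    }
}

The consumer-side "message player" would then record the offset at which each marker appears in each partition and, at trigger time, replay from that offset as in the earlier sketch.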
> On Tue, Aug 12, 2014 at 8:04 AM, Chen Wang <chen.apache.s...@gmail.com> wrote:
>
> > The data does have a timestamp: it is actually email campaigns with a scheduled send time. But since they can be scheduled ahead (e.g., two days ahead), I cannot read them when they arrive; they have to wait until their actual scheduled send time. As you can tell, the sequence within the 6 minutes does not matter, but it does matter to ensure messages in the first 6 minutes are sent out before those in the next 6 minutes, thus reading at the end of every 6 minutes.
> >
> > Therefore it's hard for me to put the messages into different partitions of a single topic, because 6 minutes of data might already be too big for a single partition, let alone the offset management chaos.
> >
> > It seems that a fast queue system is the right tool to use, but it involves more setup and cluster maintenance overhead. My thought is to use the existing kafka cluster, with the hope that the topic deletion api will be available soon. Meantime, just have a cron job cleaning up the outdated topics from zookeeper.
> >
> > Let me know what you think,
> > Thanks,
> > Chen
> >
> > On Mon, Aug 11, 2014 at 6:53 PM, Philip O'Toole <philip.oto...@yahoo.com.invalid> wrote:
> >
> > > Why do you need to read it every 6 minutes? Why not just read it as it arrives? If it naturally arrives in 6-minute bursts, you'll read it in 6-minute bursts, no?
> > >
> > > Perhaps the data does not have timestamps embedded in it, so that is why you are relying on time-based topic names? In that case I would have an intermediate stage that tags the data with the timestamp and then writes it to a single topic, and then processes it at your leisure in a third stage.
> > >
> > > Perhaps I am still missing a key difficulty with your system.
> > >
> > > Your original suggestion is going to be difficult to get working. You'll quickly run out of file descriptors, amongst other issues.
> > >
> > > Philip
> > >
> > > ---------------------------------
> > > http://www.philipotoole.com
> > >
> > > On Aug 11, 2014, at 6:42 PM, Chen Wang <chen.apache.s...@gmail.com> wrote:
> > >
> > > > "And if you can't consume it all within 6 minutes, partition the topic until you can run enough consumers such that you can keep up" -- this is what I intend to do for each 6-minute topic.
> > > >
> > > > What I really need is a partitioned queue: each 6 minutes of data can be put into a separate partition, so that I can read that specific partition at the end of each 6 minutes. Apparently redis naturally fits this case, but the only issue is the performance (well, also some tricks in ensuring reliable message delivery). As I said, we have kafka infrastructure in place; if I can make the design work with kafka without too much effort, I would rather go this path instead of setting up another queue system.
> > > >
> > > > Chen
> > > >
> > > > On Mon, Aug 11, 2014 at 6:07 PM, Philip O'Toole <philip.oto...@yahoo.com.invalid> wrote:
> > > >
> > > > > It's still not clear to me why you need to create so many topics.
> > > > >
> > > > > Write the data to a single topic and consume it when it arrives. It doesn't matter if it arrives in bursts, as long as you can process it all within 6 minutes, right?
> > > > >
> > > > > And if you can't consume it all within 6 minutes, partition the topic until you can run enough consumers such that you can keep up. The fact that you are thinking about so many topics is a sign your design is wrong, or Kafka is the wrong solution.
> > > > >
> > > > > Philip
> > > > >
> > > > > On Aug 11, 2014, at 5:18 PM, Chen Wang <chen.apache.s...@gmail.com> wrote:
> > > > >
> > > > > > Philip,
> > > > > > That is right. There is a huge amount of data flushed into the topic within each 6 minutes. Then at the end of each 6 minutes, I only want to read from that specific topic, and the data within that topic has to be processed as fast as possible. I was originally using a redis queue for this purpose, but it takes much longer to process a redis queue than a kafka queue (testing data is 2M messages). Since we already have the kafka infrastructure set up, instead of seeking other tools (activeMQ, rabbitMQ, etc.) I would rather make use of kafka, although it does not seem like a common kafka use case.
> > > > > >
> > > > > > Chen
> > > > > >
> > > > > > On Mon, Aug 11, 2014 at 5:01 PM, Philip O'Toole <philip.oto...@yahoo.com.invalid> wrote:
> > > > > >
> > > > > > > I'd love to know more about what you're trying to do here. It sounds like you're trying to create topics on a schedule, trying to make it easy to locate data for a given time range? I'm not sure it makes sense to use Kafka in this manner.
> > > > > > >
> > > > > > > Can you provide more detail?
> > > > > > >
> > > > > > > Philip
> > > > > > >
> > > > > > > -----------------------------------------
> > > > > > > http://www.philipotoole.com
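Philip's suggestion above of an intermediate stage that tags each record with its scheduled send time and writes everything to a single topic could look roughly like this, again assuming the 0.8 producer API; the topic name, broker list, and the "timestamp|payload" encoding are illustrative only.

import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class TimestampTaggingProducer {

    private final Producer<String, String> producer;
    private final String topic;

    public TimestampTaggingProducer(String brokerList, String topic) {
        Properties props = new Properties();
        props.put("metadata.broker.list", brokerList);
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        this.producer = new Producer<String, String>(new ProducerConfig(props));
        this.topic = topic;
    }

    // Prefix every campaign payload with its scheduled send time so a later
    // stage can decide when it is due, instead of encoding time in topic names.
    public void send(long scheduledSendTimeMillis, String campaignPayload) {
        String tagged = scheduledSendTimeMillis + "|" + campaignPayload;
        producer.send(new KeyedMessage<String, String>(topic, tagged));
    }

    public void close() {
        producer.close();
    }

    public static void main(String[] args) {
        TimestampTaggingProducer stage =
                new TimestampTaggingProducer("broker1:9092", "campaigns");   // placeholders
        // A campaign scheduled two days ahead, as in Chen's example.
        long twoDaysAhead = System.currentTimeMillis() + 2L * 24 * 60 * 60 * 1000;
        stage.send(twoDaysAhead, "campaign-42 body ...");
        stage.close();
    }
}

A downstream stage would then read continuously and hold each record until its embedded send time arrives, which is exactly the part Chen is hoping to offload onto Kafka itself.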
> > > > > > > On Monday, August 11, 2014 4:45 PM, Chen Wang <chen.apache.s...@gmail.com> wrote:
> > > > > > >
> > > > > > > Todd,
> > > > > > > I actually only intend to keep each topic valid for 3 days at most. Each of our topics has 3 partitions, so it's around 3*240*3 = 2160 partitions. Since there is no api for deleting a topic, I guess I could set up a cron job deleting the outdated topics (folders) from zookeeper. Do you know when the delete-topic api will be available in kafka?
> > > > > > > Chen
> > > > > > >
> > > > > > > On Mon, Aug 11, 2014 at 3:47 PM, Todd Palino <tpal...@linkedin.com.invalid> wrote:
> > > > > > >
> > > > > > > > You need to consider your total partition count as you do this. After 30 days, assuming 1 partition per topic, you have 7200 partitions. Depending on how many brokers you have, this can start to be a problem. We just found an issue on one of our clusters that has over 70k partitions: there's now a problem with doing actions like a preferred replica election for all topics, because the JSON object that gets written to the zookeeper node to trigger it is too large for Zookeeper's default 1 MB data size.
> > > > > > > >
> > > > > > > > You also need to think about the number of open file handles. Even with no data, there will be open files for each topic.
> > > > > > > >
> > > > > > > > -Todd
> > > > > > > >
> > > > > > > > On 8/11/14, 2:19 PM, "Chen Wang" <chen.apache.s...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Folks,
> > > > > > > > > Is there any potential issue with creating 240 topics every day? Although the retention of each topic is set to 2 days, I am a little concerned that, since right now there is no delete-topic api, the zookeepers might be overloaded.
> > > > > > > > > Thanks,
> > > > > > > > > Chen
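For the arithmetic being discussed: 240 topics per day corresponds to one topic per 6-minute window. A small helper like the following (the topic naming pattern is hypothetical) shows how such time-bucketed names would be derived, and why the partition count grows as quickly as Todd describes.

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class WindowTopics {

    private static final long WINDOW_MILLIS = 6 * 60 * 1000L;   // 6-minute windows => 240 per day

    // Hypothetical naming scheme: one topic per 6-minute window, e.g. "campaign_20140811_2000".
    public static String topicFor(long timestampMillis) {
        long bucketStart = (timestampMillis / WINDOW_MILLIS) * WINDOW_MILLIS;
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd_HHmm");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return "campaign_" + fmt.format(new Date(bucketStart));
    }

    public static void main(String[] args) {
        System.out.println(topicFor(System.currentTimeMillis()));

        // Chen's numbers: 240 topics/day * 3 partitions * 3 days retained = 2160 partitions.
        System.out.println(240 * 3 * 3);
        // Todd's numbers: 240 topics/day * 1 partition * 30 days = 7200 partitions.
        System.out.println(240 * 1 * 30);
    }
}

Even with aggressive retention, each of those topics keeps open file handles and zookeeper state on the brokers, which is the cost Todd is pointing at.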