Re: Issue with 240 topics per day

2014-08-12 Thread Philip O'Toole
Todd -- can you share details of the ZK cluster you are running to support this scale? Is it one single Kafka cluster? Are you using a single ZK cluster? Thanks, Philip - http://www.philipotoole.com

Re: Issue with 240 topics per day

2014-08-11 Thread Chen Wang
Got it. Thanks for the input, Todd! Chen

Re: Issue with 240 topics per day

2014-08-11 Thread Todd Palino
As I noted, we have a cluster right now with 70k partitions. It’s running on over 30 brokers, partly to cover the number of partitions and partly to cover the amount of data that we push through it. If you can have at least 4 or 5 brokers, I wouldn’t anticipate any problems with the number of partitions …

Re: Issue with 240 topics per day

2014-08-11 Thread Chen Wang
Todd, Yes, I actually thought about that. My concern is that even a week's worth of topic partitions (240 × 7 × 3 = 5,040) is too many. Does LinkedIn have good experience running this many topics in your system? :-) Thanks, Chen

Re: Issue with 240 topics per day

2014-08-11 Thread Todd Palino
In order to delete topics, you need to shut down the entire cluster (all brokers), delete the topics from ZooKeeper, and delete the log files and partition directories from the disk on the brokers. Then you can restart the cluster. Assuming that you can take a periodic outage on your cluster, you can …
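
A minimal sketch of the ZooKeeper half of that procedure, using the ZkClient library that Kafka 0.8 ships with. The ZK connect string is a placeholder, and the log files on each broker's disk still have to be removed separately, as Todd says:

    import org.I0Itec.zkclient.ZkClient;

    public class TopicZkCleanup {
        public static void main(String[] args) {
            // Assumes the whole cluster is already stopped, per the procedure above.
            ZkClient zkClient = new ZkClient("zk-host:2181");
            try {
                for (String topic : args) {
                    // Kafka 0.8 registers topics under /brokers/topics/<name>.
                    zkClient.deleteRecursive("/brokers/topics/" + topic);
                }
            } finally {
                zkClient.close();
            }
        }
    }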

Re: Issue with 240 topics per day

2014-08-11 Thread Chen Wang
Unfortunately, this would not work in our system. It means that every few minutes I would need to scan the entire queue, which is not possible in our case. In fact, our old system is designed this way: store the data in HBase, with an hourly MapReduce job scanning the entire table to figure out which …

Re: Issue with 240 topics per day

2014-08-11 Thread Chen Wang
Vipul, The problem is that the producer does not know when it should set the window start and window end boundaries. The data does not arrive in order. I also think it's difficult to get the offset of the boundary and only pull messages between those boundaries: I am already trying to avoid using the …

Re: Issue with 240 topics per day

2014-08-11 Thread Philip O'Toole
Ok, now that is good detail. I understand your issue. It's somewhat difficult to use Kafka in your situation, as Kafka is a FIFO queue, but you are trying to use it with data that is not tightly ordered in that manner. I don't have any definite solutions, but perhaps this might work. Assuming …

Re: Issue with 240 topics per day

2014-08-11 Thread vipul jhawar
Your use case requires messages to be pushed out when their time comes instead of in the order in which they arrived, so Kafka may not be best for this: within the queue you want some message batches to be sent out early and some later. There could be another way to solve this with offset management, as Kafka is …
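
For reference, a sketch of the kind of offset management hinted at here, using the 0.8 SimpleConsumer's time-based offset lookup. Note that getOffsetsBefore resolves only to log-segment boundaries, so it gives a coarse starting point rather than an exact per-message position; host, port, topic, and client id are placeholders:

    import java.util.HashMap;
    import java.util.Map;

    import kafka.api.PartitionOffsetRequestInfo;
    import kafka.common.TopicAndPartition;
    import kafka.javaapi.OffsetResponse;
    import kafka.javaapi.consumer.SimpleConsumer;

    public class TimeOffsetLookup {
        // Returns an offset from the newest log segment that starts before
        // timestampMs -- segment granularity, not per-message precision.
        public static long offsetBefore(SimpleConsumer consumer, String topic,
                                        int partition, long timestampMs) {
            TopicAndPartition tp = new TopicAndPartition(topic, partition);
            Map<TopicAndPartition, PartitionOffsetRequestInfo> requestInfo =
                    new HashMap<TopicAndPartition, PartitionOffsetRequestInfo>();
            requestInfo.put(tp, new PartitionOffsetRequestInfo(timestampMs, 1));
            kafka.javaapi.OffsetRequest request = new kafka.javaapi.OffsetRequest(
                    requestInfo, kafka.api.OffsetRequest.CurrentVersion(), "offset-lookup");
            OffsetResponse response = consumer.getOffsetsBefore(request);
            return response.offsets(topic, partition)[0];
        }

        public static void main(String[] args) {
            SimpleConsumer consumer =
                    new SimpleConsumer("broker-host", 9092, 100000, 64 * 1024, "offset-lookup");
            long sixMinutesAgo = System.currentTimeMillis() - 6 * 60 * 1000L;
            System.out.println(offsetBefore(consumer, "campaigns", 0, sixMinutesAgo));
            consumer.close();
        }
    }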

Re: Issue with 240 topics per day

2014-08-11 Thread Chen Wang
The data has a timestamp: it's actually email campaigns with scheduled send times. But since they can be scheduled ahead of time (e.g., two days ahead), I cannot read a message when it arrives. It has to wait until its actual scheduled send time. As you can tell, the sequence within the 6 minutes does not matter, but …
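
As an illustration of the topic-per-window scheme under discussion, a sketch that maps a campaign's scheduled send time to a 6-minute-window topic. The "campaign_" prefix and epoch-bucket suffix are hypothetical, since the thread never states the actual naming scheme:

    public class WindowTopic {
        // 6-minute windows, i.e. 240 per day.
        private static final long WINDOW_MS = 6 * 60 * 1000L;

        public static String topicFor(long scheduledSendTimeMs) {
            return "campaign_" + (scheduledSendTimeMs / WINDOW_MS);
        }
    }

The producer would write each campaign to topicFor(scheduledSendTime), and the job that fires at the end of each window would consume topicFor(now - WINDOW_MS).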

Re: Issue with 240 topics per day

2014-08-11 Thread Philip O'Toole
Why do you need to read it every 6 minutes? Why not just read it as it arrives? If it naturally arrives in 6-minute bursts, you'll read it in 6-minute bursts, no? Perhaps the data does not have timestamps embedded in it, and that is why you are relying on time-based topic names? In that case I would …

Re: Issue with 240 topics per day

2014-08-11 Thread Chen Wang
"And if you can't consume it all within 6 minutes, partition the topic until you can run enough consumers such that you can keep up.", this is what I intend to do for each 6min -topic. What I really need is a partitioned queue: each 6 minute of data can put into a separate partition, so that I can

Re: Issue with 240 topics per day

2014-08-11 Thread Philip O'Toole
It's still not clear to me why you need to create so many topics. Write the data to a single topic and consume it when it arrives. It doesn't matter if it arrives in bursts, as long as you can process it all within 6 minutes, right? And if you can't consume it all within 6 minutes, partition the topic until you can run enough consumers such that you can keep up.
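
To make "run enough consumers" concrete, a sketch with the 0.8 high-level consumer, asking for several streams on one topic so each can be drained by its own thread. Topic name, group id, ZK address, and thread count are placeholders; useful parallelism is capped by the topic's partition count:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class ParallelConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "zk-host:2181");
            props.put("group.id", "campaign-consumers");
            ConsumerConnector connector = kafka.consumer.Consumer
                    .createJavaConsumerConnector(new ConsumerConfig(props));

            // Ask for 4 streams on the one topic, one thread per stream.
            Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
            topicCountMap.put("campaigns", 4);
            Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                    connector.createMessageStreams(topicCountMap);

            for (final KafkaStream<byte[], byte[]> stream : streams.get("campaigns")) {
                new Thread(new Runnable() {
                    public void run() {
                        ConsumerIterator<byte[], byte[]> it = stream.iterator();
                        while (it.hasNext()) {
                            byte[] payload = it.next().message();
                            // process each message as it arrives
                        }
                    }
                }).start();
            }
        }
    }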

Re: Issue with 240 topics per day

2014-08-11 Thread Chen Wang
Philip, That is right. There is a huge amount of data flushed into the topic within each 6 minutes. Then at the end of each 6-minute window, I only want to read from that specific topic, and the data within that topic has to be processed as fast as possible. I was originally using a Redis queue for this purpose, but …

Re: Issue with 240 topics per day

2014-08-11 Thread Philip O'Toole
I'd love to know more about what you're trying to do here. It sounds like you're creating topics on a schedule to make it easy to locate data for a given time range? I'm not sure it makes sense to use Kafka in this manner. Can you provide more detail? Philip

Re: Issue with 240 topics per day

2014-08-11 Thread Chen Wang
Todd, I actually only intend to keep each topic valid for 3 days at most. Each of our topics has 3 partitions, so it's around 3 × 240 × 3 = 2,160 partitions. Since there is no API for deleting topics, I guess I could set up a cron job deleting the outdated topics (folders) from ZooKeeper. Do you know when the …

Re: Issue with 240 topics per day

2014-08-11 Thread Todd Palino
You need to consider your total partition count as you do this. After 30 days, assuming 1 partition per topic, you have 7,200 partitions. Depending on how many brokers you have, this can start to be a problem. We just found an issue on one of our clusters that has over 70k partitions: there's no …