You might want to take a look at mps (github.com/milindparikh/mps). As I was thinking about blogging about it, your use case magically surfaced on the mailing list. Since one of the supposed benefits of having a framework is to enable quickly building things on top of it, I took a crack at your use case. I attribute it in Part II (http://milindparikh.blogspot.com/2013/06/introducing-mps-eventing-framework-for_14.html); Part I (http://milindparikh.blogspot.com/2013/06/introducing-mps-eventing-framework-for.html) is here.

mps is not in Java, so you may not be able to use the framework as is, but I will try to explain the basic concept here in the hope that it will be useful to you and to others who might be interested in this sort of thing.

Firstly, mps recognizes that Kafka squeezes everything out of the hard disk, and that the OS happily laps up RAM as required. The two other optimizable parameters are network and compute capacity. The concepts of topic and partition are taken up by Kafka itself, and therefore, at the application level, if the utilization of Kafka is to be optimized, one shouldn't try to munge those concepts.

The other observation is that Kafka treats the message as a unit. The fact that a message has a key (and a value) is mentioned only "by the way" in the Kafka documentation. Of course the partitioning relies on keys, but that's where it stops. This is fortunate, because it potentially gives you a way of managing messages at a higher level, through consistent hashing, exploiting network and compute capacity while leaving Kafka to manage the disk and the OS the RAM, so that you are not saturating the entire available bandwidth and expending enormous compute capacity searching for the proverbial needle in a haystack. I say potentially because I am in Erlang land, where I know mps can rest on the shoulders of "giants".

HTH
Milind

On Fri, Jun 14, 2013 at 8:27 PM, Josh Foure <user...@yahoo.com> wrote:

> Hi Mahendra, thanks for your reply. I was planning on using the Atmosphere
> Framework (http://async-io.org/) to handle the web push stuff (I've never
> used it before, but we use PrimeFaces a little and that's what they use
> for their components). I thought that I would have the JVM that the user
> is connected to just be a Kafka consumer. Given the topic limitations, I
> think I am back to having a single topic that all guest data is placed
> on, and having all JVMs publish to and consume from that same topic. So
> it would look something like this:
>
> - I have 20 Web JVMs.
> - Every minute 100 people log in per JVM, so 2,000 log-ins per minute.
>   Each Web JVM publishes a single message per log-in.
> - My data services consume the log-in event and then create about 1,000
>   messages per user containing data about that user. Each data message
>   will probably be between 500 bytes and 2k; let's assume an average of
>   1k per message, so that would be 1 MB per user, or about 2 GB per
>   minute.
> - The Recommendation service would consume all 2 GB of data per minute,
>   end up using only a small amount of it, and then add its recommendation
>   messages to the same topic.
> - Each Web JVM would also consume the 2 GB of data plus the handful of
>   recommendation messages per minute, and end up ignoring everything but
>   the recommendation messages (especially since the 2 GB represents the
>   data for all the guests, while each JVM only has 1/20 of the guests
>   logged in).
>
> It seems wasteful to put 2 GB of data per minute into Kafka only to have
> the Recommendation service consume all of it and end up using a few k,
> and to have the web consume all of it when it just wants the few
> recommendation messages. However, the benefit of using a single topic is
> that in the future other services could consume more of the data or the
> recommendation messages, and since everything is on the same topic the
> ordering is guaranteed. In our immediate use case we could put the
> recommendation messages on their own topic, but in a sense we would be
> coupling our use case to our choice of topics. If we want the web to
> start also showing a little bit of the data from the data messages, we
> would be back to consuming the 2 GB of data in the Web JVMs.
>
> Traditionally we would just have the Web call a service on the
> Recommendation system (possibly asynchronously), which in turn would call
> the database to load just the data it needs.
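The fan-out in Josh's bullets above is worth a quick sanity check with back-of-envelope arithmetic. A small sketch (the figures are the ones quoted in the thread; the class and method names are illustrative only):

```java
// Back-of-envelope check of the fan-out described in the thread.
// All figures come from Josh's email; nothing here is measured.
public class FanOutEstimate {

    static final int WEB_JVMS = 20;
    static final int LOGINS_PER_JVM_PER_MIN = 100;
    static final int DATA_MSGS_PER_USER = 1_000;
    static final int AVG_MSG_BYTES = 1_024; // assumed "1k per message" average

    /** Log-ins per minute across all web JVMs. */
    static long loginsPerMinute() {
        return (long) WEB_JVMS * LOGINS_PER_JVM_PER_MIN; // 2,000
    }

    /** Bytes published to the shared topic per minute. */
    static long bytesPerMinute() {
        return loginsPerMinute() * DATA_MSGS_PER_USER * AVG_MSG_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(loginsPerMinute() + " log-ins/min");
        System.out.println(bytesPerMinute() / (1024.0 * 1024 * 1024) + " GiB/min");
    }
}
```

This comes out to roughly 1.9 GiB per minute, i.e. the "about 2 GB" in the thread, and as Josh notes, the Recommendation service reads that full volume once and each of the 20 Web JVMs reads it again.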
> But we are thinking that by publishing all the data we have about the
> user (whether or not existing systems need that data in the immediate
> future, future systems might), we are creating a system where we can
> easily add new consumers to do new things with all this data. The main
> downside seems to be that most of the consumers are processing millions
> of messages that they have no interest in. Do you think that the benefits
> outweigh the cons? Is there a better way to achieve similar results?
>
> Thanks
> Josh
>
> ________________________________
> From: Mahendra M <mahendr...@gmail.com>
> To: users@kafka.apache.org; Josh Foure <user...@yahoo.com>
> Sent: Friday, June 14, 2013 8:03 AM
> Subject: Re: Using Kafka for "data" messages
>
> Hi Josh,
>
> Thanks for clarifying the use case. The idea is good, but I see the
> following three issues:
>
> 1. Creating a queue for each user. There could be limits on this.
> 2. Removing old queues.
> 3. If the same user logs in from multiple browsers, things get a bit
> more complex.
>
> Can I suggest an alternate approach, using a combination of Kafka and
> XMPP-BOSH/Comet?
>
> 1. User logs in. A message is sent on a Kafka queue.
> 2. The web browser starts a long-polling connection to a server
> (XMPP-BOSH / Comet).
> 3. Consumers pick up the message from (1) and do their job. They push
> their results to a results queue and to an XMPP end-point (u...@domain.com).
> 4. The Recommender can pick up from the results queue and push its
> result to the XMPP end-point.
> 5. The web front-end picks up the messages and does the displaying job.
>
> If you plan it out further, you can avoid using Kafka in this use case
> and just do it with XMPP (for steps 1 and 3).
>
> Also, you don't have to take care of a large number of queues, removing
> them, etc., and XMPP is really good at handling multiple end-points for a
> single user. (There are good XMPP servers like ejabberd and tigase, and
> good lightweight JS libraries for handling connections.)
>
> PS: I think my reply is going off-topic, so I will stop.
>
> Regards,
> Mahendra
>
> On Thu, Jun 13, 2013 at 11:17 PM, Josh Foure <user...@yahoo.com> wrote:
>
>> Hi Mahendra, I think that is where it gets a little tricky. I think it
>> would work something like this:
>>
>> 1. The Web sends a login event for user "user123" to topic "GUEST_EVENT".
>> 2. All of the systems consume those messages and publish their data
>> messages to topic "GUEST_DATA.user123".
>> 3. The Recommendation system gets all of the data from
>> "GUEST_DATA.user123", processes it, and then publishes back to the same
>> topic "GUEST_DATA.user123".
>> 4. The Web consumes the messages from that same topic (there is a
>> different topic for every user that has logged in), "GUEST_DATA.user123",
>> and when it finds the recommendation messages it pushes them to the
>> browser (note it will need to read all the other data messages and
>> discard them while looking for the recommendation messages). I have a
>> concern that the Web will be flooded with a ton of messages that it will
>> promptly drop, but I don't want to create a new "response" or
>> "recommendation" topic because then I feel like I am tightly coupling
>> the message to the functionality, and in the future different systems
>> may want to consume those messages as well.
>>
>> Does that make sense?
>> Josh
>>
>> ________________________________
>> From: Mahendra M <mahendr...@gmail.com>
>> To: users@kafka.apache.org; Josh Foure <user...@yahoo.com>
>> Sent: Thursday, June 13, 2013 12:56 PM
>> Subject: Re: Using Kafka for "data" messages
>>
>> Hi Josh,
>>
>> The idea looks very interesting. I just had one doubt:
>>
>> 1. A user logs in. His login id is sent on a topic.
>> 2. Other systems (consumers on this topic) consume this message and
>> publish their results to another topic.
>>
>> This will be happening without any particular order for hundreds of
>> users.
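Josh's per-user scheme quoted above boils down to deriving a topic name from the user id. A minimal sketch, with illustrative class and method names (the "GUEST_EVENT" / "GUEST_DATA.user123" topic names are the ones from the thread):

```java
// Sketch of the per-user topic naming described in the thread.
// Class and method names are illustrative, not from the thread.
public class GuestTopics {

    static final String EVENT_TOPIC = "GUEST_EVENT";
    static final String DATA_TOPIC_PREFIX = "GUEST_DATA.";

    /** Topic carrying all data messages for one logged-in user. */
    static String dataTopicFor(String userId) {
        return DATA_TOPIC_PREFIX + userId;
    }

    public static void main(String[] args) {
        System.out.println(dataTopicFor("user123")); // GUEST_DATA.user123
    }
}
```

Mahendra's issues 1 and 2 fall straight out of this: the broker ends up holding one topic per user that has ever logged in, and something has to clean those topics up.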
>> Now, when the site is being displayed to the user, how will you fetch
>> only the messages for that user from the queue?
>>
>> Regards,
>> Mahendra
>>
>> On Thu, Jun 13, 2013 at 8:51 PM, Josh Foure <user...@yahoo.com> wrote:
>>
>>> Hi all, my team is proposing a novel way of using Kafka and I am hoping
>>> someone can help do a sanity check on this:
>>>
>>> 1. When a user logs into our website, we will create a "logged in"
>>> event message in Kafka containing the user id.
>>> 2. 30+ systems (consumers, each in their own consumer group) will
>>> consume this event and look up data about this user id. They will then
>>> publish all of this data back out into Kafka as a series of data
>>> messages. One message may include the user's name, another the user's
>>> address, another the user's last 10 searches, another their last 10
>>> orders, etc. The plan is that a single "logged in" event may trigger
>>> hundreds if not thousands of additional data messages.
>>> 3. Another system, the "Product Recommendation" system, will have
>>> consumed the original "logged in" message and will also consume a
>>> subset of the data messages (realistically I think it would need to
>>> consume all of the data messages but would discard the ones it doesn't
>>> need). As the Product Recommendation system consumes the data messages,
>>> it will process recommended products and publish recommendation
>>> messages (which get more and more specific as it consumes more and more
>>> data messages).
>>> 4. The original website will consume the recommendation messages and
>>> show the recommendations to the user as it gets them.
>>>
>>> You don't see many systems implemented this way, but since Kafka has so
>>> much higher throughput than your typical MOM, this approach seems
>>> innovative.
>>>
>>> The benefits are:
>>>
>>> 1. If we start collecting more information about the users, we can
>>> simply start publishing it in new data messages, and consumers can
>>> start processing those messages whenever they want. If we were doing
>>> this in a more traditional SOA approach, the schemas would need to
>>> change every time we added a field, but with this approach we can just
>>> create new messages without touching existing ones.
>>> 2. We are looking to make our systems smaller, so if we end up with
>>> more, smaller systems that each publish a small number of events, it
>>> becomes easier to make and test changes. If we were doing this in a
>>> more traditional SOA approach, we would need to retest each consumer
>>> every time we changed our bigger SOA services.
>>>
>>> The downsides appear to be:
>>>
>>> 1. We may be publishing a large amount of data that never gets used but
>>> that everyone needs to consume to see whether they need it before
>>> discarding it.
>>> 2. The Product Recommendation system may need to wait until it has
>>> consumed a number of messages, keeping track of all the data
>>> internally, before it can start processing.
>>> 3. While we may be able to keep the messages somewhat small, the fact
>>> that they contain data means they will be bigger than your traditional
>>> EDA messages.
>>> 4. It seems like we can do a lot of this using SOA (we already have an
>>> ESB that can do transformations to address consumers expecting an older
>>> version of the data).
>>>
>>> Any insight is appreciated.
>>> Thanks,
>>> Josh
>>
>> --
>> Mahendra
>>
>> http://twitter.com/mahendra
>
> --
> Mahendra
>
> http://twitter.com/mahendra
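Downside 1 in Josh's list, that every consumer must inspect every message before discarding it, can be made concrete with a small filtering sketch. The Message shape and the type strings are assumptions for illustration, not anything from the thread:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of the "consume everything, discard most of it" downside:
// every consumer sees every message on the shared topic and must
// inspect each one. The Message record and type strings are illustrative.
public class DiscardingConsumer {

    record Message(String userId, String type, String payload) {}

    /** Keep only recommendation messages for users this web JVM hosts. */
    static List<Message> relevantTo(List<Message> batch, Set<String> localUsers) {
        return batch.stream()
                .filter(m -> m.type().equals("RECOMMENDATION"))
                .filter(m -> localUsers.contains(m.userId()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        var batch = List.of(
                new Message("user123", "NAME", "Josh"),
                new Message("user123", "LAST_ORDERS", "[...]"),
                new Message("user456", "RECOMMENDATION", "productA"),
                new Message("user123", "RECOMMENDATION", "productB"));
        // A JVM hosting only user123 throws away 3 of these 4 messages.
        System.out.println(relevantTo(batch, Set.of("user123")));
    }
}
```

Whatever the filter keeps, the consumer has already paid the network and deserialization cost for the whole batch, which is exactly the waste Josh is weighing against the flexibility of a single shared topic.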