Thibaud, I wouldn't say this is a 'robust' solution, but the Wikimedia Foundation uses a piece of software we wrote called udp2log. We are in the process of replacing it with more robust direct Kafka producers, but it has worked for us in the interim. udp2log is a C++ daemon that listens for (newline-delimited) messages over UDP and then multiplexes them out to pipes or files. You could use it to pipe your UDP traffic into the default console producer that ships with Kafka. Not 'robust' for sure, but it would work, I think.
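The udp2log config is basically just 'file' and 'pipe' lines with a
sampling factor (1 = keep every message). Off the top of my head -- check
the example config linked below for the real syntax -- piping into the
console producer would look something like this (the script path, broker
list, and topic name here are made up):

    pipe 1 /usr/bin/kafka-console-producer.sh --broker-list broker1:9092 --topic udp-stream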
Source: https://github.com/wikimedia/analytics-udplog
Deb package: http://apt.wikimedia.org/wikimedia/pool/main/u/udplog/
Example config: https://gist.github.com/ottomata/8711809

Also, as a proof of concept, one of my coworkers wrote this:
https://github.com/atdt/UdpKafka

Similar to udp2log, but meant for exactly what you are asking for:
relaying UDP packets into Kafka.

-Ao

On Jan 30, 2014, at 10:20 AM, Clark Breyman <cl...@breyman.com> wrote:

> Thibaud,
>
> Sounds like one of your issues will be upstream of Kafka. Robust and UDP
> aren't something I usually think of together unless you have additional
> bookkeeping to detect and request lost messages. 8MB/s shouldn't be much
> of a problem unless the messages are very small and you're looking for
> individual commits. You also have the challenge of having the server
> process/machine/network go away after the UDP message is received but
> before it can be pushed to Kafka.
>
> Beyond that, there are a lot of server frameworks that work fine. I use
> Dropwizard mostly since I like Java, though it doesn't support UDP
> resources. There are plenty of options there, but that's probably not a
> Kafka issue.
>
>
> On Thu, Jan 30, 2014 at 6:38 AM, Philip O'Toole <phi...@loggly.com> wrote:
>
>> Well, you could start by looking at the Kafka Producer source code for
>> some ideas. We have built plenty of solid software on that.
>>
>> As to your goal of building something solid, robust, and critical: all
>> I can say is that you need to keep your Producer as simple as possible
>> -- the simpler it is, the less likely it is to crash or have bugs --
>> and you must test it very well. Get the data to Kafka as fast as
>> possible, so that the chance of losing any due to a crash is very
>> small. Take a long time to test it. The Producers I have written (in
>> C++) run for weeks without going down (and then we usually bring them
>> down on purpose for upgrades). However, they were in test for months
>> too.
>>
>> http://www.youtube.com/watch?v=LpNbjXFPyZ0
>>
>>
>> On Thu, Jan 30, 2014 at 6:31 AM, Thibaud Chardonnens
>> <thibaud...@gmail.com> wrote:
>>
>>> Thanks for your quick answer.
>>> Yes, sorry, it's probably too broad, but my main question was whether
>>> there are any best practices for building a robust, fault-tolerant
>>> producer that guarantees no data will be dropped while listening on
>>> the port. From my point of view, the producer will be the most
>>> critical part of the system: if something goes wrong with it, the
>>> workflow stops and data is lost.
>>>
>>> Do you by any chance have a pointer to an existing implementation of
>>> such a producer?
>>>
>>> Thanks
>>>
>>>
>>> On 30 Jan 2014, at 15:13, Philip O'Toole <phi...@loggly.com> wrote:
>>>
>>>> What exactly are you struggling with? Your question is too broad.
>>>> What you want to do is eminently possible -- I have done it myself
>>>> from scratch.
>>>>
>>>> Philip
>>>>
>>>>> On Jan 30, 2014, at 6:00 AM, Thibaud Chardonnens
>>>>> <thibaud...@gmail.com> wrote:
>>>>>
>>>>> Hello -- I am struggling with how to design a robust implementation
>>>>> of a producer.
>>>>>
>>>>> My use case is quite simple:
>>>>> I want to process a relatively big stream (~8MB/s) with Storm. Kafka
>>>>> will be used as an intermediary between the stream and Storm. The
>>>>> stream is sent to a specific server on a specific port (through
>>>>> UDP). So Storm will be the consumer, and I need to write a producer
>>>>> (basically in Java) that will listen on that specific port and send
>>>>> messages to a Kafka topic.
>>>>>
>>>>> Kafka and Storm are well designed and fault-tolerant: if a node goes
>>>>> down, the whole environment continues to work properly, etc.
>>>>> Therefore my producer will be a single point of failure in the
>>>>> workflow. Moreover, writing such a producer is not so easy -- I'll
>>>>> need to write a multithreaded server to keep up with the throughput
>>>>> of the stream, with no guarantee that data won't be dropped...
>>>>>
>>>>> So I would like to know whether there are any best practices for
>>>>> writing such a producer, or whether there is another (maybe simpler)
>>>>> way to do it.
>>>>>
>>>>> Thanks,
>>>>> Thibaud
>>>
>>>
>>
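P.S. If you do end up writing your own relay in Java, the core of it is
small. Here is a rough, untested sketch against the 0.8 producer API --
the port, broker list, topic name, and batching settings are all made up,
so tune them for your ~8MB/s stream. It uses the async producer so sends
are batched off the receive thread, and a large kernel receive buffer to
absorb bursts. As Clark notes, anything that arrives while the process is
down (or that overflows the socket buffer) is simply lost.

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.nio.charset.Charset;
    import java.util.Properties;

    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class UdpToKafka {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Made-up broker list -- replace with your own.
            props.put("metadata.broker.list", "broker1:9092,broker2:9092");
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            // Async mode batches sends on a background thread so the
            // receive loop isn't blocked on every message.
            props.put("producer.type", "async");
            props.put("batch.num.messages", "500");
            props.put("request.required.acks", "1");
            Producer<String, String> producer =
                    new Producer<String, String>(new ProducerConfig(props));

            // Listen on the port the UDP stream is sent to (made up).
            DatagramSocket socket = new DatagramSocket(5140);
            // Ask the kernel for a big receive buffer to absorb bursts;
            // packets that overflow it are dropped silently by the OS.
            socket.setReceiveBufferSize(8 * 1024 * 1024);

            byte[] buf = new byte[65536];
            DatagramPacket packet = new DatagramPacket(buf, buf.length);
            while (true) {
                // Reset the length: receive() shrinks it to the last
                // packet's size, which would truncate the next one.
                packet.setLength(buf.length);
                socket.receive(packet);
                String msg = new String(packet.getData(), 0,
                        packet.getLength(), Charset.forName("UTF-8"));
                producer.send(
                        new KeyedMessage<String, String>("udp-stream", msg));
            }
        }
    }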