To clarify my earlier statement: I will continue working on Maelstrom as an alternative to the official Spark integration with Kafka, and keep the KafkaRDDs + Consumers as they are, until I find the official Spark Kafka integration more stable and resilient to Kafka broker issues/failures (which is the reason I have an infinite retry strategy in numerous places around Kafka-related routines).
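To illustrate the retry idea: a minimal sketch in Java of wrapping a flaky operation (standing in for a Kafka call) in an infinite retry loop with capped exponential backoff. The names and backoff values here are my own illustration, not Maelstrom's actual code.

```java
import java.util.function.Supplier;

public class RetryForever {

    // Keep retrying the operation until it succeeds, backing off
    // exponentially between attempts up to a maximum delay.
    public static <T> T retryForever(Supplier<T> op,
                                     long initialBackoffMs,
                                     long maxBackoffMs) {
        long backoff = initialBackoffMs;
        while (true) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                System.err.println("operation failed, retrying in "
                        + backoff + " ms: " + e.getMessage());
                try {
                    Thread.sleep(backoff);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException(ie);
                }
                backoff = Math.min(backoff * 2, maxBackoffMs); // capped backoff
            }
        }
    }

    public static void main(String[] args) {
        // Simulate a broker that fails twice before recovering.
        int[] attempts = {0};
        String result = retryForever(() -> {
            if (attempts[0]++ < 2) throw new RuntimeException("broker unavailable");
            return "ok";
        }, 10, 1000);
        System.out.println(result); // prints "ok" after two retries
    }
}
```

In a real Kafka routine the supplier would be the fetch/offset call that can fail when a broker goes away; the loop simply never gives up, which is what keeps the job alive overnight.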
Not that I'm complaining or competing; at the end of the day, having a Spark app that continues to work overnight gives a developer a good night's sleep :)

On Thu, Aug 25, 2016 at 3:23 AM, Jeoffrey Lim <jeoffr...@gmail.com> wrote:
> Hi Cody, thank you for pointing out sub-millisecond processing, it is
> an "exaggerated" term :D I simply got excited releasing this project; it
> should be: "millisecond stream processing at the Spark level".
>
> I highly appreciate the info about the latest Kafka consumer. I would
> need to get up to speed on the most recent improvements and new features
> of Kafka itself.
>
> I think with the features in Spark's latest Kafka 0.10 integration,
> Maelstrom's only remaining upside would be its simple, developer-friendly
> APIs. I'll play around with the Spark 2.0 kafka-010 KafkaRDD to see if
> this is feasible.
>
> On Wed, Aug 24, 2016 at 10:46 PM, Cody Koeninger <c...@koeninger.org> wrote:
>> Yes, spark-streaming-kafka-0-10 uses the new consumer. Besides
>> pre-fetching messages, the big reason for that is that security
>> features are only available with the new consumer.
>>
>> The Kafka project is at release 0.10.0.1 now; they think most of the
>> issues with the new consumer have been ironed out. You can track the
>> progress as to when they'll remove the "beta" label at
>> https://issues.apache.org/jira/browse/KAFKA-3283
>>
>> As far as I know, Kafka in general can't achieve sub-millisecond
>> end-to-end stream processing, so my guess is you need to be more
>> specific about your terms there.
>>
>> I promise I'm not trying to start a pissing contest :) I just wanted
>> to check whether you were aware of the current state of the other
>> consumers. Collaboration is always welcome.
>>
>> On Tue, Aug 23, 2016 at 10:18 PM, Jeoffrey Lim <jeoffr...@gmail.com> wrote:
>> > Apologies, I was not aware that Spark 2.0 has Kafka consumer
>> > caching/pooling now.
>> > What I have checked is the latest Kafka consumer, and I believe it
>> > is still beta quality.
>> >
>> > https://kafka.apache.org/documentation.html#newconsumerconfigs
>> >
>> >> Since 0.9.0.0 we have been working on a replacement for our existing
>> >> simple and high-level consumers.
>> >> The code is considered beta quality.
>> >
>> > I'm not sure about this: does the Spark 2.0 Kafka 0.10 integration
>> > already use this one? Is it now stable?
>> > With this caching feature in Spark 2.0, could it achieve
>> > sub-millisecond stream processing now?
>> >
>> > Maelstrom still uses the old Kafka SimpleConsumer. This library was
>> > made open source so that I could continue working on it for future
>> > updates and improvements, such as when the latest Kafka consumer
>> > gets a stable release.
>> >
>> > We have been using Maelstrom's "caching concept" for a long time now,
>> > as receiver-based Spark Kafka integration does not work for us. There
>> > were thoughts about using the Direct Kafka APIs; however, Maelstrom
>> > has very simple APIs and just "simply works" even under unstable
>> > scenarios (e.g. advertised-hostname failures on EMR).
>> >
>> > I believe Maelstrom will work even with Spark 1.3 and Kafka 0.8.2.1
>> > (and of course with the latest Kafka 0.10 as well).
>> >
>> > On Wed, Aug 24, 2016 at 9:49 AM, Cody Koeninger <c...@koeninger.org> wrote:
>> >>
>> >> Were you aware that the spark 2.0 / kafka 0.10 integration also
>> >> reuses kafka consumer instances on the executors?
>> >>
>> >> On Tue, Aug 23, 2016 at 3:19 PM, Jeoffrey Lim <jeoffr...@gmail.com> wrote:
>> >> > Hi,
>> >> >
>> >> > I have released the first version of a new Kafka integration with
>> >> > Spark that we use in the company I work for: open sourced and
>> >> > named Maelstrom.
>> >> >
>> >> > It is unique compared to other solutions out there, as it reuses
>> >> > the Kafka consumer connection to achieve sub-millisecond latency.
>> >> >
>> >> > This library has been running stably in a production environment
>> >> > and has proven resilient to numerous production issues.
>> >> >
>> >> > Please check out the project's page on GitHub:
>> >> >
>> >> > https://github.com/jeoffreylim/maelstrom
>> >> >
>> >> > Contributors welcome!
>> >> >
>> >> > Cheers!
>> >> >
>> >> > Jeoffrey Lim
>> >> >
>> >> > P.S. I am also looking for a job opportunity; please look me up
>> >> > on LinkedIn