great work, Dibyendu. looks like this would be a popular contribution. expanding on bharat's question a bit:

what happens if you submit multiple receivers to the cluster by creating and unioning multiple DStreams, as in the kinesis example here:
https://github.com/apache/spark/blob/ae58aea2d1435b5bb011e68127e1bcddc2edf5b2/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala#L123

for more context, the kinesis implementation above uses the Kinesis Client Library (KCL) to automatically assign - and load balance - stream shards among all KCL threads from all receivers (potentially coming and going as nodes die) on all executors/nodes, using DynamoDB as the association data store. ZooKeeper would be used for your Kafka consumers, of course, with ZooKeeper watches to handle the ephemeral nodes. and I see you're using Curator, which makes things easier.

as bharat suggested, running multiple receivers/dstreams may be desirable from a scalability and fault-tolerance standpoint. is this type of load balancing possible among your different Kafka consumers running in different ephemeral JVMs?

and isn't it fun proposing a popular piece of code? the question floodgates have opened! haha. :)

-chris

On Tue, Aug 26, 2014 at 7:29 AM, Dibyendu Bhattacharya <
dibyendu.bhattach...@gmail.com> wrote:

> Hi Bharat,
>
> Thanks for your email. If the "Kafka Reader" worker process dies, it will
> be replaced by a different machine, and it will start consuming from the
> offset where it left off (for each partition). The same would happen even
> if I had an individual Receiver for every partition.
>
> Regards,
> Dibyendu
>
>
> On Tue, Aug 26, 2014 at 5:43 AM, bharatvenkat <bvenkat.sp...@gmail.com>
> wrote:
>
>> I like this consumer for what it promises - better control over offsets
>> and recovery from failures. If I understand this right, it still uses a
>> single worker process to read from Kafka (one thread per partition) - is
>> there a way to specify multiple worker processes (on different machines)
>> to read from Kafka? Maybe one worker process for each partition?
>>
>> If there is no such option, what happens when the single machine hosting
>> the "Kafka Reader" worker process dies and is replaced by a different
>> machine (like in the cloud)?
>>
>> Thanks,
>> Bharat
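
For reference, below is a minimal sketch of the union-of-multiple-receivers pattern Chris references above. It is written against the stock receiver-based KafkaUtils.createStream API from spark-streaming-kafka (Spark 1.x), not against Dibyendu's consumer, whose factory method may differ; the ZooKeeper quorum, group id, topic name, and receiver count are placeholders.

// Sketch: multiple receiver-backed DStreams unioned into one, as in the
// Kinesis example linked above. Uses the stock Spark 1.x Kafka receiver API.
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object MultiReceiverKafkaWordCount {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("MultiReceiverKafka"), Seconds(2))

    val zkQuorum     = "zk1:2181,zk2:2181"   // placeholder ZooKeeper quorum
    val groupId      = "low-level-consumer"  // placeholder consumer group
    val numReceivers = 3                     // one receiver per DStream

    // One receiver-backed DStream per call; Spark places each receiver on an
    // executor, so the Kafka read load is spread across machines.
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, groupId, Map("mytopic" -> 1),
        StorageLevel.MEMORY_AND_DISK_SER)
    }

    // Union into a single DStream for downstream processing, mirroring the
    // Kinesis example.
    val lines = ssc.union(streams).map(_._2)
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Each createStream call yields its own receiver, so several calls spread the read work across up to that many executors; the union then presents them downstream as a single DStream, exactly as the Kinesis example does with its shard receivers.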
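And a rough sketch of the ephemeral-node-plus-watch recipe Chris describes for tracking consumers with Curator: each consumer JVM registers an ephemeral node, and every member watches the registry so it can rebalance partitions when consumers come and go. This is the generic recipe, not Dibyendu's actual implementation; the registry path and connect string are placeholders.

// Sketch: consumer membership via Curator ephemeral nodes and a child watch.
import org.apache.curator.framework.{CuratorFramework, CuratorFrameworkFactory}
import org.apache.curator.framework.recipes.cache.{PathChildrenCache, PathChildrenCacheEvent, PathChildrenCacheListener}
import org.apache.curator.retry.ExponentialBackoffRetry
import org.apache.zookeeper.CreateMode

object ConsumerMembership {
  def main(args: Array[String]): Unit = {
    val client = CuratorFrameworkFactory.newClient("zk1:2181", new ExponentialBackoffRetry(1000, 3))
    client.start()

    val registry = "/consumers/low-level-kafka/ids"  // placeholder registry path

    // Ephemeral-sequential node: ZooKeeper removes it automatically when this
    // JVM's session dies, which is what lets the survivors detect the failure.
    client.create()
      .creatingParentsIfNeeded()
      .withMode(CreateMode.EPHEMERAL_SEQUENTIAL)
      .forPath(registry + "/consumer-", Array.emptyByteArray)

    // Watch the registry; on any membership change, re-divide the topic's
    // partitions over the currently live consumers.
    val cache = new PathChildrenCache(client, registry, true)
    cache.getListenable.addListener(new PathChildrenCacheListener {
      override def childEvent(c: CuratorFramework, event: PathChildrenCacheEvent): Unit =
        event.getType match {
          case PathChildrenCacheEvent.Type.CHILD_ADDED |
               PathChildrenCacheEvent.Type.CHILD_REMOVED =>
            println(s"rebalancing across ${cache.getCurrentData.size()} live consumers")
          case _ => // ignore connection-state and data-update events
        }
    })
    cache.start()
  }
}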
what happens if you submit multiple receivers to the cluster by creating and unioning multiple DStreams as in the kinesis example here: https://github.com/apache/spark/blob/ae58aea2d1435b5bb011e68127e1bcddc2edf5b2/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala#L123 for more context, the kinesis implementation above uses the Kinesis Client Library (KCL) to automatically assign - and load balance - stream shards among all KCL threads from all receivers (potentially coming and going as nodes die) on all executors/nodes using DynamoDB as the association data store. ZooKeeper would be used for your Kafka consumers, of course. and ZooKeeper watches to handle the ephemeral nodes. and I see you're using Curator, which makes things easier. as bharat suggested, running multiple receivers/dstreams may be desirable from a scalability and fault tolerance standpoint. is this type of load balancing possible among your different Kafka consumers running in different ephemeral JVMs? and isn't it fun proposing a popular piece of code? the question floodgates have opened! haha. :) -chris On Tue, Aug 26, 2014 at 7:29 AM, Dibyendu Bhattacharya < dibyendu.bhattach...@gmail.com> wrote: > Hi Bharat, > > Thanks for your email. If the "Kafka Reader" worker process dies, it will > be replaced by different machine, and it will start consuming from the > offset where it left over ( for each partition). Same case can happen even > if I tried to have individual Receiver for every partition. > > Regards, > Dibyendu > > > On Tue, Aug 26, 2014 at 5:43 AM, bharatvenkat <bvenkat.sp...@gmail.com> > wrote: > >> I like this consumer for what it promises - better control over offset and >> recovery from failures. If I understand this right, it still uses single >> worker process to read from Kafka (one thread per partition) - is there a >> way to specify multiple worker processes (on different machines) to read >> from Kafka? Maybe one worker process for each partition? >> >> If there is no such option, what happens when the single machine hosting >> the >> "Kafka Reader" worker process dies and is replaced by a different machine >> (like in cloud)? >> >> Thanks, >> Bharat >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/Low-Level-Kafka-Consumer-for-Spark-tp11258p12788.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> >