1) You save everything twice (Kafka and HDFS).
2) You need to enable the checkpoint feature, which means you cannot change the configuration of the job, because the Spark streaming context is deserialized from HDFS every time you restart the job.
3) It's not clear what happens if HDFS is unavailable.
4) It's not transactional. For example, we tried to use it to move some data from Kafka to Elasticsearch. If Elasticsearch is down, the Spark streaming job doesn't stop; it only cares about failures of Spark or Kafka.
On Fri, Mar 13, 2015 at 8:48 PM, Andrew Otto <ao...@wikimedia.org> wrote:

> > We are currently using spark streaming 1.2.1 with kafka and write-ahead log.
> > I will only say one thing : "a nightmare". ;-)
>
> I’d be really interested in hearing about your experience here. I’m
> exploring streaming frameworks a bit, and Spark Streaming is just so easy
> to use and set up. It’d be nice if it worked well.
>
> > On Mar 13, 2015, at 15:38, Alberto Miorin <amiorin78+ka...@gmail.com> wrote:
> >
> > We are currently using spark streaming 1.2.1 with kafka and write-ahead log.
> > I will only say one thing : "a nightmare". ;-)
> >
> > Let's see if things are better with 1.3.0 :
> > http://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html
> >
> > On Fri, Mar 13, 2015 at 8:33 PM, William Briggs <wrbri...@gmail.com> wrote:
> >
> >> Spark Streaming also has built-in support for Kafka, and as of Spark 1.2,
> >> it supports using an HDFS write-ahead log to ensure zero data loss while
> >> streaming:
> >> https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html
> >>
> >> -Will
> >>
> >> On Fri, Mar 13, 2015 at 3:28 PM, Alberto Miorin <amiorin78+ka...@gmail.com> wrote:
> >>
> >>> I'll try this too. It looks very promising.
> >>>
> >>> Thx
> >>>
> >>> On Fri, Mar 13, 2015 at 8:25 PM, Gwen Shapira <gshap...@cloudera.com> wrote:
> >>>
> >>>> There's a KafkaRDD that can be used in Spark:
> >>>> https://github.com/tresata/spark-kafka. It doesn't exactly replace
> >>>> Camus, but should be useful in building a Camus-like system in Spark.
> >>>>
> >>>> On Fri, Mar 13, 2015 at 12:15 PM, Alberto Miorin
> >>>> <amiorin78+ka...@gmail.com> wrote:
> >>>>>
> >>>>> We use Spark on Mesos. I don't want to partition our cluster
> >>>>> because of one YARN job (Camus).
> >>>>>
> >>>>> Best
> >>>>>
> >>>>> Alberto
> >>>>>
> >>>>> On Fri, Mar 13, 2015 at 7:43 PM, Otis Gospodnetic <
> >>>>> otis.gospodne...@gmail.com> wrote:
> >>>>>
> >>>>>> Just curious - why - is Camus not suitable/working?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Otis
> >>>>>> --
> >>>>>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> >>>>>> Solr & Elasticsearch Support * http://sematext.com/
> >>>>>>
> >>>>>> On Fri, Mar 13, 2015 at 2:33 PM, Alberto Miorin <
> >>>>>> amiorin78+ka...@gmail.com> wrote:
> >>>>>>
> >>>>>>> I was wondering if anybody has already tried to mirror a kafka topic
> >>>>>>> to hdfs just by copying the log files from the topic directory of the
> >>>>>>> broker (like 00000000000023244237.log).
> >>>>>>>
> >>>>>>> The file format is very simple :
> >>>>>>> https://twitter.com/amiorin78/status/576448691139121152/photo/1
> >>>>>>>
> >>>>>>> Implementing an InputFormat should not be so difficult.
> >>>>>>>
> >>>>>>> Any drawbacks?
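For anyone curious about the "file format is very simple" claim above: a rough sketch of what reading such a segment file involves, outside of Hadoop. This is not the InputFormat itself, just a minimal standalone parser for the Kafka 0.8-era on-disk layout (message magic byte 0); it does not verify CRCs and does not handle compressed message sets (attributes != 0), so treat the field layout as something to double-check against the Kafka protocol documentation.

```python
import struct

def parse_log_segment(data):
    """Parse a Kafka 0.8-style on-disk log segment (magic byte 0).

    Each entry is: 8-byte offset, 4-byte message size, then the message:
    4-byte CRC, 1-byte magic, 1-byte attributes, 4-byte key length
    (-1 for null), key bytes, 4-byte value length (-1 for null),
    value bytes. All integers are big-endian.
    """
    messages = []
    pos = 0
    while pos + 12 <= len(data):
        # Entry header: logical offset + size of the message that follows.
        offset, size = struct.unpack_from(">qi", data, pos)
        pos += 12
        # Message header: CRC, magic, attributes (CRC not verified here).
        crc, magic, attrs = struct.unpack_from(">iBB", data, pos)
        p = pos + 6
        key_len, = struct.unpack_from(">i", data, p)
        p += 4
        key = None if key_len == -1 else data[p:p + key_len]
        if key_len != -1:
            p += key_len
        val_len, = struct.unpack_from(">i", data, p)
        p += 4
        value = None if val_len == -1 else data[p:p + val_len]
        messages.append((offset, key, value))
        pos += size  # jump to the next entry
    return messages
```

An InputFormat built on this would mainly need to add split handling, i.e. scanning forward from an arbitrary byte offset to the next valid entry boundary, which is the less trivial part.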