Re: Zero Data Loss in Spark with Kafka

2016-10-26 Thread Cody Koeninger
Honestly, I would stay far away from saving offsets in Zookeeper if at all possible. It's better to store them alongside your results. On Wed, Oct 26, 2016 at 10:44 AM, Sunita Arvind wrote: > This is enough to get it to work: > > df.save(conf.getString("ParquetOutputPath")+offsetSaved, "parquet",
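For context, a minimal sketch of the "store them alongside your results" idea with the 0.8 direct stream. The broker, topic, and output path below are hypothetical, and encoding the ending offsets into the output location is just one way to make them recoverable from the results themselves:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object OffsetsWithResults {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("offsets-with-results"), Seconds(60))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // hypothetical broker
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events")) // hypothetical topic

    stream.foreachRDD { rdd =>
      // The direct stream exposes the exact offsets behind each batch.
      val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // Encode the ending offsets into the output location so a restart can
      // recover them from the results rather than from ZooKeeper.
      val tag = offsets.map(o => s"${o.topic}-${o.partition}-${o.untilOffset}").mkString("_")
      rdd.map(_._2).saveAsTextFile(s"/data/out/batch_$tag") // hypothetical output path
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```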

Re: Zero Data Loss in Spark with Kafka

2016-10-26 Thread Sunita Arvind
This is enough to get it to work: df.save(conf.getString("ParquetOutputPath")+offsetSaved, "parquet", SaveMode.Overwrite) And tests so far (in local env) seem good with the edits. Yet to test on the cluster. Cody, appreciate your thoughts on the edits. Just want to make sure I am not doing an ov

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
The error in the file I just shared is here: val partitionOffsetPath:String = topicDirs.consumerOffsetDir + "/" + partition._2(0); --> this was previously just partition, and hence there was an error fetching the offset. Still testing. Somehow, Cody, your code never led to a file-already-exists sort of error

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
Attached is the edited code. Am I heading in the right direction? Also, I am missing something: it seems to work well as long as the application is running and the files are created correctly, but as soon as I restart the application, it goes back to a fromOffset of 0. Any thoughts? regards Sun
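One hedged sketch of the restart behaviour being described: unless the saved offsets are read back and passed to the fromOffsets overload of createDirectStream, the stream falls back to the beginning or to auto.offset.reset. The broker and the readSavedOffsets helper below are hypothetical:

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ResumeFromSavedOffsets {
  // Hypothetical helper: parse the last committed offsets from wherever they were
  // persisted with the previous run's output (e.g. the saved parquet/offset path).
  def readSavedOffsets(): Map[TopicAndPartition, Long] =
    Map(TopicAndPartition("events", 0) -> 12345L) // placeholder values

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("resume-from-offsets"), Seconds(60))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // hypothetical broker
    val fromOffsets = readSavedOffsets()
    val handler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

    // This overload starts each partition exactly where the previous run left off,
    // instead of going back to offset 0.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets, handler)

    stream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```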

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
Thanks for confirming, Cody. To get the library to work, I had to do: val offsetsStore = new ZooKeeperOffsetsStore(conf.getString("zkHosts"), "/consumers/topics/"+ topics + "/0") It worked well. However, I had to specify the partitionId in the zkPath. If I want the library to pick all the partition
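If the goal is to cover every partition rather than hard-coding "/0", a rough sketch is below; it assumes the usual /consumers/&lt;group&gt;/offsets/&lt;topic&gt;/&lt;partition&gt; layout, a hypothetical group name, and that the partition count is available from topic metadata or configuration:

```scala
import kafka.utils.ZKGroupTopicDirs

// Hypothetical group/topic; numPartitions is assumed to come from topic metadata or config.
val topicDirs = new ZKGroupTopicDirs("my-consumer-group", "events")
val numPartitions = 3
val partitionOffsetPaths: Seq[String] =
  (0 until numPartitions).map(p => s"${topicDirs.consumerOffsetDir}/$p")
// e.g. /consumers/my-consumer-group/offsets/events/0, .../1, .../2
```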

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Cody Koeninger
You are correct that you shouldn't have to worry about broker id. I'm honestly not sure specifically what else you are asking at this point. On Tue, Oct 25, 2016 at 1:39 PM, Sunita Arvind wrote: > Just re-read the kafka architecture. Something that slipped my mind is, it > is leader based. So to

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
Just re-read the Kafka architecture. Something that slipped my mind is that it is leader-based, so the topic/partitionId pair will be the same on all the brokers, and we do not need to consider the broker id while storing offsets. Still exploring the rest of the items. regards Sunita On Tue, Oct 25, 2016 at 11:09 AM, S
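A tiny sketch of that implication: an offset store only needs (topic, partition) as its key; the values below are placeholders:

```scala
// Offsets are keyed by (topic, partition) only; because the log is replicated from the
// partition leader, the same offset refers to the same record on every broker.
case class OffsetKey(topic: String, partition: Int)

val lastCommitted: Map[OffsetKey, Long] = Map(
  OffsetKey("events", 0) -> 42L, // placeholder values
  OffsetKey("events", 1) -> 17L
)
```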

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
Hello Experts, I am trying to use the save-to-ZK design. Just saw Sudhir's comment that it is the old approach. Any reasons for that? Any issues observed with saving to ZK? The way we are planning to use it is: 1. Following http://aseigneurin.github.io/2016/05/07/spark-kafka- achieving-zero-data-lo

Re: Zero Data Loss in Spark with Kafka

2016-08-23 Thread Cody Koeninger
See https://github.com/koeninger/kafka-exactly-once On Aug 23, 2016 10:30 AM, "KhajaAsmath Mohammed" wrote: > Hi Experts, > > I am looking for some information on how to acheive zero data loss while > working with kafka and Spark. I have searched online and blogs have > different answer. Please l

Re: Zero Data Loss in Spark with Kafka

2016-08-23 Thread Sudhir Babu Pothineni
Saving offsets to ZooKeeper is the old approach; checkpointing internally saves the offsets to HDFS (or whatever location you checkpoint to). More details here: http://spark.apache.org/docs/latest/streaming-kafka-integration.html On Tue, Aug 23, 2016 at 10:30 AM, KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wro
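For reference, a minimal sketch of the checkpoint-based approach the linked guide describes; the checkpoint directory, broker, and topic are hypothetical:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object CheckpointedKafkaApp {
  val checkpointDir = "hdfs:///checkpoints/kafka-app" // hypothetical location

  def createContext(): StreamingContext = {
    val ssc = new StreamingContext(new SparkConf().setAppName("checkpointed-kafka-app"), Seconds(60))
    ssc.checkpoint(checkpointDir)
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // hypothetical broker
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events")) // hypothetical topic
    stream.map(_._2).print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On a clean start this builds a new context; after a restart it is recovered
    // from the checkpoint, including the Kafka offsets of unfinished batches.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```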

Zero Data Loss in Spark with Kafka

2016-08-23 Thread KhajaAsmath Mohammed
Hi Experts, I am looking for some information on how to achieve zero data loss while working with Kafka and Spark. I have searched online and the blogs have different answers. Please let me know if anyone has an idea on this. Blog 1: https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-

Re: Roadmap for Spark with Kafka on Scala 2.11?

2015-06-04 Thread Tathagata Das
But compile scope is supposed to be added to the assembly. https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Scope On Thu, Jun 4, 2015 at 1:24 PM, algermissen1971 wrote: > Hi Iulian, > > On 26 May 2015, at 13:04, Iulian Dragoș > wrote: > > > > >

Re: Roadmap for Spark with Kafka on Scala 2.11?

2015-06-04 Thread algermissen1971
Hi Iulian, On 26 May 2015, at 13:04, Iulian Dragoș wrote: > > On Tue, May 26, 2015 at 10:09 AM, algermissen1971 > wrote: > Hi, > > I am setting up a project that requires Kafka support and I wonder what the > roadmap is for Scala 2.11 Support (including Kafka). > > Can we expect to see 2.1

Re: Roadmap for Spark with Kafka on Scala 2.11?

2015-05-26 Thread Iulian Dragoș
On Tue, May 26, 2015 at 10:09 AM, algermissen1971 < algermissen1...@icloud.com> wrote: > Hi, > > I am setting up a project that requires Kafka support and I wonder what > the roadmap is for Scala 2.11 Support (including Kafka). > > Can we expect to see 2.11 support anytime soon? > The upcoming 1.

Roadmap for Spark with Kafka on Scala 2.11?

2015-05-26 Thread algermissen1971
Hi, I am setting up a project that requires Kafka support and I wonder what the roadmap is for Scala 2.11 Support (including Kafka). Can we expect to see 2.11 support anytime soon? Jan

Re: spark with kafka

2015-04-19 Thread Cody Koeninger
Take a look at https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md if you haven't already. If you're fine with saving offsets yourself, I'd stick with KafkaRDD, as Koert said. I haven't tried 2 hour stream batch durations, so I can't vouch for using createDirectStream in that

Re: spark with kafka

2015-04-18 Thread Koert Kuipers
I mean to say it is simpler in case of failures, restarts, upgrades, etc., not just failures. But they did do a lot of work on streaming from Kafka in Spark 1.3.x to make it simpler (streaming simply calls KafkaRDD for every batch if you use KafkaUtils.createDirectStream), so maybe I am wrong and s

Re: spark with kafka

2015-04-18 Thread Koert Kuipers
Yeah, I think I would pick the second approach because it is simpler operationally in case of any failures. But of course, the smaller the window gets, the more attractive the streaming solution gets. We do daily extracts, not every 2 hours. On Sat, Apr 18, 2015 at 2:57 PM, Shushant Arora wrote: > T

Re: spark with kafka

2015-04-18 Thread Shushant Arora
Thanks Koert. So in short, for the high-level API I'll have to go with Spark Streaming only, and there the issue is handling cluster restarts. Is that why you opted for the second approach of a batch job, or was it due to the batch interval (2 hours is large for a stream job), or some other reason? On Sun, Apr 19, 2015 a

Re: spark with kafka

2015-04-18 Thread Koert Kuipers
KafkaRDD uses the simple consumer API, and I think you need to handle offsets yourself, unless things have changed since I last looked. I would do the second approach. On Sat, Apr 18, 2015 at 2:42 PM, Shushant Arora wrote: > Thanks !! > I have few more doubts : > > Does kafka RDD uses simpleAPI for kafk

Re: spark with kafka

2015-04-18 Thread Shushant Arora
Thanks!! I have a few more doubts: Does KafkaRDD use the simple API for the Kafka consumer or the high-level API? That is, do I need to handle partition offsets myself or will that be taken care of by KafkaRDD? Plus, which one is better for batch programming? I have a requirement to read kafka messages by a spark

Re: spark with kafka

2015-04-18 Thread Ilya Ganelin
That's a much better idea :) On Sat, Apr 18, 2015 at 11:22 AM Koert Kuipers wrote: > Use KafkaRDD directly. It is in spark-streaming-kafka package > > On Sat, Apr 18, 2015 at 6:43 AM, Shushant Arora > wrote: > >> Hi >> >> I want to consume messages from kafka queue using spark batch program not

Re: spark with kafka

2015-04-18 Thread Koert Kuipers
Use KafkaRDD directly. It is in spark-streaming-kafka package On Sat, Apr 18, 2015 at 6:43 AM, Shushant Arora wrote: > Hi > > I want to consume messages from kafka queue using spark batch program not > spark streaming, Is there any way to achieve this, other than using low > level(simple api) of
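A rough sketch of that batch usage via KafkaUtils.createRDD; the broker, topic, and offsets below are hypothetical, and in practice the from/until offsets come from wherever the previous run saved them:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

object KafkaBatchRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kafka-batch-read"))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // hypothetical broker

    // Each OffsetRange pins one partition to an explicit [from, until) window,
    // which is what makes the batch read repeatable.
    val offsetRanges = Array(
      OffsetRange("events", 0, 0L, 1000L), // hypothetical topic/offsets
      OffsetRange("events", 1, 0L, 1000L)
    )

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)
    rdd.map(_._2).take(10).foreach(println)

    sc.stop()
  }
}
```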

RE: spark with kafka

2015-04-18 Thread Ganelin, Ilya
ime To: user Subject: spark with kafka Hi I want to consume messages from kafka queue using spark batch program not spark streaming, Is there any way to achieve this, other than using low level(simple api) of kafka consumer. Thanks

spark with kafka

2015-04-18 Thread Shushant Arora
Hi, I want to consume messages from a Kafka queue using a Spark batch program, not Spark Streaming. Is there any way to achieve this, other than using the low-level (simple) API of the Kafka consumer? Thanks