Re: Reading Real Time Data only from Kafka

2015-05-19 Thread Akhil Das
Cool. Thanks for the detailed response, Cody. Thanks, Best Regards

Re: Reading Real Time Data only from Kafka

2015-05-19 Thread Cody Koeninger
If those questions aren't answered by https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md please let me know so I can update it. If you set auto.offset.reset to largest, it will start at the largest offset. Any messages before that will be skipped, so if prior runs of the jo
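The reset behaviour described in this message can be pictured with a toy model. The sketch below is plain Python, not the Kafka or Spark API: `starting_offset`, `skipped_messages`, `log_start` and `log_end` are hypothetical names, and it assumes the consumer has no previously stored position for the partition. "smallest" and "largest" are the two values discussed in this thread.

```python
def starting_offset(log_start, log_end, auto_offset_reset="largest"):
    """Return the offset a consumer with no stored position starts from.

    log_start: offset of the oldest message still retained in the partition.
    log_end:   offset that the next produced message will receive.
    """
    if auto_offset_reset == "largest":
        # Start at the end of the log: only messages produced later are read.
        return log_end
    if auto_offset_reset == "smallest":
        # Start at the oldest retained message: the whole backlog is read.
        return log_start
    raise ValueError("unknown auto.offset.reset value: %r" % auto_offset_reset)


def skipped_messages(log_start, log_end, auto_offset_reset="largest"):
    """Count retained messages that the consumer will never read."""
    return starting_offset(log_start, log_end, auto_offset_reset) - log_start
```

With 15 messages retained at offsets 0 to 14, "largest" starts at offset 15 and skips all 15 of them, while "smallest" starts at offset 0 and skips none.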

Re: Reading Real Time Data only from Kafka

2015-05-18 Thread Akhil Das
I have played a bit with the directStream Kafka API. Good work, Cody. These are my findings; can you also clarify a few things for me (see below)? -> When "auto.offset.reset" -> "smallest" and you have 60GB of messages in Kafka, it takes forever as it reads the whole 60GB at once. "largest" will

Re: Reading Real Time Data only from Kafka

2015-05-13 Thread Cody Koeninger
You linked to a google mail tab, not a public archive, so I don't know exactly which conversation you're referring to. As far as I know, streaming only runs a single job at a time in the order they were defined, unless you turn on an experimental option for more parallelism (TD or someone more kno

Re: Reading Real Time Data only from Kafka

2015-05-13 Thread Dibyendu Bhattacharya
Thanks Cody for your email. I think my concern was not about the ordering of messages within a partition, which as you said is possible if one knows how Spark works. The issue is how Spark schedules jobs for every batch, which is not in the same order they were generated. So if that is not guaranteed it

Re: Reading Real Time Data only from Kafka

2015-05-13 Thread Cody Koeninger
As far as I can tell, Dibyendu's "cons" boil down to:
1. Spark checkpoints can't be recovered if you upgrade code
2. Some Spark transformations involve a shuffle, which can repartition data
It's not accurate to imply that either one of those things is inherently a "con" of the direct stream api.

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread Dibyendu Bhattacharya
The low-level consumer which Akhil mentioned has been running at Pearson for the last 4-5 months without any downtime. I think this one is the reliable "Receiver Based" Kafka consumer as of today for Spark, if you put it that way. Prior to Spark 1.3, other receiver-based consumers have used Kafka

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread Akhil Das
Hi Cody, I was just saying that I found more success and higher throughput with the low-level Kafka API prior to KafkaRDDs, which are the future, it seems. My apologies if it came across that way. :)

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread James King
Many thanks both, appreciate the help.

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread Cody Koeninger
Yes, that's what happens by default. If you want to be super accurate about it, you can also specify the exact starting offsets for every topic/partition.
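Specifying an exact starting offset per topic/partition can be sketched as follows. This is plain Python over in-memory lists, not the Spark API: `read_from_offsets`, the "events" topic and the sample messages are hypothetical, and the model assumes each partition's log starts at offset 0. For the real call, check the KafkaUtils.createDirectStream documentation for the variant that accepts starting offsets.

```python
def read_from_offsets(partition_logs, from_offsets):
    """Return what each partition yields when reading starts at an explicit offset.

    partition_logs: dict mapping (topic, partition) -> list of messages,
                    where list index i stands for offset i.
    from_offsets:   dict mapping (topic, partition) -> first offset to read.
    """
    missing = set(partition_logs) - set(from_offsets)
    if missing:
        # Every topic/partition needs its own explicit starting point.
        raise KeyError("no starting offset given for: %s" % sorted(missing))
    for tp, offset in from_offsets.items():
        if offset < 0:
            raise ValueError("negative offset for %r" % (tp,))
    return {tp: log[from_offsets[tp]:] for tp, log in partition_logs.items()}


logs = {
    ("events", 0): ["a0", "a1", "a2", "a3"],
    ("events", 1): ["b0", "b1", "b2"],
}
# Partition 0 resumes at offset 2; partition 1 starts at its end (offset 3).
result = read_from_offsets(logs, {("events", 0): 2, ("events", 1): 3})
print(result)
```

Here partition 0 yields only "a2" and "a3", and partition 1 yields nothing until new messages arrive, which is the per-partition equivalent of starting at the largest offset.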

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread Cody Koeninger
Akhil, I hope I'm misreading the tone of this. If you have personal issues at stake, please take them up outside of the public list. If you have actual factual concerns about the kafka integration, please share them in a jira. Regarding reliability, here's a screenshot of a current production job

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread James King
Thanks Cody. Here are the events:
- Spark app connects to Kafka for the first time and starts consuming
- Messages 1 - 10 arrive at Kafka, then Spark app gets them
- Now driver dies
- Messages 11 - 15 arrive at Kafka
- Spark driver program reconnects
- Then Messages 16 - 20 arrive at Kafka
What I want is th
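The sequence of events above can be replayed with a small simulation. It is a sketch in plain Python, not Kafka or Spark code: `PartitionLog` is a hypothetical single-partition stand-in, and it assumes the restarted driver does not restore any earlier position (for example from a checkpoint) and therefore starts at the largest offset.

```python
class PartitionLog:
    """Toy single-partition log; the list index of a message is its offset."""

    def __init__(self):
        self.messages = []

    def append_range(self, first, last):
        """Append message ids first..last inclusive."""
        self.messages.extend(range(first, last + 1))

    def end_offset(self):
        """Offset that the next appended message will receive."""
        return len(self.messages)

    def read(self, start, end):
        """Return messages at offsets start (inclusive) to end (exclusive)."""
        return self.messages[start:end]


log = PartitionLog()

# First run: the app connects to an empty topic, then messages 1-10 arrive.
position = log.end_offset()  # "largest" on an empty log is offset 0
log.append_range(1, 10)
first_run = log.read(position, log.end_offset())

# The driver dies; messages 11-15 arrive while nothing is consuming.
log.append_range(11, 15)

# The driver reconnects with no stored position, so "largest" places it
# at the current end of the log, past messages 11-15.
position = log.end_offset()
log.append_range(16, 20)
second_run = log.read(position, log.end_offset())

print(first_run)
print(second_run)
```

In this model the first run sees messages 1 to 10, the second run sees only 16 to 20, and 11 to 15 are never read.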

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread Akhil Das
Hi Cody, If you are so sure, can you share a benchmark (one you ran for days, maybe?) that you have done with the Kafka APIs provided by Spark? Thanks, Best Regards

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread Cody Koeninger
I don't think it's accurate for Akhil to claim that the linked library is "much more flexible/reliable" than what's available in Spark at this point. James, what you're describing is the default behavior for the createDirectStream API available as part of Spark since 1.3. The Kafka parameter auto

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread James King
Very nice! Will try and let you know, thanks.

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread Akhil Das
Yep, you can try this low-level Kafka receiver https://github.com/dibbhatt/kafka-spark-consumer. It's much more flexible/reliable than the one that comes with Spark. Thanks, Best Regards

Reading Real Time Data only from Kafka

2015-05-12 Thread James King
If the driver dies for some reason and is restarted, I want to read only messages that arrived in Kafka after the restart of the driver program and its reconnection to Kafka. Has anyone done this? Any links or resources that can help explain this? Regards, jk