Re: [Spark Streaming Kafka 0-10] - What was the reason for adding "spark-executor-" prefix to group id in executor configurations

2019-09-06 Thread Sethupathi T
Gabor, Thanks for the clarification. Thanks On Fri, Sep 6, 2019 at 12:38 AM Gabor Somogyi wrote: > Sethupathi, > > Let me extract the important part of what I've shared: > > 1. "This ensures that each Kafka source has its own consumer group that > does not face interference from any other co

Re: [Spark Streaming Kafka 0-10] - What was the reason for adding "spark-executor-" prefix to group id in executor configurations

2019-09-06 Thread Gabor Somogyi
Sethupathi, Let me extract the important part of what I've shared: 1. "This ensures that each Kafka source has its own consumer group that does not face interference from any other consumer" 2. Consumers may eat the data from each other, offset calculation may give back a wrong result (that's the

Re: [Spark Streaming Kafka 0-10] - What was the reason for adding "spark-executor-" prefix to group id in executor configurations

2019-09-05 Thread Sethupathi T
Gabor, Thanks for the quick response and sharing about spark 3.0, we need to use spark streaming (KafkaUtils.createDirectStream) rather than structured streaming, following this document https://spark.apache.org/docs/2.2.0/streaming-kafka-0-10-integration.html and re-iterating the issue again for bett
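
A minimal sketch of the DStream setup from the integration guide referenced above (Spark 2.x, spark-streaming-kafka-0-10). Broker list, topic and group id are placeholders; the comment on group.id reflects the prefix behaviour this thread is about.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    object DirectStreamSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("direct-stream-sketch"), Seconds(10))

        val kafkaParams = Map[String, Object](
          "bootstrap.servers" -> "broker1:9092",
          "key.deserializer" -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          // The driver-side consumer uses this id as-is; the per-executor consumers
          // prepend "spark-executor-" to it, which is the behaviour discussed here.
          "group.id" -> "my-consumer-group",
          "auto.offset.reset" -> "latest",
          "enable.auto.commit" -> (false: java.lang.Boolean)
        )

        val stream = KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent, Subscribe[String, String](Set("events"), kafkaParams))

        stream.map(_.value).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }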

Re: [Spark Streaming Kafka 0-10] - What was the reason for adding "spark-executor-" prefix to group id in executor configurations

2019-09-05 Thread Gabor Somogyi
Hi, Let me share Spark 3.0 documentation part (Structured Streaming and not DStreams what you've mentioned but still relevant): kafka.group.id string none streaming and batch The Kafka group id to use in Kafka consumer while reading from Kafka. Use this with caution. By default, each query genera
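
A hedged sketch of the Spark 3.0 option quoted above (Structured Streaming); broker, topic and group names are placeholders.

    import org.apache.spark.sql.SparkSession

    object GroupIdSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("kafka-group-id-sketch").getOrCreate()

        val df = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "events")
          // Available from Spark 3.0; overrides the auto-generated group id.
          // Use with caution: concurrent queries sharing a group id can interfere.
          .option("kafka.group.id", "my-fixed-group")
          .load()

        val query = df.selectExpr("CAST(value AS STRING)").writeStream
          .format("console")
          .start()

        query.awaitTermination()
      }
    }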

Re: Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-30 Thread Cody Koeninger
You're using an older version of spark, with what looks like a manually included different version of the kafka-clients jar (1.0) than what that version of the spark connector was written to depend on (0.10.0.1), so there's no telling what's going on. On Wed, Aug 29, 2018 at 3:40 PM, Guillermo Ort

Re: Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-29 Thread Guillermo Ortiz Fernández
I can't... do you think that it's a possible bug in this version? From Spark or Kafka? On Wed, Aug 29, 2018 at 22:28, Cody Koeninger () wrote: > Are you able to try a recent version of spark? > > On Wed, Aug 29, 2018 at 2:10 AM, Guillermo Ortiz Fernández > wrote: > > I'm using Spark Stre

Re: Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-29 Thread Cody Koeninger
Are you able to try a recent version of spark? On Wed, Aug 29, 2018 at 2:10 AM, Guillermo Ortiz Fernández wrote: > I'm using Spark Streaming 2.0.1 with Kafka 0.10, sometimes I get this > exception and Spark dies. > > I couldn't see any error or problem among the machines, anybody has the > reason

Re: Spark streaming + kafka error with json library

2017-03-30 Thread Srikanth
Thanks for the tip. That worked. When would one use the assembly? On Wed, Mar 29, 2017 at 7:13 PM, Tathagata Das wrote: > Try depending on "spark-streaming-kafka-0-10_2.11" (not the assembly) > > On Wed, Mar 29, 2017 at 9:59 AM, Srikanth wrote: > >> Hello, >> >> I'm trying to use "org.json4s" %

Re: Spark streaming + kafka error with json library

2017-03-29 Thread Tathagata Das
Try depending on "spark-streaming-kafka-0-10_2.11" (not the assembly) On Wed, Mar 29, 2017 at 9:59 AM, Srikanth wrote: > Hello, > > I'm trying to use "org.json4s" % "json4s-native" library in a spark > streaming + kafka direct app. > When I use the latest version of the lib I get an error simila
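
A build.sbt sketch of the fix suggested above (depend on the plain connector artifact, not the "-assembly" variant); version numbers are illustrative and should match your Spark build.

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-streaming" % "2.1.0" % "provided",
      // The plain artifact, not "spark-streaming-kafka-0-10-assembly", which bundles
      // transitive dependencies and caused the conflict reported in this thread.
      "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0",
      "org.json4s" %% "json4s-native" % "3.2.11"  // keep in line with the json4s version your Spark ships
    )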

Re: [Spark Streaming+Kafka][How-to]

2017-03-22 Thread Cody Koeninger
Glad you got it worked out. That's cool as long as your use case doesn't actually require e.g. partition 0 to always be scheduled to the same executor across different batches. On Tue, Mar 21, 2017 at 7:35 PM, OUASSAIDI, Sami wrote: > So it worked quite well with a coalesce, I was able to find

Re: [Spark Streaming+Kafka][How-to]

2017-03-21 Thread OUASSAIDI, Sami
So it worked quite well with a coalesce, I was able to find a solution to my problem: although not directly handling the executor, a good workaround was to assign the desired partition to a specific stream through the assign strategy and coalesce to a single partition, then repeat the same process for th

Re: [Spark Streaming+Kafka][How-to]

2017-03-17 Thread Michael Armbrust
Another option that would avoid a shuffle would be to use assign and coalesce, running two separate streams. spark.readStream .format("kafka") .option("kafka.bootstrap.servers", "...") .option("assign", """{t0: {"0": }, t1:{"0": x}}""") .load() .coalesce(1) .writeStream .fore
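
A minimal, hedged sketch of the assign-plus-coalesce idea above. The "assign" option takes JSON mapping a topic to an array of partitions; the broker, topics and the foreach writer body are placeholders (the thread's target was Cassandra).

    import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}

    object AssignCoalesceSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("assign-coalesce-sketch").getOrCreate()

        def singlePartitionStream(assignJson: String) =
          spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")
            .option("assign", assignJson)   // e.g. """{"t0":[0]}""" pins the stream to one topic-partition
            .load()
            .coalesce(1)                    // keep all records of that partition in a single task

        val writer = new ForeachWriter[Row] {
          def open(partitionId: Long, version: Long): Boolean = true
          def process(row: Row): Unit = { /* write to the external store here */ }
          def close(errorOrNull: Throwable): Unit = ()
        }

        // Two independent streams, one per Kafka partition of interest.
        val q0 = singlePartitionStream("""{"t0":[0]}""").writeStream.foreach(writer).start()
        val q1 = singlePartitionStream("""{"t1":[0]}""").writeStream.foreach(writer).start()

        spark.streams.awaitAnyTermination()
      }
    }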

Re: [Spark Streaming+Kafka][How-to]

2017-03-17 Thread OUASSAIDI, Sami
@Cody : Duly noted. @Michael Armbrust : A repartition is out of the question for our project as it would be a fairly expensive operation. We tried looking into targeting a specific executor so as to avoid this extra cost and directly have well partitioned data after consuming the kafka topics. Also

Re: [Spark Streaming+Kafka][How-to]

2017-03-17 Thread Michael Armbrust
Sorry, typo. Should be a repartition not a groupBy. > spark.readStream > .format("kafka") > .option("kafka.bootstrap.servers", "...") > .option("subscribe", "t0,t1") > .load() > .repartition($"partition") > .writeStream > .foreach(... code to write to cassandra ...) >
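
A completed sketch of the corrected snippet above; the console sink stands in for the Cassandra writer, and the broker and topic names are placeholders.

    import org.apache.spark.sql.SparkSession

    object RepartitionSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("repartition-sketch").getOrCreate()
        import spark.implicits._

        val query = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "t0,t1")
          .load()
          .repartition($"partition")   // co-locate all records of a Kafka partition in one task
          .writeStream
          .format("console")           // stand-in for the real sink
          .start()

        query.awaitTermination()
      }
    }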

Re: [Spark Streaming+Kafka][How-to]

2017-03-16 Thread Michael Armbrust
I think it should be straightforward to express this using structured streaming. You could ensure that data from a given partition ID is processed serially by performing a group by on the partition column. spark.readStream .format("kafka") .option("kafka.bootstrap.servers", "...") .option("

Re: [Spark Streaming+Kafka][How-to]

2017-03-16 Thread Cody Koeninger
Spark just really isn't a good fit for trying to pin particular computation to a particular executor, especially if you're relying on that for correctness. On Thu, Mar 16, 2017 at 7:16 AM, OUASSAIDI, Sami wrote: > > Hi all, > > So I need to specify how an executor should consume data from a kafk

Re: spark streaming kafka connector questions

2016-09-16 Thread 毅程
Thanks, That is what I am missing. I have added cache before action, and that 2nd processing is avoided. 2016-09-10 5:10 GMT-07:00 Cody Koeninger : > Hard to say without seeing the code, but if you do multiple actions on an > Rdd without caching, the Rdd will be computed multiple times. > > On Se

Re: spark streaming kafka connector questions

2016-09-10 Thread Cody Koeninger
Hard to say without seeing the code, but if you do multiple actions on an Rdd without caching, the Rdd will be computed multiple times. On Sep 10, 2016 2:43 AM, "Cheng Yi" wrote: After some investigation, the problem I see is likely caused by a filter and union of the dstream. If I just do kafka-
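
A minimal sketch of the caching point above, with a socket source standing in for the Kafka stream and hypothetical output paths.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CacheBeforeActionsSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("cache-sketch"), Seconds(10))

        // Placeholder source; in the thread this is a Kafka direct stream.
        val lines = ssc.socketTextStream("localhost", 9999)

        lines.foreachRDD { (rdd, time) =>
          val processed = rdd.map(_.toUpperCase)   // stand-in for the real per-record processing
          processed.cache()                        // without this, each action below recomputes the RDD
                                                   // and re-fetches the batch from the source

          // Two actions over the same data: only one read thanks to the cache.
          processed.filter(_.contains("ERROR")).saveAsTextFile(s"/tmp/out/errors-${time.milliseconds}")
          processed.saveAsTextFile(s"/tmp/out/all-${time.milliseconds}")

          processed.unpersist()
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }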

Re: spark streaming kafka connector questions

2016-09-10 Thread Cheng Yi
After some investigation, the problem I see is likely caused by a filter and union of the dstream. If I just do kafka-stream -- process -- output operator, then there is no problem, one event will be fetched once. If I do kafka-stream -- process(1) - filter a stream A for later union --|

Re: spark streaming kafka connector questions

2016-09-10 Thread 毅程
Cody, Thanks for the message. 1. As you mentioned, I did find the version for kafka 0.10.1, I will use that, although it has lots of experimental tags. Thank you. 2. I have done a thorough investigation, it is NOT the scenario where the 1st process failed and then the 2nd process was triggered. 3. I do agree the session t

Re: spark streaming kafka connector questions

2016-09-08 Thread Cody Koeninger
- If you're seeing repeated attempts to process the same message, you should be able to look in the UI or logs and see that a task has failed. Figure out why that task failed before chasing other things - You're not using the latest version, the latest version is for spark 2.0. There are two ver

Re: Spark streaming Kafka Direct API + Multiple consumers

2016-07-07 Thread Rabin Banerjee
It's not required. *Simplified Parallelism:* No need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, which will all read data from Kafka in parallel. So there is a one-to-one map

Re: Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread trung kien
Ah right i see. Thank you very much. On May 25, 2016 11:11 AM, "Cody Koeninger" wrote: > There's an overloaded createDirectStream method that takes a map from > topicpartition to offset for the starting point of the stream. > > On Wed, May 25, 2016 at 9:59 AM, trung kien wrote: > > Thank Cody.

Re: Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread Cody Koeninger
There's an overloaded createDirectStream method that takes a map from topicpartition to offset for the starting point of the stream. On Wed, May 25, 2016 at 9:59 AM, trung kien wrote: > Thank Cody. > > I can build the mapping from time ->offset. However how can i pass this > offset to Spark Strea
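
A sketch of the overloaded createDirectStream Cody mentions (spark-streaming-kafka for Kafka 0.8, Spark 1.x); broker, topic and offset values are illustrative and would come from your own time-to-offset index.

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object FromOffsetsSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("from-offsets-sketch"), Seconds(10))
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

        // Starting offsets looked up from your own time -> offset index, one entry per topic-partition.
        val fromOffsets = Map(
          TopicAndPartition("events", 0) -> 123456L,
          TopicAndPartition("events", 1) -> 123789L
        )

        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
          ssc, kafkaParams, fromOffsets,
          (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

        stream.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }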

Re: Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread trung kien
Thanks Cody. I can build the mapping from time -> offset. However, how can I pass this offset to the Spark Streaming job? (using the Direct Approach) On May 25, 2016 9:42 AM, "Cody Koeninger" wrote: > Kafka does not yet have meaningful time indexing, there's a kafka > improvement proposa

Re: Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread Cody Koeninger
Kafka does not yet have meaningful time indexing, there's a kafka improvement proposal for it but it has gotten pushed back to at least 0.10.1 If you want to do this kind of thing, you will need to maintain your own index from time to offset. On Wed, May 25, 2016 at 8:15 AM, trung kien wrote: >

Re: Spark Streaming - Kafka - java.nio.BufferUnderflowException

2016-05-25 Thread Cody Koeninger
I'd fix the kafka version on the executor classpath (should be 0.8.2.1) before trying anything else, even if it may be unrelated to the actual error. Definitely don't upgrade your brokers to 0.9 On Wed, May 25, 2016 at 2:30 AM, Scott W wrote: > I'm running into below error while trying to consum

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Jason Nerothin
Let me be more detailed in my response: Kafka works on “at least once” semantics. Therefore, given your assumption that Kafka "will be operational", we can assume that at least once semantics will hold. At this point, it comes down to designing for consumer (really Spark Executor) resilience.

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Jason Nerothin
At that scale, it’s best not to do coordination at the application layer. How much of your data is transactional in nature {all, some, none}? By which I mean ACID-compliant. > On Apr 19, 2016, at 10:53 AM, Erwan ALLAIN wrote: > > Cody, you're right that was an example. Target architecture wou

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Erwan ALLAIN
Cody, you're right that was an example. Target architecture would be 3 DCs :) Good point on ZK, I'll have to check that. About Spark, both instances will run at the same time but on different topics. It would be quite useless to have 2 DCs working on the same set of data. I just want, in case

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Cody Koeninger
Maybe I'm missing something, but I don't see how you get a quorum in only 2 datacenters (without splitbrain problem, etc). I also don't know how well ZK will work cross-datacenter. As far as the spark side of things goes, if it's idempotent, why not just run both instances all the time. On Tue

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Erwan ALLAIN
I'm describing a disaster recovery but it can be used to take one datacenter offline for an upgrade, for instance. From my point of view, when DC2 crashes: *On Kafka side:* - kafka cluster will lose one or more brokers (partition leader and replica) - lost partition leaders will be reelected in the rem

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Jason Nerothin
Is the main concern uptime or disaster recovery? > On Apr 19, 2016, at 9:12 AM, Cody Koeninger wrote: > > I think the bigger question is what happens to Kafka and your downstream data > store when DC2 crashes. > > From a Spark point of view, starting up a post-crash job in a new data center >

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Cody Koeninger
I think the bigger question is what happens to Kafka and your downstream data store when DC2 crashes. From a Spark point of view, starting up a post-crash job in a new data center isn't really different from starting up a post-crash job in the original data center. On Tue, Apr 19, 2016 at 3:32 A

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Erwan ALLAIN
Thanks Jason and Cody. I'll try to explain the Multi DC case a bit better. As I mentioned before, I'm planning to use one kafka cluster and 2 or more distinct spark clusters. Let's say we have the following DCs configuration in a nominal case. Kafka partitions are consumed uniformly by the 2 data

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-18 Thread Jason Nerothin
Hi Erwan, You might consider InsightEdge: http://insightedge.io . It has the capability of doing WAN between data grids and would save you the work of having to re-invent the wheel. Additionally, RDDs can be shared between developers in the same DC. Thanks, Jason > On

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-18 Thread Cody Koeninger
The current direct stream only handles exactly the partitions specified at startup. You'd have to restart the job if you changed partitions. https://issues.apache.org/jira/browse/SPARK-12177 has the ongoing work towards using the kafka 0.10 consumer, which would allow for dynamic topicpartitions

Re: Spark Streaming + Kafka + scala job message read issue

2016-01-15 Thread vivek.meghanathan

RE: Spark Streaming + Kafka + scala job message read issue

2016-01-05 Thread vivek.meghanathan
Hi Bryan, Yes we are using only 1 thread per topic as we have only one Kafka

Re: Spark Streaming + Kafka + scala job message read issue

2015-12-26 Thread vivek.meghanathan

RE: Spark Streaming + Kafka + scala job message read issue

2015-12-26 Thread Bryan
Hi Brian,PhuDuc, All 8 jobs are consuming 8 different IN topics. 8 different Scala jobs running each topic map mentioned below has only 1 thread

Re: Spark Streaming + Kafka + scala job message read issue

2015-12-25 Thread vivek.meghanathan

RE: Spark Streaming + Kafka + scala job message read issue

2015-12-25 Thread Bryan
Agreed. I did not see that they were using the same group name.

Re: Spark Streaming + Kafka + scala job message read issue

2015-12-25 Thread PhuDuc Nguyen

RE: Spark Streaming + Kafka + scala job message read issue

2015-12-25 Thread Bryan

Re: Spark Streaming + Kafka + scala job message read issue

2015-12-25 Thread vivek.meghanathan
Any help is highly appreciated, I am completely stuck here. We are using the

RE: Spark Streaming + Kafka + scala job message read issue

2015-12-24 Thread vivek.meghanathan

RE: Spark Streaming + Kafka + scala job message read issue

2015-12-24 Thread Bryan
Are you using a direct stream consumer, or the older receiver based consumer? If the latter, does the number of partitions you’ve specified for your topic match the number of partitions in the topic on Kafka? That would be a possible cause – as you might receive all data from a given partition

Re: Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread Adrian Tanase
Good catch! BTW, great choice with ScalaPB, we moved from scalabuff as well, in order to generate the classes at compile time from sbt. On 17 Sep 2015, at 22:00, srungarapu vamsi wrote: @Saisai Shao, Thanks for the pointer. It turned out t

Re: Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread srungarapu vamsi
@Saisai Shao, Thanks for the pointer. It turned out to be the serialization issue. I was using scalabuff to generate my "KafkaGenericEvent" class. But when I went through the generated class code, I figured out that it is not serializable. Now I am generating my classes using scalapb (https://gith

Re: Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread Saisai Shao
Is your "KafkaGenericEvent" serializable? Since you call rdd.collect() to fetch the data to local driver, so this KafkaGenericEvent need to be serialized and deserialized through Java or Kryo (depends on your configuration) serializer, not sure if it is your problem to always get a default object.

Re: Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread srungarapu vamsi
If I understand correctly, I guess you are suggesting I do this: val kafkaDStream = KafkaUtils.createDirectStream[String,Array[Byte],StringDecoder,DefaultDecoder](ssc, kafkaConf, Set(topics)) kafkaDStream.map{ case (devId, byteArray) => (devId, KafkaGenericEvent.parseFrom(byteArray))
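
For reference, a completed version of the sketch above (spark-streaming-kafka for Kafka 0.8, Spark 1.x). KafkaGenericEvent is the thread's generated protobuf class; the broker, group and topic names are placeholders.

    import kafka.serializer.{DefaultDecoder, StringDecoder}
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object ManualDecodeSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("manual-decode-sketch"), Seconds(10))
        val kafkaConf = Map("metadata.broker.list" -> "broker1:9092", "group.id" -> "decoder-sketch")

        // Values arrive as raw Array[Byte] (DefaultDecoder); decode them on the workers with a map.
        val kafkaDStream = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](
          ssc, kafkaConf, Set("device-events"))

        val events = kafkaDStream.map { case (devId, bytes) =>
          // KafkaGenericEvent is the protobuf-generated class from the thread; it must be serializable.
          (devId, KafkaGenericEvent.parseFrom(bytes))
        }

        events.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }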

Re: Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread Adrian Tanase
I guess what I'm asking is why not start with a Byte array like in the example that works (using the DefaultDecoder) then map over it and do the decoding manually like I'm suggesting below. Have you tried this approach? We have the same workflow (kafka => protobuf => custom class) and it works.

Re: Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread srungarapu vamsi
@Adrian, I am doing collect for debugging purposes. But I have to use foreachRDD so that I can operate on top of this rdd and eventually save to DB. But my actual problem here is to properly convert Array[Byte] to my custom object. On Thu, Sep 17, 2015 at 7:04 PM, Adrian Tanase wrote: > Why are

Re: Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread Adrian Tanase
Why are you calling foreachRdd / collect in the first place? Instead of using a custom decoder, you should simply do – this is code executed on the workers and allows the computation to continue. ForeachRdd and collect are output operations and force the data to be collected on the driver (assu

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-30 Thread gaurav sharma
I have run into similar exceptions ERROR DirectKafkaInputDStream: ArrayBuffer(java.net.SocketTimeoutException, org.apache.spark.SparkException: Couldn't find leader offsets for Set([AdServe,1])) and the issue has happened on the Kafka side, where my broker offsets go out of sync, or do not return l

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-30 Thread Umesh Kacha
Hi Cody sorry my bad you were right there was a typo in topicSet. When I corrected typo in topicSet it started working. Thanks a lot. Regards On Thu, Jul 30, 2015 at 7:43 PM, Cody Koeninger wrote: > Can you post the code including the values of kafkaParams and topicSet, > ideally the relevant o

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-30 Thread Cody Koeninger
Can you post the code including the values of kafkaParams and topicSet, ideally the relevant output of kafka-topics.sh --describe as well On Wed, Jul 29, 2015 at 11:39 PM, Umesh Kacha wrote: > Hi thanks for the response. Like I already mentioned in the question kafka > topic is valid and it has

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-29 Thread Umesh Kacha
Hi thanks for the response. Like I already mentioned in the question kafka topic is valid and it has data I can see data in it using another kafka consumer. On Jul 30, 2015 7:31 AM, "Cody Koeninger" wrote: > The last time someone brought this up on the mailing list, the issue > actually was that

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-29 Thread Cody Koeninger
The last time someone brought this up on the mailing list, the issue actually was that the topic(s) didn't exist in Kafka at the time the spark job was running. On Wed, Jul 29, 2015 at 6:17 PM, Tathagata Das wrote: > There is a known issue that Kafka cannot return leader if there is not > da

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-29 Thread Tathagata Das
There is a known issue that Kafka cannot return the leader if there is no data in the topic. I think it was raised in another thread in this forum. Is that the issue? On Wed, Jul 29, 2015 at 10:38 AM, unk1102 wrote: > Hi I have Spark Streaming code which streams from Kafka topic it used to > work >

Re: spark streaming kafka compatibility

2015-07-09 Thread Cody Koeninger
Yes, it should work, let us know if not. On Thu, Jul 9, 2015 at 11:34 AM, Shushant Arora wrote: > Thanks cody, so is it means if old kafka consumer 0.8.1.1 works with > kafka cluster version 0.8.2 then spark streaming 1.3 should also work? > > I have tested standalone consumer kafka consumer 0

Re: spark streaming kafka compatibility

2015-07-09 Thread Shushant Arora
Thanks Cody, so does it mean that if the old kafka consumer 0.8.1.1 works with kafka cluster version 0.8.2, then spark streaming 1.3 should also work? I have tested the standalone kafka consumer 0.8.0 with kafka cluster 0.8.2 and that works. On Thu, Jul 9, 2015 at 9:58 PM, Cody Koeninger wrote: > I

Re: spark streaming kafka compatibility

2015-07-09 Thread Cody Koeninger
It's the consumer version. Should work with 0.8.2 clusters. On Thu, Jul 9, 2015 at 11:10 AM, Shushant Arora wrote: > Does spark streaming 1.3 requires kafka version 0.8.1.1 and is not > compatible with kafka 0.8.2 ? > > As per maven dependency of spark streaming 1.3 with kafka > > > org.apache.

Re: Spark Streaming + Kafka failure recovery

2015-05-21 Thread Bill Jay
Hi Cody, That is clear. Thanks! Bill On Tue, May 19, 2015 at 1:27 PM, Cody Koeninger wrote: > If you checkpoint, the job will start from the successfully consumed > offsets. If you don't checkpoint, by default it will start from the > highest available offset, and you will potentially lose da

Re: Spark Streaming + Kafka failure recovery

2015-05-19 Thread Cody Koeninger
If you checkpoint, the job will start from the successfully consumed offsets. If you don't checkpoint, by default it will start from the highest available offset, and you will potentially lose data. Is the link I posted, or for that matter the scaladoc, really not clear on that point? The scalad

Re: Spark Streaming + Kafka failure recovery

2015-05-19 Thread Bill Jay
If a Spark streaming job stops at 12:01 and I resume the job at 12:02. Will it still start to consume the data that were produced to Kafka at 12:01? Or it will just start consuming from the current time? On Tue, May 19, 2015 at 10:58 AM, Cody Koeninger wrote: > Have you read > https://github.co

Re: Spark Streaming + Kafka failure recovery

2015-05-19 Thread Cody Koeninger
Have you read https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md ? 1. There's nothing preventing that. 2. Checkpointing will give you at-least-once semantics, provided you have sufficient kafka retention. Be aware that checkpoints aren't recoverable if you upgrade code. On
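
A minimal sketch of driver checkpointing as described above; the checkpoint path and batch interval are placeholders.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CheckpointRecoverySketch {
      val checkpointDir = "hdfs:///checkpoints/my-streaming-app"

      def createContext(): StreamingContext = {
        val ssc = new StreamingContext(new SparkConf().setAppName("checkpoint-sketch"), Seconds(10))
        ssc.checkpoint(checkpointDir)
        // Build the Kafka direct stream and the rest of the DStream graph here;
        // on restart the stream resumes from the checkpointed offsets.
        ssc
      }

      def main(args: Array[String]): Unit = {
        // Recreates the context from the checkpoint if one exists, otherwise builds it fresh.
        // Per the thread: checkpoints are not recoverable across application code upgrades.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }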

Re: Spark Streaming Kafka Avro NPE on deserialization of payload

2015-05-02 Thread Akhil Das
There was a similar discussion over here http://mail-archives.us.apache.org/mod_mbox/spark-user/201411.mbox/%3ccakz4c0s_cuo90q2jxudvx9wc4fwu033kx3-fjujytxxhr7p...@mail.gmail.com%3E Thanks Best Regards On Fri, May 1, 2015 at 7:12 PM, Todd Nist wrote: > *Resending as I do not see that this made i

Re: spark-streaming-kafka with broadcast variable

2014-09-05 Thread Tathagata Das
I am not sure if there is a good, clean way to do that - broadcast variables are not designed to be used outside spark job closures. You could try a bit of hacky stuff where you write the serialized variable to a file in HDFS / NFS / a distributed file system, and then use a custom decoder class th

Re: [Spark Streaming] kafka consumer announce

2014-08-29 Thread Evgeniy Shishkin
TD, can you please comment on this code? I am really interested in including this code in Spark. But I am concerned about some points about persistence: 1. When we extend Receiver and call store, is it a blocking call? Does it return only when spark stores the rdd as requested (i.e. replicated or on

Re: [Spark Streaming] kafka consumer announce

2014-08-21 Thread Evgeniy Shishkin
>> On 21 Aug 2014, at 20:25, Tim Smith wrote: >> >> Thanks. Discovering kafka metadata from zookeeper instead of brokers >> is nicer. Saving metadata and offsets to HBase, is that optional or >> mandatory? >> Can it be made optional (default to zookeeper)? >> For now we implemented and somewhat

RE: [spark-streaming] kafka source and flow control

2014-08-12 Thread Gwenhael Pasquiers
issue with file (hdfs) inputs? How can I be sure the input won’t “overflow” the process chain?

Re: [spark-streaming] kafka source and flow control

2014-08-11 Thread Xuri Nagarin
In general, (and I am prototyping), I have a better idea :) - Consume kafka in Spark from topic-A - transform data in Spark (normalize, enrich etc etc) - Feed it back to Kafka (in a different topic-B) - Have flume->HDFS (for M/R, Impala, Spark batch) or Spark-streaming or any other compute framewor

Re: [spark-streaming] kafka source and flow control

2014-08-11 Thread Tobias Pfeiffer
Hi, On Mon, Aug 11, 2014 at 9:41 PM, Gwenhael Pasquiers < gwenhael.pasqui...@ericsson.com> wrote: > > We intend to apply other operations on the data later in the same spark > context, but our first step is to archive it. > > > > Our goal is somth like this > > Step 1 : consume kafka > Step 2 : ar

RE: [spark-streaming] kafka source and flow control

2014-08-11 Thread Gwenhael Pasquiers
Hi, On Mon, Aug 11, 2014 at 6:19 PM, gpasquiers wrote: I’m using spark-streaming in a cloudera environment to consume a kafka source and store all data into hdfs. I assume you are doing som

RE: [spark-streaming] kafka source and flow control

2014-08-11 Thread Gwenhael Pasquiers

Re: spark streaming kafka

2014-08-04 Thread Tathagata Das
1. Does your cluster have access to the machines that run kafka? 2. Is there any error in logs? If so can you please post them? TD On Mon, Aug 4, 2014 at 1:12 PM, salemi wrote: > Hi, > > I have the following driver and it works when I run it in the local[*] mode > but if I execute it in a stand

Re: Spark-streaming-kafka error

2014-07-08 Thread Bill Jay
Hi Tobias, Currently, I do not bundle any dependencies into my application jar. I will try that. Thanks a lot! Bill On Tue, Jul 8, 2014 at 5:22 PM, Tobias Pfeiffer wrote: > Bill, > > have you packaged "org.apache.spark" % "spark-streaming-kafka_2.10" % > "1.0.0" into your application jar? I

Re: Spark-streaming-kafka error

2014-07-08 Thread Tobias Pfeiffer
Bill, have you packaged "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.0.0" into your application jar? If I remember correctly, it's not bundled with the downloadable compiled version of Spark. Tobias On Wed, Jul 9, 2014 at 8:18 AM, Bill Jay wrote: > Hi all, > > I used sbt to package
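
A build.sbt sketch of Tobias's point (Spark 1.0 era): the Kafka connector must ship inside the application's fat jar because the Spark distribution does not bundle it. Versions are illustrative.

    libraryDependencies ++= Seq(
      "org.apache.spark" % "spark-core_2.10"            % "1.0.0" % "provided",
      "org.apache.spark" % "spark-streaming_2.10"       % "1.0.0" % "provided",
      "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.0.0"          // not provided: goes into the fat jar
    )
    // With the sbt-assembly plugin, `sbt assembly` then produces a jar that includes
    // the connector and its Kafka client dependencies alongside your application code.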

Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-24 Thread Andrew Or
Hi all, The short answer is that standalone-cluster mode through spark-submit is broken (and in fact not officially supported). Please use standalone-client mode instead. The long answer is provided here: http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3cCAMJOb8m6gF9B3W=p12hi88mex

Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-19 Thread lannyripple
Gino, I can confirm that your solution of assembling with spark-streaming-kafka but excluding spark-core and spark-streaming has me working with basic spark-submit. As mentioned you must specify the assembly jar in the SparkConfig as well as to spark-submit. When I see the error you are now expe

Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-17 Thread Gino Bustelo
Luis' experience validates what I'm seeing. You have to still set the properties in the SparkConf for the context to work. For example, master URL and jars are specified again in the app. Gino B. > On Jun 17, 2014, at 12:05 PM, Luis Ángel Vicente Sánchez > wrote: > > I have been able to sub

Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-17 Thread Luis Ángel Vicente Sánchez
I have been able to submit a job successfully but I had to config my spark job this way: val sparkConf: SparkConf = new SparkConf() .setAppName("TwitterPopularTags") .setMaster("spark://int-spark-master:7077") .setSparkHome("/opt/spark") .setJars(Seq("/tmp/spark-test-

Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-17 Thread Luis Ángel Vicente Sánchez
After playing a bit, I have been able to create a fatjar this way: lazy val rootDependencies = Seq( "org.apache.spark" %% "spark-core" % "1.0.0" % "provided", "org.apache.spark" %% "spark-streaming" % "1.0.0" % "provided", "org.apache.spark" %% "spark-streaming-twitter"

Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-17 Thread Michael Cutler
Admittedly getting Spark Streaming / Kafka working for the first time can be a bit tricky with the web of dependencies that get pulled in. I've taken the KafkaWorkCount example from the Spark project and set up a simple standalone SBT project that shows you how to get it working and using spark-su

Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-16 Thread Gino Bustelo
+1 for this issue. Documentation for spark-submit is misleading. Among many issues, the jar support is bad. HTTP urls do not work. This is because spark is using hadoop's FileSystem class. You have to specify the jars twice to get things to work. Once for the DriverWrapper to load your classes

Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-16 Thread Luis Ángel Vicente Sánchez
Did you manage to make it work? I'm facing similar problems and this is a serious blocker issue. spark-submit seems kind of broken to me if you can't use it for spark-streaming. Regards, Luis 2014-06-11 1:48 GMT+01:00 lannyripple : > I am using Spark 1.0.0 compiled with Hadoop 1.2.1. > > I have a t

Re: spark streaming kafka output

2014-05-05 Thread Tathagata Das
There is no built-in code in Spark Streaming to output to Kafka yet. However, I have heard people have used Twitter's Storehaus with foreachRDD, and Storehaus has a kafka output. Something that you might look into. TD On Sun, May 4, 2014 at 11:45 PM, Weide Zhang wrote: > Hi , > > Is there any cod
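
Since there is no built-in Kafka sink here, a hand-rolled sketch of the foreachRDD route, using the later kafka-clients producer API (which postdates this thread); broker and topic names are placeholders. The singleton keeps one producer per executor JVM, reused across batches, which avoids the per-batch producer instantiation raised later in this thread.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.spark.streaming.dstream.DStream

    // Lazily created, executor-local producer.
    object KafkaProducerSingleton {
      private var producer: KafkaProducer[String, String] = null

      def get: KafkaProducer[String, String] = synchronized {
        if (producer == null) {
          val props = new Properties()
          props.put("bootstrap.servers", "broker1:9092")   // placeholder broker list
          props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
          props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
          producer = new KafkaProducer[String, String](props)
        }
        producer
      }
    }

    object KafkaSinkSketch {
      def writeToKafka(stream: DStream[String], topic: String): Unit =
        stream.foreachRDD { rdd =>
          rdd.foreachPartition { records =>
            val producer = KafkaProducerSingleton.get   // reused across batches on the same executor
            records.foreach(r => producer.send(new ProducerRecord[String, String](topic, r)))
          }
        }
    }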

Re: Spark streaming kafka _output_

2014-04-02 Thread Benjamin Black
please no. On Wed, Apr 2, 2014 at 9:47 PM, Tathagata Das wrote: > If any body is interested is doing this, may I suggested taking a look at > Twitter's Storehaus. It presents an abstract interface for pushing data to > many different backends, including Kafka, mongodb, hbase, etc. Integrating >

Re: Spark streaming kafka _output_

2014-04-02 Thread Tathagata Das
If anybody is interested in doing this, may I suggest taking a look at Twitter's Storehaus. It presents an abstract interface for pushing data to many different backends, including Kafka, mongodb, hbase, etc. Integrating DStream.foreachRDD with Storehaus may be a very useful thing to do. So

Re: Spark streaming kafka _output_

2014-04-02 Thread Soren Macbeth
Anybody? Seems like a reasonable thing to be able to do no? On Fri, Mar 21, 2014 at 3:58 PM, Benjamin Black wrote: > Howdy, folks! > > Anybody out there having a working kafka _output_ for Spark streaming? > Perhaps one that doesn't involve instantiating a new producer for every > batch? > > Th

Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-28 Thread Tathagata Das
The cleaner ttl was introduced as a "brute force" method to clean all old data and metadata in the system, so that the system can run 24/7. The cleaner ttl should be set to a large value, so that RDDs older than that are not used. Though there are some cases where you may want to use an RDD again a

Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-27 Thread Evgeny Shishkin
On 28 Mar 2014, at 01:44, Tathagata Das wrote: > The more I think about it the problem is not about /tmp, its more about the > workers not having enough memory. Blocks of received data could be falling > out of memory before it is getting processed. > BTW, what is the storage level that you a

Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-27 Thread Tathagata Das
The more I think about it the problem is not about /tmp, its more about the workers not having enough memory. Blocks of received data could be falling out of memory before it is getting processed. BTW, what is the storage level that you are using for your input stream? If you are using MEMORY_ONLY,

Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-27 Thread Scott Clasen
Heh sorry that wasn't a clear question, I know 'how' to set it but don't know what value to use in a mesos cluster, since the processes are running in lxc containers they won't be sharing a filesystem (or machine for that matter). I can't use an s3n:// url for local dir, can I?

Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-27 Thread Tathagata Das
spark.local.dir should be specified in the same way as other configuration parameters. On Thu, Mar 27, 2014 at 10:32 AM, Scott Clasen wrote: > I think now that this is because spark.local.dir is defaulting to /tmp, and > since the tasks a
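
A sketch of what "specified in the same way as other configuration parameters" looks like in code; the path is a placeholder and must be a local filesystem directory available on each worker/container (not an s3n:// or other remote URL).

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object LocalDirSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("streaming-app")
          // Scratch space for shuffle and spill files on each worker.
          .set("spark.local.dir", "/var/lib/spark/local")

        val ssc = new StreamingContext(conf, Seconds(10))
        // ... build the streaming job as usual ...
      }
    }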

Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-27 Thread Scott Clasen
I think now that this is because spark.local.dir is defaulting to /tmp, and since the tasks are not running on the same machine, the file is not found when the second task takes over. How do you set spark.local.dir appropriately when running on mesos?

Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-26 Thread Scott Clasen
The web-ui shows 3 executors, the driver and one spark task on each worker. I do see that there were 8 successful tasks and the ninth failed like so... java.lang.Exception (java.lang.Exception: Could not compute split, block input-0-1395860790200 not found) org.apache.spark.rdd.BlockRDD.compute(B
