Re: Kafka - streaming from multiple topics

2015-12-19 Thread Neelesh
A related issue - When I put multiple topics in a single stream, the processing delay is as bad as that of the slowest of the tasks created. Even though the topics are unrelated to each other, the RDD at time "t1" has to wait for the RDD at "t0" to be fully executed, even if most cores are idling.

TaskCompletionListener and Exceptions

2015-12-19 Thread Neelesh
Thanks! -neelesh

Re: Kafka - streaming from multiple topics

2015-12-20 Thread Neelesh
will maintain the > ordering guarantees offered by Kafka. > > if this is true, then I'd suggest @neelesh create more partitions within > the Kafka Topic to improve parallelism - same as any distributed, > partitioned data processing engine including spark. > > if this is not

Re: Kafka - streaming from multiple topics

2015-12-21 Thread Neelesh
it. If you're having the second > problem, use different spark jobs for different topics. > > On Sun, Dec 20, 2015 at 2:28 PM, Neelesh wrote: > >> @Chris, >> There is a 1-1 mapping b/w spark partitions & kafka partitions out of >> the box. One can break
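
Not from the thread itself, but a minimal sketch of the one-stream-per-topic idea (Spark 1.3+ direct API; broker address and topic names are made up). Within a single application the streaming scheduler still runs one output job at a time unless spark.streaming.concurrentJobs is raised (see the later thread in this digest), so fully independent topics may be better served by separate applications, as suggested above:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object PerTopicStreams {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("per-topic"), Seconds(5))
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

        // One independent direct stream per topic, so a slow topic does not
        // share an RDD (or a set of tasks) with the fast ones.
        Seq("orders", "clicks").foreach { topic =>
          val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
            ssc, kafkaParams, Set(topic))
          stream.foreachRDD { rdd =>
            println(s"$topic batch size: ${rdd.count()}")  // topic-specific processing goes here
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }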

Re: TaskCompletionListener and Exceptions

2015-12-21 Thread Neelesh
speak to the possibility of getting task > failures added to listener callbacks. > > On Sat, Dec 19, 2015 at 5:44 PM, Neelesh wrote: > >> Hi, >> I'm trying to build automatic Kafka watermark handling in my stream >> apps by overriding the KafkaRDDIterator, and ad
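
Not code from the thread, just a sketch of the listener registration being described, assuming it runs inside a partition of a Kafka direct stream; commitWatermark and process are hypothetical helpers. In 1.x the listener fires when the task completes whether or not it succeeded, but it is not told about the failure - which is what the JIRA in the follow-up message below asks to change:

    import org.apache.spark.TaskContext

    rdd.foreachPartition { records =>
      val ctx = TaskContext.get
      // Runs when this task finishes; as of Spark 1.x it cannot distinguish
      // a successful task from a failed one.
      ctx.addTaskCompletionListener { _ =>
        commitWatermark(ctx.partitionId)   // hypothetical offset/watermark bookkeeping
      }
      records.foreach(process)             // hypothetical per-record work
    }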

Re: TaskCompletionListener and Exceptions

2015-12-21 Thread Neelesh
I also created a JIRA for task failures https://issues.apache.org/jira/browse/SPARK-12452 On Mon, Dec 21, 2015 at 9:54 AM, Neelesh wrote: > I am leaning towards something like that. Things get interesting when > multiple different transformations and regrouping happen. At the end of it

Re: Spark Streaming: Is it possible to schedule multiple active batches?

2016-02-20 Thread Neelesh
spark.streaming.concurrentJobs may help. It's experimental, according to TD in an older thread here: http://stackoverflow.com/questions/23528006/how-jobs-are-assigned-to-executors-in-spark-streaming On Sat, Feb 20, 2016 at 11:24 AM, Jorge Rodriguez wrote: > > Is it possible to have the scheduler sche
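
A minimal sketch of that setting (it is undocumented and experimental, so semantics may change between versions); the value bounds how many streaming output jobs may run at once:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("concurrent-batches")             // hypothetical app name
      .set("spark.streaming.concurrentJobs", "2")   // default is 1
    val ssc = new StreamingContext(conf, Seconds(10))

With more than one concurrent job, batches can complete out of order, so this sits poorly with any logic that assumes strict batch ordering.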

Kafka & Spark Streaming

2015-09-25 Thread Neelesh
listen on taskCompletedEvent on the driver and even figure out that there was an error, there is no way of mapping this task back to the partition and retrieving the offset range, topic & kafka partition # etc. Any pointers appreciated! Thanks! -neelesh

Re: Kafka & Spark Streaming

2015-09-25 Thread Neelesh
k inside the task execution code, in cases where the intermediate operations do not change partitions, shuffle etc. -neelesh On Fri, Sep 25, 2015 at 11:14 AM, Cody Koeninger wrote: > > http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-re

Re: Kafka & Spark Streaming

2015-09-25 Thread Neelesh
ly kill the job. > > On Fri, Sep 25, 2015 at 1:55 PM, Neelesh wrote: > >> Thanks Petr, Cody. This is a reasonable place to start for me. What I'm >> trying to achieve: >> >> stream.foreachRDD { rdd => >> rdd.foreachPartition { p => >>
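
For context, a sketch of how that skeleton is typically completed with the direct stream's offset ranges (the pattern from the integration guide linked earlier in the thread); stream is a direct Kafka stream as in the earlier sketch, and saveToStore / commitOffset are hypothetical placeholders:

    import org.apache.spark.TaskContext
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    stream.foreachRDD { rdd =>
      // offsetRanges is only exposed by the original KafkaRDD, so capture it
      // on the driver before any transformation changes the partitioning.
      val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      rdd.foreachPartition { partition =>
        val range = offsetRanges(TaskContext.get.partitionId)
        partition.foreach(record => saveToStore(record))               // hypothetical sink
        commitOffset(range.topic, range.partition, range.untilOffset)  // hypothetical bookkeeping
      }
    }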

Re: kafka direct streaming with checkpointing

2015-09-25 Thread Neelesh
As Cody says, to achieve true exactly-once, the bookkeeping has to happen in the sink data system, and that too assuming it's a transactional store. Wherever possible, we try to make the application idempotent (upsert in HBase, ignore-on-duplicate for MySQL, etc.), but there are still cases (analytics, c
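
A minimal sketch of the "ignore-on-duplicate for MySQL" idea, assuming an events table with a unique key on event_id so a replayed batch rewrites the same rows instead of duplicating them (connection string, table, and columns are made up):

    import java.sql.DriverManager

    def upsertEvents(events: Seq[(String, Long)]): Unit = {
      val conn = DriverManager.getConnection("jdbc:mysql://db:3306/app", "user", "pass")
      try {
        val stmt = conn.prepareStatement(
          "INSERT INTO events (event_id, amount) VALUES (?, ?) " +
          "ON DUPLICATE KEY UPDATE amount = VALUES(amount)")
        for ((id, amount) <- events) {
          stmt.setString(1, id)
          stmt.setLong(2, amount)
          stmt.addBatch()
        }
        stmt.executeBatch()   // replaying the same records leaves the table unchanged
      } finally {
        conn.close()
      }
    }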

Re: Kafka & Spark Streaming

2015-09-25 Thread Neelesh
wrote: > Yes, the partition IDs are the same. > > As far as the failure / subclassing goes, you may want to keep an eye on > https://issues.apache.org/jira/browse/SPARK-10320 , not sure if the > suggestions in there will end up going anywhere. > > On Fri, Sep 25, 2015 at 3:01 PM, Neelesh

Re: Error when Spark streaming consumes from Kafka

2015-02-02 Thread Neelesh
We're planning to use this as well (Dibyendu's https://github.com/dibbhatt/kafka-spark-consumer ). Dibyendu, thanks for the efforts. So far it's working nicely. I think there is merit in making it the default Kafka Receiver for spark streaming. -neelesh On Mon, Feb 2, 2015 at 5:25 PM

Spark Streaming and message ordering

2015-02-18 Thread Neelesh
There does not seem to be a definitive answer on this. Every time I google for message ordering, the only relevant thing that comes up is this - http://samza.apache.org/learn/documentation/0.8/comparisons/spark-streaming.html . With a kafka receiver that pulls data from a single kafka partition of

Re: Spark Streaming and message ordering

2015-02-19 Thread Neelesh
rate limiting right now is static and cannot adapt to the state of the cluster. thnx -neelesh On Wed, Feb 18, 2015 at 4:13 PM, jay vyas wrote: > This is a *fantastic* question. The idea of how we identify individual > things in multiple DStreams is worth looking at. > > The reaso

Re: Spark Streaming and message ordering

2015-02-19 Thread Neelesh
partitions and kafka > partitions. So you will get deterministic ordering, but only on a > per-partition basis. > > On Thu, Feb 19, 2015 at 11:31 PM, Neelesh wrote: > >> I had a chance to talk to TD today at the Strata+Hadoop Conf in San Jose. >> We talked a bit about thi

Re: Spark Streaming and message ordering

2015-02-20 Thread Neelesh
need to as well, for unrelated reasons). Maybe you > can say a little more about your use case. > > But regardless of the technology you're using to read from kafka (spark, > storm, whatever), kafka only gives you ordering as to a particular > partition. So you're going to ne

Re: Spark Streaming and message ordering

2015-02-20 Thread Neelesh
ng individual inserts. >> >> Also be aware that output actions aren't guaranteed to happen exactly >> once, so you'll need to store unique offset ids in mysql or otherwise deal >> with the possibility of executor failures. >> >> >> On Fri, Feb 20, 2
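
A sketch of the "store unique offset ids in mysql" suggestion: write a partition's results and its offset range in one transaction, with a unique key on (topic, kafka_partition, until_offset) so a replayed task rolls back instead of double-writing (schema and names are hypothetical):

    import java.sql.DriverManager
    import org.apache.spark.streaming.kafka.OffsetRange

    def writePartition(range: OffsetRange, rows: Iterator[(String, Long)]): Unit = {
      val conn = DriverManager.getConnection("jdbc:mysql://db:3306/app", "user", "pass")
      conn.setAutoCommit(false)
      try {
        val insert = conn.prepareStatement("INSERT INTO results (event_id, amount) VALUES (?, ?)")
        rows.foreach { case (id, amount) =>
          insert.setString(1, id); insert.setLong(2, amount); insert.addBatch()
        }
        insert.executeBatch()

        // Unique key on (topic, kafka_partition, until_offset): a replayed offset
        // range violates the constraint and the whole transaction rolls back.
        val offsets = conn.prepareStatement(
          "INSERT INTO processed_offsets (topic, kafka_partition, until_offset) VALUES (?, ?, ?)")
        offsets.setString(1, range.topic)
        offsets.setInt(2, range.partition)
        offsets.setLong(3, range.untilOffset)
        offsets.executeUpdate()

        conn.commit()
      } catch {
        case e: Exception => conn.rollback(); throw e
      } finally {
        conn.close()
      }
    }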

Untangling dependency issues in spark streaming

2015-03-29 Thread Neelesh
Hi, My streaming app uses org.apache.httpcomponents:httpclient:4.3.6, but spark uses 4.2.6, and I believe that's what's causing the following error. I've tried setting spark.executor.userClassPathFirst & spark.driver.userClassPathFirst to true in the config, but that does not solve it either. Fina
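
Not from this thread, but a common workaround when userClassPathFirst does not help is to shade the conflicting package into a different namespace in the application assembly; a sketch assuming sbt with the sbt-assembly plugin (the shaded prefix is arbitrary):

    // build.sbt (requires sbt-assembly in project/plugins.sbt)
    libraryDependencies += "org.apache.httpcomponents" % "httpclient" % "4.3.6"

    assemblyShadeRules in assembly := Seq(
      // Relocate httpclient classes so the app's 4.3.6 copy cannot collide
      // with the 4.2.x classes already on Spark's classpath.
      ShadeRule.rename("org.apache.http.**" -> "shaded.org.apache.http.@1").inAll
    )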

Spark Streaming 1.3 & Kafka Direct Streams

2015-04-01 Thread Neelesh
does it run on workers? Any help appreciated. Thanks! -neelesh

Re: Latest enhancement in Low Level Receiver based Kafka Consumer

2015-04-01 Thread Neelesh
direct streams work yet. Another thread - Kafka 0.8.2 supports non-ZK offset management, which I think is more scalable than bombarding ZK. I'm working on supporting the new offset management strategy for Kafka with kafka-spark-consumer. Thanks! -neelesh On Wed, Apr 1, 2015 at 9:49 AM

Re: Spark Streaming 1.3 & Kafka Direct Streams

2015-04-01 Thread Neelesh
Thanks again! On Wed, Apr 1, 2015 at 10:01 AM, Cody Koeninger wrote: > https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md > > The kafka consumers run in the executors. > > On Wed, Apr 1, 2015 at 11:18 AM, Neelesh wrote: > >> With receivers, it was p

Re: Spark Streaming 1.3 & Kafka Direct Streams

2015-04-01 Thread Neelesh
private so you can't subclass it without > building your own spark. > > On Wed, Apr 1, 2015 at 1:09 PM, Neelesh wrote: > >> Thanks Cody, that was really helpful. I have a much better understanding >> now. One last question - Kafka topics are initialized once in the driver, >

Re: Spark Streaming 1.3 & Kafka Direct Streams

2015-04-01 Thread Neelesh
This gets called every >> batch interval before the offsets are decided. This would allow users to >> add topics, delete topics, modify partitions on the fly. >> >> What do you think Cody? >> >> On Wed, Apr 1, 2015 at 11:57 AM, Neelesh wrote

Re: Spark Streaming 1.3 & Kafka Direct Streams

2015-04-06 Thread Neelesh
Somewhat agree on subclassing and its issues. It looks like the alternative in spark 1.3.0 is to create a custom build. Is there an enhancement filed for this? If not, I'll file one. Thanks! -neelesh On Wed, Apr 1, 2015 at 12:46 PM, Tathagata Das wrote: > The challenge of opening

[Spark 3.0.0] Job fails with NPE - worked in Spark 2.4.4

2020-07-23 Thread Neelesh Salian
fails with the above error. Any advice on how I can go about debugging/solving this? -- Regards, Neelesh S. Salian

Re: Steps to Run Spark Scala job from Oozie on EC2 Hadoop clsuter

2016-03-07 Thread Neelesh Salian
question on how to run a Spark job written in Scala through Oozie. >> >> Would really appreciate the help. >> >> Thanks, >> Divya >> > -- > Thanks > Deepak > www.bigdatabig.com > www.keosha.net > -- Neelesh Srinivas Salian Customer Operations Engineer

Re: Spark books

2017-05-03 Thread Neelesh Salian
> machine learning etc. or wait for a new edition covering the new concepts >>> like dataframe and datasets. Anyone got any suggestions? >>> >> >> > > > -- > Best Regards, > Ayan Guha > -- Regards, Neelesh S. Salian

Re: checkpointing without streaming?

2017-05-18 Thread Neelesh Sambhajiche
nabble.com/checkpointing-without-streaming-tp4541p28691.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. > -- Regards, Neelesh Sambhajiche Mobile: 8058437181 Birla Institute of Technology & Science, Pilani Campus, Rajasthan 333 031, INDIA

Re: Spark Performance on Yarn

2015-04-22 Thread Neelesh Salian
Does it still hit the memory limit for the container? An expensive transformation? On Wed, Apr 22, 2015 at 8:45 AM, Ted Yu wrote: > In master branch, overhead is now 10%. > That would be 500 MB > > FYI > > > > > On Apr 22, 2015, at 8:26 AM, nsalian wrote: > > > > +1 to executor-memory to 5g. >
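
For reference, a sketch of the knobs under discussion, using 1.x-era YARN config names and illustrative values; the container YARN allocates is roughly executor memory plus the overhead, so a transformation that pushes past that total gets the container killed:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "5g")
      // Off-heap/container overhead on YARN, in MB; per the quote above the newer
      // default is 10% of executor memory (min 384 MB), i.e. roughly 500 MB for 5g.
      .set("spark.yarn.executor.memoryOverhead", "768")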