A related issue: when I put multiple topics in a single stream, the
processing delay is as bad as the slowest task among the tasks created.
Even though the topics are unrelated to each other, the RDD at time
"t1" has to wait until the RDD at "t0" is fully executed, even if most
cores are idling.
Thanks!
-neelesh
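One remedy suggested later in this thread is to give each topic its own
stream so that a slow topic cannot hold up the others. A minimal sketch
against the Spark 1.3 direct API, assuming an existing StreamingContext
ssc and hypothetical topic names:

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.KafkaUtils

  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
  // One direct stream per topic; each produces its own jobs.
  val orders = KafkaUtils.createDirectStream[String, String, StringDecoder,
    StringDecoder](ssc, kafkaParams, Set("orders"))
  val clicks = KafkaUtils.createDirectStream[String, String, StringDecoder,
    StringDecoder](ssc, kafkaParams, Set("clicks"))
  // With spark.streaming.concurrentJobs > 1 the per-topic jobs can be
  // scheduled independently (see the config sketch further down).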
l maintain the
> ordering guarantees offered by Kafka.
>
> if this is true, then I'd suggest @neelesh create more partitions within
> the Kafka Topic to improve parallelism - same as any distributed,
> partitioned data processing engine including spark.
>
> if this is not
it. If you're having the second
> problem, use different spark jobs for different topics.
>
> On Sun, Dec 20, 2015 at 2:28 PM, Neelesh wrote:
>
>> @Chris,
>> There is a 1-1 mapping between Spark partitions & Kafka partitions out of
>> the box. One can break
ak to the possibility of getting task
> failures added to listener callbacks.
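For context, a driver-side listener that at least surfaces task failures
might look like the sketch below; mapping such a failure back to a Kafka
offset range is the part this thread finds missing:

  import org.apache.spark.TaskFailedReason
  import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

  class TaskFailureListener extends SparkListener {
    override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
      taskEnd.reason match {
        case failure: TaskFailedReason =>
          // taskInfo.index is the task's index within its stage
          println(s"Task ${taskEnd.taskInfo.index} failed: " +
            failure.toErrorString)
        case _ => // task succeeded
      }
  }

  sc.addSparkListener(new TaskFailureListener) // sc: existing SparkContext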
>
> On Sat, Dec 19, 2015 at 5:44 PM, Neelesh wrote:
>
>> Hi,
>> I'm trying to build automatic Kafka watermark handling in my stream
>> apps by overriding the KafkaRDDIterator, and ad
I also created a JIRA for task failures
https://issues.apache.org/jira/browse/SPARK-12452
On Mon, Dec 21, 2015 at 9:54 AM, Neelesh wrote:
> I am leaning towards something like that. Things get interesting when
> multiple different transformations and regrouping happen. At the end of it
spark.streaming.concurrentJobs may help. It is experimental, according to TD
in an older thread here:
http://stackoverflow.com/questions/23528006/how-jobs-are-assigned-to-executors-in-spark-streaming
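For reference, the flag is an ordinary Spark property; a minimal sketch,
with the value purely illustrative:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("multi-topic-streams")
    // Experimental: run up to 4 streaming jobs concurrently (default 1).
    .set("spark.streaming.concurrentJobs", "4")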
On Sat, Feb 20, 2016 at 11:24 AM, Jorge Rodriguez
wrote:
>
> Is it possible to have the scheduler sche
en on
taskCompletedEvent on the driver and even figure out that there was an
error, there is no way of mapping this task back to the partition and
retrieving the offset range, topic & Kafka partition number, etc.
Any pointers appreciated!
Thanks!
-neelesh
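On the executor side, the documented direct-stream pattern does recover
topic, partition, and offsets inside the task itself, even though it does
not answer the driver-side listener question above. A sketch, assuming
stream is a direct Kafka stream:

  import org.apache.spark.TaskContext
  import org.apache.spark.streaming.kafka.HasOffsetRanges

  stream.foreachRDD { rdd =>
    // Capture offset ranges before any transformation changes partitioning.
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd.foreachPartition { iter =>
      val osr = offsetRanges(TaskContext.get.partitionId)
      // osr.topic, osr.partition, osr.fromOffset, osr.untilOffset
    }
  }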
k
inside the task execution code, in cases where the intermediate operations
do not change partitions, shuffle, etc.
-neelesh
On Fri, Sep 25, 2015 at 11:14 AM, Cody Koeninger wrote:
>
> http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-re
ly kill the job.
>
> On Fri, Sep 25, 2015 at 1:55 PM, Neelesh wrote:
>
>> Thanks Petr, Cody. This is a reasonable place to start for me. What I'm
>> trying to achieve
>>
>> stream.foreachRDD { rdd =>
>>   rdd.foreachPartition { partition =>
>>     partition.foreach { record =>
>>       // write each record to the sink
>>     }
>>   }
>> }
As Cody says, to achieve true exactly-once, the bookkeeping has to happen
in the sink data system, and even that assumes it's a transactional store.
Wherever possible, we try to make the application idempotent (upsert in
HBase, ignore-on-duplicate for MySQL, etc.), but there are still cases
(analytics, c
e:
> Yes, the partition IDs are the same.
>
> As far as the failure / subclassing goes, you may want to keep an eye on
> https://issues.apache.org/jira/browse/SPARK-10320 , not sure if the
> suggestions in there will end up going anywhere.
>
> On Fri, Sep 25, 2015 at 3:01 PM, Neele
We're planning to use this as well (Dibyendu's
https://github.com/dibbhatt/kafka-spark-consumer ). Dibyendu, thanks for
the efforts. So far it's working nicely. I think there is merit in making it
the default Kafka receiver for Spark Streaming.
-neelesh
On Mon, Feb 2, 2015 at 5:25 PM
There does not seem to be a definitive answer on this. Every time I google
for message ordering, the only relevant thing that comes up is this:
http://samza.apache.org/learn/documentation/0.8/comparisons/spark-streaming.html
With a kafka receiver that pulls data from a single kafka partition of
rate limiting right now is static and cannot adapt to the state of the
cluster.
Thanks!
-neelesh
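The static limits in question are ordinary Spark properties; a sketch,
values purely illustrative:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    // Receiver-based streams: cap records/sec per receiver.
    .set("spark.streaming.receiver.maxRate", "10000")
    // Direct streams (Spark 1.3+): cap records/sec per Kafka partition.
    .set("spark.streaming.kafka.maxRatePerPartition", "10000")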
On Wed, Feb 18, 2015 at 4:13 PM, jay vyas
wrote:
> This is a *fantastic* question. The idea of how we identify individual
> things in multiple DStreams is worth looking at.
>
> The reaso
titions and kafka
> partitions. So you will get deterministic ordering, but only on a
> per-partition basis.
>
> On Thu, Feb 19, 2015 at 11:31 PM, Neelesh wrote:
>
>> I had a chance to talk to TD today at the Strata+Hadoop Conf in San Jose.
>> We talked a bit about thi
eed to as well, for unrelated reasons). Maybe you
> can say a little more about your use case.
>
> But regardless of the technology you're using to read from kafka (spark,
> storm, whatever), kafka only gives you ordering as to a particular
> partition. So you're going to ne
ng individual inserts.
>>
>> Also be aware that output actions aren't guaranteed to happen exactly
>> once, so you'll need to store unique offset ids in mysql or otherwise deal
>> with the possibility of executor failures.
>>
>>
>> On Fri, Feb 20, 2
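A sketch of the transactional pattern described above, writing results and
offsets in the same MySQL transaction; connection details, table names, and
columns are all hypothetical:

  import java.sql.DriverManager
  import org.apache.spark.streaming.kafka.OffsetRange

  def writePartition(osr: OffsetRange, records: Iterator[String]): Unit = {
    val conn =
      DriverManager.getConnection("jdbc:mysql://db:3306/app", "user", "pass")
    try {
      conn.setAutoCommit(false)
      // INSERT IGNORE drops duplicates, assuming a unique key on payload.
      val insert =
        conn.prepareStatement("INSERT IGNORE INTO results (payload) VALUES (?)")
      records.foreach { r => insert.setString(1, r); insert.addBatch() }
      insert.executeBatch()
      // Store the offset in the same transaction as the data.
      val offsets = conn.prepareStatement(
        "REPLACE INTO offsets (topic, part, until_offset) VALUES (?, ?, ?)")
      offsets.setString(1, osr.topic)
      offsets.setInt(2, osr.partition)
      offsets.setLong(3, osr.untilOffset)
      offsets.executeUpdate()
      conn.commit()
    } finally conn.close()
  }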
Hi,
My streaming app uses org.apache.httpcomponent:httpclient:4.3.6, but
Spark uses 4.2.6, and I believe that's what's causing the following error.
I've tried setting
spark.executor.userClassPathFirst & spark.driver.userClassPathFirst to true
in the config, but that does not solve it either. Fina
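When the userClassPathFirst flags do not resolve such a conflict, a common
fallback is to shade the library inside the application jar, e.g. with
sbt-assembly; a sketch (plugin setup omitted, shaded package name
illustrative):

  // build.sbt: relocate httpclient so the app's 4.3.6 cannot collide
  // with the 4.2.6 on Spark's classpath.
  assemblyShadeRules in assembly := Seq(
    ShadeRule.rename("org.apache.http.**" -> "shaded.org.apache.http.@1").inAll
  )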
does it
run on workers?
Any help appreciated.
Thanks!
-neelesh
irect streams
work yet
Another thread: Kafka 0.8.2 supports non-ZK offset management, which I
think is more scalable than bombarding ZK. I'm working on supporting the
new offset management strategy for Kafka with kafka-spark-consumer.
Thanks!
-neelesh
On Wed, Apr 1, 2015 at 9:49 AM
.
Thanks again!
On Wed, Apr 1, 2015 at 10:01 AM, Cody Koeninger wrote:
> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
>
> The kafka consumers run in the executors.
>
> On Wed, Apr 1, 2015 at 11:18 AM, Neelesh wrote:
>
>> With receivers, it was p
ate so you can't subclass it without
> building your own Spark.
>
> On Wed, Apr 1, 2015 at 1:09 PM, Neelesh wrote:
>
>> Thanks Cody, that was really helpful. I have a much better understanding
>> now. One last question - Kafka topics are initialized once in the driver,
>
his gets called every
>> batch interval before the offsets are decided. This would allow users to
>> add topics, delete topics, modify partitions on the fly.
>>
>> What do you think Cody?
>>
>>
>>
>>
>> On Wed, Apr 1, 2015 at 11:57 AM, Neelesh wrote
Somewhat agree on subclassing and its issues. It looks like the alternative
in Spark 1.3.0 is to create a custom build. Is there an enhancement filed for
this? If not, I'll file one.
Thanks!
-neelesh
On Wed, Apr 1, 2015 at 12:46 PM, Tathagata Das wrote:
> The challenge of opening
ls with
the above error.
Any advice on how I can go about debugging/solving this?
--
Regards,
Neelesh S. Salian
ion on how to run a Spark job written in Scala through Oozie.
>>
>> Would really appreciate the help.
>>
>>
>>
>> Thanks,
>> Divya
>>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>
>
>
--
Neelesh Srinivas Salian
Customer Operations Engineer
> machine learning etc. or wait for a new edition covering the new concepts
>>> like dataframe and datasets. Anyone got any suggestions?
>>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
--
Regards,
Neelesh S. Salian
nabble.com/checkpointing-without-streaming-tp4541p28691.html
--
Regards,
Neelesh Sambhajiche
Mobile: 8058437181
Birla Institute of Technology & Science, Pilani
Pilani Campus, Rajasthan 333 031, INDIA
Does it still hit the memory limit for the container?
An expensive transformation?
On Wed, Apr 22, 2015 at 8:45 AM, Ted Yu wrote:
> In master branch, overhead is now 10%.
> That would be 500 MB
>
> FYI
>
>
>
> > On Apr 22, 2015, at 8:26 AM, nsalian wrote:
> >
> > +1 to executor-memory to 5g.
>
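For concreteness: with --executor-memory 5g and the 10% overhead quoted
above, the YARN container must hold roughly 5 GB + 0.10 x 5 GB ~ 5.5 GB, so
a container limit below that will still be exceeded.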