Re: Spark structured streaming - performance tuning

2020-05-08 Thread Srinivas V
Can anyone else answer the questions below on performance tuning Structured Streaming? @Jacek?

Re: Spark structured streaming - performance tuning

2020-05-02 Thread Srinivas V
Hi Alex, I read the book; it is a good one, but I don't see the things I strongly want to understand. You are right about partitions and tasks. 1. How to use coalesce with Spark structured streaming? Also I want to ask a few more questions: 2. How to restrict the number of executors on structured streaming

Re: Spark structured streaming - performance tuning

2020-04-18 Thread Alex Ott
Just to clarify - I didn't write this explicitly in my answer. When you're working with Kafka, every partition in Kafka is mapped to a Spark partition, and in Spark, every partition is mapped to a task. But you can use `coalesce` to decrease the number of Spark partitions, so you'll have fewer tasks
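[Editor's note: to make the Kafka-partition -> Spark-partition -> task mapping concrete, here is a plain-Python sketch of what `coalesce` does conceptually. This is deliberately simplified illustrative code, not Spark's implementation: it merges consecutive partitions into fewer, larger ones without reshuffling individual records.]

```python
# Conceptual sketch (plain Python, not Spark) of coalesce(k): merge
# existing partitions into k larger ones without a full shuffle, so
# e.g. 500 Kafka-driven partitions become 50 tasks per micro-batch.

def coalesce(partitions, k):
    """Group len(partitions) partitions into k larger ones by
    concatenating consecutive partitions (no record-level shuffle)."""
    n = len(partitions)
    if k >= n:
        return partitions
    result = [[] for _ in range(k)]
    for i, part in enumerate(partitions):
        # Assign input partition i to output bucket i * k // n,
        # preserving the original partition order.
        result[i * k // n].extend(part)
    return result

parts = [[i] for i in range(500)]      # 500 single-record "partitions"
merged = coalesce(parts, 50)
print(len(merged))                     # 50 partitions -> 50 tasks
print(sum(len(p) for p in merged))     # all 500 records survive
```

In real Spark the same idea applies: fewer partitions means fewer (but larger) tasks per micro-batch, trading scheduling overhead for per-task work.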

Re: Spark structured streaming - performance tuning

2020-04-17 Thread Srinivas V
Thank you Alex. I will check it out and let you know if I have any questions.

Re: Spark structured streaming - performance tuning

2020-04-17 Thread Alex Ott
http://shop.oreilly.com/product/0636920047568.do has quite good information on it. For Kafka, you need to start with the approximation that processing of each partition is a separate task that needs to be executed, so you need to plan the number of cores correspondingly.

Spark structured streaming - performance tuning

2020-04-16 Thread Srinivas V
Hello, can someone point me to a good video or document which talks about performance tuning for a Structured Streaming app? I am looking especially at listening to Kafka topics, say 5 topics each with 100 partitions. Trying to figure out the best cluster size and number of executors and cores required.
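[Editor's note: a rough sizing sketch for the numbers in this question (5 topics x 100 partitions), following the one-partition-one-task rule of thumb from the replies. The executor and core counts below are illustrative assumptions, not recommendations from the thread.]

```python
# Back-of-the-envelope sizing: one Kafka partition -> one Spark task,
# so per-micro-batch parallelism is bounded by partitions and cores.

topics = 5
partitions_per_topic = 100
total_tasks = topics * partitions_per_topic   # tasks per micro-batch

executors = 25          # assumed cluster choice (illustrative)
cores_per_executor = 4  # assumed cluster choice (illustrative)
total_cores = executors * cores_per_executor

# Number of sequential "waves" of tasks per micro-batch:
waves = -(-total_tasks // total_cores)  # ceiling division

print(total_tasks)   # 500
print(total_cores)   # 100
print(waves)         # 5: each core runs ~5 tasks per batch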

Structured streaming performance issues

2019-02-21 Thread gvdongen
Hi everyone, I have the following pipeline: ingest 2 streams from Kafka -> parse JSON -> join both streams -> aggregate on a key over the last second -> output to Kafka. Join: inner join on an interval of one second, with watermarking of 50 ms. Aggregation: tumbling window of one second, with watermarking
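[Editor's note: a plain-Python illustration (not Spark code) of the tumbling-window semantics used in this pipeline: a one-second tumbling window assigns each event to exactly one non-overlapping [start, end) interval based on its event time.]

```python
# Tumbling window assignment: each event-time falls in exactly one
# non-overlapping window of fixed width.

def tumbling_window(event_time_ms, width_ms=1000):
    start = (event_time_ms // width_ms) * width_ms
    return (start, start + width_ms)

print(tumbling_window(1250))   # (1000, 2000)
print(tumbling_window(1999))   # (1000, 2000)
print(tumbling_window(2000))   # (2000, 3000)
```

The watermark then controls how long Spark keeps a window's state around waiting for late events before finalizing and emitting it.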

RE: streaming performance

2016-12-25 Thread Mendelson, Assaf
for outer join etc.). Assaf. From: Tathagata Das. Sent: Friday, December 23, 2016 2:46 AM. To: Mendelson, Assaf. Cc: user. Subject: Re: streaming performance. From what I understand looking at the code on Stack Overflow, I think you are "simulating" the

Re: streaming performance

2016-12-22 Thread Tathagata Das
This would process each file one by one, maintain internal state to continuously update the aggregates, and never require reprocessing the old data. Hope this helps. On Wed, Dec 21, 2016 at 7:58 AM, Mendelson, Assaf wrote:

streaming performance

2016-12-21 Thread Mendelson, Assaf
I am having trouble with streaming performance. My main problem is how to do a sliding window calculation where the ratio between the window size and the step size is relatively large (hundreds) without recalculating everything all the time. I created a simple example of what I am aiming at with
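[Editor's note: the incremental alternative this thread is after can be sketched in plain Python (my own illustrative code, not the example from the post): for a sliding sum where the window is hundreds of steps wide, keep a running total and subtract the values that slide out, instead of re-summing the whole window at every step.]

```python
from collections import deque

def sliding_sums(values, window, step):
    """Yield the sum of each `window`-sized window, advancing by `step`
    items, updating the previous sum incrementally (O(1) per element
    instead of O(window) per step)."""
    buf = deque()
    total = 0
    sums = []
    for i, v in enumerate(values):
        buf.append(v)
        total += v
        if len(buf) > window:          # evict the value that slid out
            total -= buf.popleft()
        if len(buf) == window and (i - window + 1) % step == 0:
            sums.append(total)
    return sums

print(sliding_sums([1, 2, 3, 4, 5, 6], window=4, step=1))  # [10, 14, 18]
```

With window/step ratios in the hundreds, this turns work proportional to window size into work proportional to step size, which is the saving the poster is looking for.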

Re: Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-07-14 Thread Sunita Arvind

Re: Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-07-14 Thread CosminC
Spark 2.0, hoping the partitioning issue is no longer present there.

Re: Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-07-13 Thread Dibyendu Bhattacharya
Attempting memory settings as mentioned at http://spark.apache.org/docs/latest/configuration.html#memory-management, but it's not making a lot of difference. Appreciate your inputs on this.

Re: Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-07-13 Thread Sunita

Re: can I use ExectorService in my driver? was: is dataframe.write() async? Streaming performance problem

2016-07-08 Thread Ewan Leith
Writing (or reading) small files from Spark to S3 can be seriously slow. You'll get much higher throughput by doing a df.foreachPartition(partition => ...) and inside each partition
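[Editor's note: the pattern Ewan describes — amortize expensive setup such as an S3/HTTP client over a whole partition instead of paying it per record — sketched in plain Python. The class and function names here are illustrative stand-ins, not Spark or AWS APIs.]

```python
# Simulate "one client per partition" vs "one client per record".

class FakeClient:
    opens = 0                      # count how many clients get created
    def __init__(self):
        FakeClient.opens += 1
    def write_batch(self, records):
        return len(records)

def for_each_partition(partitions, fn):
    for p in partitions:           # in Spark this runs on the executors
        fn(p)

partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

def handle_partition(partition):
    client = FakeClient()          # one client per partition, not per record
    client.write_batch(list(partition))

for_each_partition(partitions, handle_partition)
print(FakeClient.opens)            # 3 clients for 9 records, not 9
```

The same shape applies to database connections or any per-connection overhead: setup cost is paid once per partition rather than once per row.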

can I use ExectorService in my driver? was: is dataframe.write() async? Streaming performance problem

2016-07-08 Thread Andy Davidson
extra overhead. Thanks, Andy.

RE: is dataframe.write() async? Streaming performance problem

2016-07-08 Thread Ewan Leith

Re: is dataframe.write() async? Streaming performance problem

2016-07-08 Thread Cody Koeninger
Maybe obvious, but what happens when you change the S3 write to a println of all the data? That should identify whether it's the issue. count() and read.json() will involve additional tasks (run through the items in the RDD to count them, likewise to infer the schema), but for 300 records that sho

is dataframe.write() async? Streaming performance problem

2016-07-07 Thread Andy Davidson
I am running Spark 1.6.1 built for Hadoop 2.0.0-mr1-cdh4.2.0 and using the Kafka direct stream approach. I am running into performance problems. My processing time is greater than my window size. Changing window sizes, adding cores and executor memory does not change performance. I am having a lot of trouble
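[Editor's note: the symptom here — processing time exceeding the batch window — is the classic instability condition for micro-batch streaming. A minimal sketch of the check, with illustrative numbers that are not from the original post:]

```python
# A micro-batch pipeline is only stable if each batch is processed
# faster than batches arrive; otherwise batches queue up and end-to-end
# latency grows without bound.

def is_keeping_up(processing_time_s, batch_interval_s):
    return processing_time_s < batch_interval_s

print(is_keeping_up(12.0, 10.0))   # False: falling behind, queue grows
print(is_keeping_up(7.5, 10.0))    # True: 25% headroom
```

This is why adding cores or memory may not help: if the bottleneck is a serial stage or an external sink, processing time stays above the interval regardless of cluster size.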

Re: [REPOST] Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-06-07 Thread Daniel Darabos

Re: [REPOST] Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-06-05 Thread Daniel Darabos

Re: [REPOST] Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-06-04 Thread Cosmin Ciobanu
is anyone working on Spark Streaming available? Cosmin

RE: [REPOST] Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-06-03 Thread David Newberger
What does your processing time look like? Is it consistently within that 20 sec micro-batch window? David Newberger

[REPOST] Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-06-03 Thread Adrian Tanase
Hi all, Trying to repost this question from a colleague on my team, somehow his subscription is not active: http://apache-spark-user-list.1001560.n3.nabble.com/Severe-Spark-Streaming-performance-degradation-after-upgrading-to-1-6-1-td27056.html Appreciate any thoughts, -adrian

Re: Streaming Performance w/ UpdateStateByKey

2015-10-10 Thread Adrian Tanase
How are you determining how much time serialization is taking? I made this change in a streaming app that relies heavily on updateStateByKey. The memory consumption went up 3x on the executors but I can't see any perf improvement. Task execution time is the same and the serialization state metric

Re: Streaming Performance w/ UpdateStateByKey

2015-10-05 Thread Tathagata Das
You could call DStream.persist(StorageLevel.MEMORY_ONLY) on the stateDStream returned by updateStateByKey to achieve the same. As you have seen, the downside is greater memory usage, and also higher GC overheads (that's the main one, usually). So I suggest you run your benchmarks for a long enough time
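[Editor's note: a rough plain-Python stand-in for the trade-off being discussed, using pickle as an analogue for JVM serialization. This is not Spark code: caching state deserialized (MEMORY_ONLY) skips per-access decoding but holds full live objects in memory; caching it serialized (MEMORY_ONLY_SER) is more compact but pays a decode on every read.]

```python
import pickle

# Some per-key streaming state, as a stand-in for updateStateByKey state.
state = {f"key{i}": list(range(10)) for i in range(100)}

# "MEMORY_ONLY_SER"-style cache: one compact byte blob, decode per access.
serialized_cache = pickle.dumps(state)
read_back = pickle.loads(serialized_cache)   # deserialization cost per read

# "MEMORY_ONLY"-style cache: keep the live object, reads are free,
# but memory usage and GC pressure are higher.
deserialized_cache = state

print(read_back == deserialized_cache)       # True: same data either way
print(type(serialized_cache).__name__)       # bytes
```

Which side wins depends on state size vs. access frequency, which is why the advice in the thread is to benchmark over a long enough run.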

Streaming Performance w/ UpdateStateByKey

2015-10-05 Thread Jeff Nadler
While investigating performance challenges in a Streaming application using updateStateByKey, I found that serialization of state was a meaningful (though not dominant) portion of our execution time. In StateDStream.scala, serialized persistence is required: super.persist(StorageLevel.MEMORY_ONLY_SER)

Re: spark streaming performance

2015-07-09 Thread Tathagata Das
I am not sure why you are getting NODE_LOCAL and not PROCESS_LOCAL. Also, there is probably no good documentation other than that configuration page - http://spark.apache.org/docs/latest/configuration.html (search for locality). On Thu, Jul 9, 2015 at 5:51 AM, Michel Hubert wrote:

Re: spark streaming performance

2015-07-09 Thread Tathagata Das
What was the number of cores in the executor? It could be that you had only one core in the executor, which did all the 50 tasks serially, so 50 tasks x 15 ms = ~1 second. Could you take a look at the task details in the stage page to see when the tasks were added, and whether that explains the 5 se
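[Editor's note: the arithmetic behind this diagnosis, written out. The 50 tasks x 15 ms figures are from the thread; the 8-core comparison is an illustrative assumption.]

```python
# With one core per executor, tasks in a stage run serially, so the
# stage's wall time is roughly (sequential waves) x (per-task time).

def stage_wall_time_ms(tasks, task_ms, cores):
    waves = -(-tasks // cores)      # ceiling division: sequential waves
    return waves * task_ms

print(stage_wall_time_ms(50, 15, cores=1))   # 750 ms ~ most of a second
print(stage_wall_time_ms(50, 15, cores=8))   # 105 ms with 8 cores
```

This is why checking the executor core count is the first step: the same 50 tasks that take ~1 second serially finish in a small fraction of that with modest parallelism.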

spark streaming performance

2015-07-09 Thread Michel Hubert
Hi, I've developed a POC Spark Streaming application, but it seems to perform better on my development machine than on our cluster. I submit it to YARN on our Cloudera cluster. But my first question is more detailed: in the application UI (:4040), I see in the streaming section that the batch p