Can anyone else answer the questions below on performance tuning for Structured Streaming?
@Jacek?
On Sun, May 3, 2020 at 12:07 AM Srinivas V wrote:
Hi Alex, I read the book; it is a good one, but I don't see the things I strongly want to understand.
You are right about the partitions and tasks.
1. How to use coalesce with Spark Structured Streaming?
Also, I want to ask a few more questions:
2. How to restrict the number of executors on a Structured Streaming job?
Just to clarify - I didn't write this explicitly in my answer. When you're
working with Kafka, every partition in Kafka is mapped to a Spark partition,
and in Spark, every partition is mapped to a task. But you can use `coalesce`
to decrease the number of Spark partitions, so you'll have fewer tasks.
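A minimal sketch of what that could look like (the broker, topic, and target partition count below are made-up placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("coalesce-sketch").getOrCreate()

// by default you get one Spark partition (and one task) per Kafka partition
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

// collapse to fewer partitions, so each micro-batch runs fewer, larger tasks
val narrowed = kafkaDf.coalesce(20)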
Thank you Alex. I will check it out and let you know if I have any questions
On Fri, Apr 17, 2020 at 11:36 PM Alex Ott wrote:
http://shop.oreilly.com/product/0636920047568.do has quite good information
on it. For Kafka, you need to start with the approximation that processing of
each partition is a separate task that needs to be executed, so you need to
plan the number of cores correspondingly.
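As a rough worked example against the numbers in the question below (5 topics with 100 partitions each; the core count is just an assumption): 5 x 100 = 500 Kafka partitions, i.e. roughly 500 tasks per micro-batch. With, say, 100 executor cores, those tasks run in about 5 waves, so 5 times the per-partition processing time has to fit inside the trigger interval.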
Srinivas V wrote on Thu, 16 Apr 2020:
Hello,
Can someone point me to a good video or document that talks about
performance tuning for a Structured Streaming app?
I am looking especially at listening to Kafka topics, say 5 topics, each
with 100 partitions.
Trying to figure out the best cluster size and the number of executors and
cores required.
Hi everyone,
I have the following pipeline:
Ingest 2 streams from Kafka -> parse JSON -> join both streams -> aggregate
on a key over the last second -> output to Kafka
with:
Join: inner join on an interval of one second, with 50 ms watermarking
Aggregation: tumbling window of one second, with watermarking
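A minimal sketch of that pipeline's shape in Structured Streaming (topics, columns, schema, and the aggregation details below are assumptions for illustration, not taken from the original message; chaining a stream-stream join with an aggregation also needs a Spark version that supports multiple stateful operators):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("join-agg-sketch").getOrCreate()

// assumed JSON payload: {"key": ..., "value": ...}
val schema = new StructType().add("key", StringType).add("value", DoubleType)

def readTopic(topic: String) = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
  .option("subscribe", topic)
  .load()
  .select(from_json(col("value").cast("string"), schema).as("payload"), col("timestamp"))

val left  = readTopic("topicA")
  .select(col("payload.key").as("keyA"), col("payload.value").as("valA"), col("timestamp").as("tsA"))
val right = readTopic("topicB")
  .select(col("payload.key").as("keyB"), col("payload.value").as("valB"), col("timestamp").as("tsB"))

// inner join constrained to a one-second interval, with 50 ms watermarks on both sides
val joined = left.withWatermark("tsA", "50 milliseconds")
  .join(right.withWatermark("tsB", "50 milliseconds"),
        expr("keyA = keyB AND tsB BETWEEN tsA AND tsA + interval 1 second"))

// tumbling one-second window, aggregated per key
val aggregated = joined
  .groupBy(window(col("tsA"), "1 second"), col("keyA"))
  .agg(sum("valA").as("total"))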
... for outer join etc.).
Assaf.
From: Tathagata Das [mailto:tathagata.das1...@gmail.com]
Sent: Friday, December 23, 2016 2:46 AM
To: Mendelson, Assaf
Cc: user
Subject: Re: streaming performance
From what I understand looking at the code in stackoverflow, I think you are
"simulating" the
%20Streaming%20using%20Scala%20DataFrames%20API.html
This would process each file one by one, maintain internal state to
continuously update the aggregates, and never require reprocessing the old
data.
Hope this helps
On Wed, Dec 21, 2016 at 7:58 AM, Mendelson, Assaf
wrote:
I am having trouble with streaming performance. My main problem is how to do a
sliding window calculation where the ratio between the window size and the step
size is relatively large (hundreds) without recalculating everything all the
time.
I created a simple example of what I am aiming at with
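For context, the shape being described, a sliding window whose slide is much smaller than its length, looks roughly like this in the DataFrame API (names and durations below are made up):

import org.apache.spark.sql.functions._

// assumed: a DataFrame `events` with an event-time column "ts", a "key", and a numeric "value";
// a 10-minute window sliding every 2 seconds means every event falls into 300 overlapping
// windows, which is why naive recomputation of each window is so expensive
val slidingAgg = events
  .groupBy(window(col("ts"), "10 minutes", "2 seconds"), col("key"))
  .agg(sum("value").as("total"))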
... Spark 2.0, hoping the partitioning issue is no longer present there.
> Attempting memory settings as mentioned at
> http://spark.apache.org/docs/latest/configuration.html#memory-management
> but it's not making a lot of difference. Appreciate your inputs on this.
Cc: "user @spark" <user@spark.apache.org>
Subject: RE: is dataframe.write() async? Streaming performance problem
Writing (or reading) small files from Spark to S3 can be seriously slow.
You'll get much higher throughput by doing a df.foreachPartition(partition =>
...) and inside each partition
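A rough sketch of that pattern (assuming `df` is the DataFrame being written; the upload itself is left as a placeholder rather than a specific S3 API call):

// one write per partition instead of many small per-record files
df.rdd.foreachPartition { partition =>
  if (partition.nonEmpty) {
    val body = partition.map(_.mkString(",")).mkString("\n")
    // upload `body` with your S3 client of choice, e.g. a single PUT per partition
    // s3Client.putObject("my-bucket", s"output/part-${java.util.UUID.randomUUID}", body)
  }
}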
extra overhead.
Thanks
Andy
From: Ewan Leith
Date: Friday, July 8, 2016 at 8:52 AM
To: Cody Koeninger, Andrew Davidson
Cc: "user @spark"
Subject: RE: is dataframe.write() async? Streaming performance problem
is dataframe.write() async? Streaming performance problem
Maybe obvious, but what happens when you change the S3 write to a
println of all the data? That should identify whether it's the issue.
count() and read.json() will involve additional tasks (run through the
items in the RDD to count them, likewise to infer the schema), but for
300 records that shouldn't be significant.
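In other words, something like the following as a temporary debugging step (assuming `df` is the DataFrame that was being written to S3):

// df.write.json("s3://...")      // the suspected slow step
df.collect().foreach(println)     // fine for ~300 records, not for real volumes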
I am running Spark 1.6.1 built for Hadoop 2.0.0-mr1-cdh4.2.0 and using the
Kafka direct stream approach. I am running into performance problems. My
processing time is greater than my window size. Changing window sizes, adding
cores and executor memory does not change performance. I am having a lot of
trouble
We can meet and go into more details there - is anyone working on Spark
Streaming available?
Cosmin
From: Mich Talebzadeh
Date: Saturday 4 June 2016 at 12:33
To: Florin Broască
Cc: David Newberger, Adrian Tanase <atan...@adobe.com>, "user@spark.apache.org", ciobanu
Subject: Re: [REPOST] Severe Spark Streaming performance degradation after upgrading to 1.6.1
What does your processing time look like? Is it consistently within that 20 sec
micro-batch window?
David Newberger
From: Adrian Tanase [mailto:atan...@adobe.com]
Sent: Friday, June 3, 2016 8:14 AM
To: user@spark.apache.org
Cc: Cosmin Ciobanu
Subject: [REPOST] Severe Spark Streaming performance degradation after upgrading to 1.6.1
Hi all,
Trying to repost this question from a colleague on my team; somehow his
subscription is not active:
http://apache-spark-user-list.1001560.n3.nabble.com/Severe-Spark-Streaming-performance-degradation-after-upgrading-to-1-6-1-td27056.html
Appreciate any thoughts,
-adrian
How are you determining how much time serialization is taking?
I made this change in a streaming app that relies heavily on updateStateByKey.
The memory consumption went up 3x on the executors, but I can't see any perf
improvement. Task execution time is the same, and the serialization state metric
You could call DStream.persist(StorageLevel.MEMORY_ONLY) on the
stateDStream returned by updateStateByKey to achieve the same. As you have
seen, the downside is greater memory usage, and also higher GC overheads
(that's the main one usually). So I suggest you run your benchmarks for a
long enough time.
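In code form, that suggestion looks roughly like this (the DStream name and state type below are made-up examples):

import org.apache.spark.storage.StorageLevel

// assumed: `events` is a DStream[(String, Long)]
val updateFunc = (values: Seq[Long], state: Option[Long]) =>
  Some(values.sum + state.getOrElse(0L))

val stateDStream = events.updateStateByKey(updateFunc)
// override the default MEMORY_ONLY_SER persistence of the state DStream
stateDStream.persist(StorageLevel.MEMORY_ONLY)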
While investigating performance challenges in a Streaming application using
updateStateByKey, I found that serialization of state was a meaningful (not
dominant) portion of our execution time.
In StateDStream.scala, serialized persistence is required:
super.persist(StorageLevel.MEMORY_ONLY_SER)
I am not sure why you are getting node_local and not process_local. Also,
there is probably no good documentation other than that configuration
page - http://spark.apache.org/docs/latest/configuration.html (search for
locality).
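For reference, the locality-related settings on that page can be set like this (the 3s values mirror the documented defaults; they are not a recommendation):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.locality.wait", "3s")          // base wait before launching a less-local task
  .set("spark.locality.wait.process", "3s")  // PROCESS_LOCAL -> NODE_LOCAL fallback
  .set("spark.locality.wait.node", "3s")     // NODE_LOCAL -> RACK_LOCAL fallback
  .set("spark.locality.wait.rack", "3s")     // RACK_LOCAL -> ANY fallback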
On Thu, Jul 9, 2015 at 5:51 AM, Michel Hubert
wrote:
What was the number of cores in the executor? It could be that you had
only one core in the executor, which did all 50 tasks serially, so 50
tasks x 15 ms = ~ 1 second.
Could you take a look at the task details in the stage page, to see when the
tasks were added, and whether that explains the 5 seconds?
Hi,
I've developed a POC Spark Streaming application.
But it seems to perform better on my development machine than on our cluster.
I submit it to YARN on our Cloudera cluster.
But my first question is more detailed:
In the application UI (:4040), I see in the streaming section that the batch
processing