Spark distributes the load to executors, and executors are usually
pre-configured with a fixed number of cores. You may want to check with
your Spark admin on how many executors (or workers) your Spark cluster is
configured with and how many cores each executor is given.
The main debugging tool for performance tuning in Spark is the built-in
Web UI.
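
For reference, here is a minimal Scala sketch of how executor count and
cores per executor can be set through the standard config keys; the
numbers are placeholders for illustration (on YARN or Kubernetes these
values are usually passed at spark-submit time instead), not
recommendations:

    import org.apache.spark.sql.SparkSession

    // Placeholder sizing; ask your Spark admin for the real values.
    val spark = SparkSession.builder()
      .appName("my-streaming-app")
      .config("spark.executor.instances", "16")  // number of executors
      .config("spark.executor.cores", "8")       // cores per executor -> 128 task slots
      .getOrCreate()

The Executors tab of the Web UI will then show whether those task slots
are actually being filled with active tasks.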
The level of parallelism in Structured Streaming isn't as
straightforward as in standard ETL processing. It depends on the data
source, the streaming mode (continuous or micro-batch), your trigger
interval, and so on. We have experienced similar scaling problems with
Structured Streaming.

Please note that Spark is designed for processing large chunks of data,
not for streaming data one small piece at a time. It does not like small
pieces of data (the default partition size is 128 MB), period! The
partition mechanism and the RDD-driven DAG job scheduler are all
designed for processing large-scale data for ETL, so Spark has to
accumulate streaming data into a large chunk before scaling can take
place. Apparently Spark can't distribute the read operation either (only
one worker does the read, which has to do with preserving the order of
the stream), so data ingestion becomes a bottleneck that prevents
scaling further down the chain. An alternative may be to look into other
streaming frameworks, such as Apache Ignite.
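
If you want to experiment, below is a rough Scala sketch of the knobs
that usually influence micro-batch parallelism; the Kafka source,
broker, topic name, and values are assumptions for illustration, not a
verified fix for your case:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder().appName("parallelism-sketch").getOrCreate()
    // Downstream (shuffle) parallelism; try matching it to your total cores.
    spark.conf.set("spark.sql.shuffle.partitions", "128")

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // hypothetical broker
      .option("subscribe", "events")                       // hypothetical topic
      .option("maxOffsetsPerTrigger", "1000000")           // cap records per micro-batch
      .load()
      .repartition(128)  // spread work across cores after the read

    val query = events.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("30 seconds"))  // micro-batch interval
      .start()
    query.awaitTermination()

For the Kafka source the read parallelism generally follows the number
of topic partitions, so a topic with only a few partitions can also keep
the read stage small no matter how many cores are available.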
-- ND
On 10/29/20 8:02 PM, Eric Beabes wrote:
We're using Spark 2.4. We recently pushed to production a product
that's using Spark Structured Streaming. It's working well most of the
time but occasionally, when the load is high, we've noticed that there
are only 10+ 'Active Tasks' even though we've provided 128 cores.
We would like to debug this further. Why aren't all the cores being
used? How do we debug this? Please help. Thanks.