Apparently the number set in maxFilesPerTrigger doesn't have any effect
on scaling at all. Again, if all file reading is done by a single node,
then Spark streaming isn't really designed for doing real-time processing
at all, because that single node becomes a bottleneck...
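For reference, the kind of setup in question looks roughly like this (a sketch
only; the paths, schema, and numbers are placeholders, and maxFilesPerTrigger
is the standard file-source option):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{StructType, StringType, DoubleType}

    val spark = SparkSession.builder().appName("file-stream").getOrCreate()

    // Explicit schema so the stream doesn't need an inference pass over the directory.
    val schema = new StructType()
      .add("id", StringType)
      .add("value", DoubleType)

    val input = spark.readStream
      .schema(schema)
      .option("maxFilesPerTrigger", "100")   // cap on files picked up per micro-batch
      .csv("/data/stream-source")            // placeholder path

    // Repartition so the downstream work is spread across executors,
    // whatever happens on the read side.
    val query = input.repartition(32)
      .writeStream
      .format("parquet")
      .option("path", "/data/out")
      .option("checkpointLocation", "/data/chk")
      .start()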
On 10/16/20 3:47 PM
That's exactly what my question was: whether Spark can do parallel reads, not
dataframe-driven parallel query or processing, because our ML query is
very simple, but the data ingestion part seems to be the bottleneck.
Can someone confirm that Spark just can't do parallel reads? If not,
what would b
Once you are talking about ML, you aren’t talking about “simple”
transformations. Spark is a good platform to do ML on. You can easily configure
Spark to read your data on one node, and then run the ML transformations in parallel
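A minimal sketch of what I mean (the feature columns, model choice, and
partition count are made up; the point is the repartition between the read and
the ML stages):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.classification.LogisticRegression

    val spark = SparkSession.builder().appName("read-then-ml").getOrCreate()

    // The read itself may effectively happen on one node for a small input...
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/input")                     // placeholder path

    // ...but once repartitioned, the ML stages run across the whole cluster.
    val data = raw.repartition(64)

    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2", "f3"))  // placeholder feature columns
      .setOutputCol("features")

    val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

    val model = new Pipeline().setStages(Array(assembler, lr)).fit(data)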
From: Artemis User
Date: Friday, October 16, 2020 at 3:52 PM
To: "user
We can't use AWS since the target production environment has to be on-prem. The
reason we chose Spark is because of its ML libraries. Lambda would be
a great model for stream processing from a functional programming
perspective. Not sure how well it can be integrated with Spark ML or
other ML libraries.
With a file-based source, Spark is going to make maximum use of memory before
it tries to scale to more nodes. Parallelization adds overhead. This overhead
is negligible if your data is several gigs or above. If your entire data can
fit into the memory of one node, then it’s better to process everything there
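To make the trade-off concrete, these are the knobs that control how a file
input gets split into read tasks (values are illustrative, not a
recommendation; the path is a placeholder):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("partition-sizing")
      // Smaller target split size => more read tasks => more parallelism,
      // at the cost of per-task overhead (default is 128MB).
      .config("spark.sql.files.maxPartitionBytes", "64m")
      // Estimated cost of opening a file; small files get packed together up to this.
      .config("spark.sql.files.openCostInBytes", "4m")
      .getOrCreate()

    val df = spark.read.parquet("/data/big")   // placeholder path
    println(df.rdd.getNumPartitions)           // how many read tasks you actually get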
Thank you all for the responses. Basically we were dealing with a file
source (not Kafka, therefore no topics involved) and dumping CSV files
(about 1000 lines, 300KB per file) at a pretty high speed (10-15
files/second), one at a time, to the stream source directory. We have a
Spark 3.0.1 cluster
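A sketch of how such fast-arriving small files could be bundled per micro-batch
(paths, schema, and numbers are placeholders, not our actual job):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger
    import org.apache.spark.sql.types.{StructType, StringType, DoubleType}

    val spark = SparkSession.builder().appName("csv-ingest").getOrCreate()

    val schema = new StructType().add("id", StringType).add("value", DoubleType)  // placeholder columns

    val stream = spark.readStream
      .schema(schema)
      .option("maxFilesPerTrigger", "300")   // let each micro-batch pick up many of the small files
      .csv("/data/stream-source")            // placeholder path

    stream.writeStream
      .trigger(Trigger.ProcessingTime("20 seconds"))  // trigger on an interval, not per file
      .format("parquet")
      .option("path", "/data/out")
      .option("checkpointLocation", "/data/chk")
      .start()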
Is there a way to disable locality-based container allocation for YARN
dynamic allocation?
The issue we're running into is that we have several long-running structured
streaming jobs running, all using dynamic allocation so they free up
resources in between batches. However, if we have a lot of data
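The kind of configuration in play here is roughly the following (a sketch only;
all values are illustrative, and whether spark.locality.wait also affects the
locality-preferred container requests on the YARN side is an assumption):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("long-running-stream")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
      .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
      .config("spark.dynamicAllocation.minExecutors", "1")
      .config("spark.dynamicAllocation.maxExecutors", "50")
      // Don't have the task scheduler wait for node/rack-local executors.
      .config("spark.locality.wait", "0s")
      .getOrCreate()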