Apparently the number set in maxFilesPerTrigger doesn't have any effect
on scaling at all. Again, if all file reading is done by a single node,
then Spark streaming isn't really suited for real-time processing
at all, because that single node becomes a bottleneck...
On 10/16/20 3:47 PM
That's exactly what my question was: whether Spark can do parallel reads, not
data-frame-driven parallel queries or processing, because our ML query is
very simple, but the data ingestion part seems to be the bottleneck.
Can someone confirm that Spark just can't do parallel reads? If not,
what would be ...
Once you are talking about ML, you aren’t talking about “simple”
transformations. Spark is a good platform to do ML on. You can easily configure
Spark to read your data on one node and then run the ML transformations in parallel.
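For illustration, a minimal PySpark sketch of that read-then-repartition-then-ML pattern (the path, feature columns, and partition count below are placeholders, not from this thread):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("read-then-parallel-ml").getOrCreate()

# The read of a small file may run as a single task, but repartitioning
# spreads the rows across the cluster before the ML stages run.
df = spark.read.csv("/data/incoming", header=True, inferSchema=True)
df = df.repartition(8)   # illustrative partition count

# Hypothetical feature/label columns; substitute the real ones.
features = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
model = LogisticRegression(labelCol="label").fit(features.transform(df))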
From: Artemis User
Date: Friday, October 16, 2020 at 3:52 PM
To: "user
We can't use AWS since the target production has to be on-prem. The
reason we chose Spark is because of its ML libraries. Lambda would be
a great model for stream processing from a functional programming
perspective. Not sure how well it can be integrated with Spark ML or
other ML libraries.
With a file-based source, Spark is going to make maximum use of memory before
it tries to scale to more nodes. Parallelization adds overhead. This overhead
is negligible if your data is several gigs or above. If your entire data can
fit into the memory of one node, then it’s better to process everything on one node.
Thank you all for the responses. Basically we were dealing with a file
source (not Kafka, therefore no topics involved) and dumping CSV files
(about 1,000 lines, 300 KB per file) at a pretty high rate (10 - 15
files/second), one at a time, into the stream source directory. We have a
Spark 3.0.1 cluster ...
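For reference, a sketch of the kind of file-source stream described above (assuming PySpark; the directory path and schema below are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("csv-file-stream").getOrCreate()

# Streaming file sources require an explicit schema.
schema = StructType([
    StructField("id", StringType()),
    StructField("value", DoubleType()),
])

stream_df = (spark.readStream
             .schema(schema)
             .option("header", "true")
             .csv("/data/stream-source"))   # placeholder directory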
For file streaming in Structured Streaming, you can try setting "maxFilesPerTrigger" per batch.
foreachBatch is an action; the output is written to various sinks. Are
you doing any post-read transformations in foreachBatch?
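A sketch of what that suggestion might look like (assuming PySpark; the file cap, paths, trigger interval, and batch-function body are illustrative, and spark/schema are as in the earlier sketch):

# Cap how many new files each micro-batch picks up.
stream_df = (spark.readStream
             .schema(schema)
             .option("header", "true")
             .option("maxFilesPerTrigger", 10)   # illustrative cap
             .csv("/data/stream-source"))

def process_batch(batch_df, batch_id):
    # Post-read transformations / ML scoring on each micro-batch.
    batch_df.repartition(8).write.mode("append").parquet("/data/output")

query = (stream_df.writeStream
         .foreachBatch(process_batch)
         .option("checkpointLocation", "/data/checkpoint")
         .trigger(processingTime="10 seconds")
         .start())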
On Thu, Oct 15, 2020 at 1:24 PM Mich Talebzadeh wrote:
> Hi,
>
> This in general depends on how ...
Hi,
This in general depends on how many topics you want to process at the same
time and whether this is done on-premise running Spark in cluster mode.
Have you looked at the Spark GUI to see if one worker (one JVM) is adequate for
the task?
Also, how are these small files read and processed? Is it ...
Thanks for the input. What I am interested in is how to have multiple
workers read and process the small files in parallel, and certainly
one file per worker at a time. Partitioning the data frame doesn't make
sense since the data frame is small already.
On 10/15/20 9:14 AM, Lalwani, Jayesh wrote:
Parallelism of streaming depends on the input source. If you are getting one
small file per microbatch, then Spark will read it in one worker. You can
always repartition your data frame after reading it to increase the parallelism.
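For illustration, a minimal sketch of that repartition-after-read pattern on a streaming DataFrame (assuming PySpark; the schema, paths, and partition count are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("repartition-after-read").getOrCreate()
schema = StructType([StructField("line", StringType())])

# Each micro-batch may arrive as one small file read by a single task,
# so repartition the streaming DataFrame to fan the rows out to more workers.
stream_df = (spark.readStream
             .schema(schema)
             .csv("/data/stream-source")   # placeholder directory
             .repartition(16))             # illustrative partition count

query = (stream_df.writeStream
         .format("parquet")
         .option("path", "/data/output")
         .option("checkpointLocation", "/data/checkpoint")
         .start())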
On 10/14/20, 11:26 PM, "Artemis User" wrote: