Re: How to Scale Streaming Application to Multiple Workers

2020-10-16 Thread Artemis User
Apparently the number set in maxFilesPerTrigger doesn't have any effect on scaling at all.  Again, if all file reading is done by a single node, then Spark streaming isn't really designed for real-time processing at all, because that single node becomes a bottleneck...
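For reference, a minimal sketch (in Scala; the directory path and schema are hypothetical) of the option under discussion. maxFilesPerTrigger only caps how many new files each micro-batch consumes; it does not by itself spread the reading across workers:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.appName("file-stream").getOrCreate()

    // Streaming file sources require an explicit schema
    val schema = new StructType()
      .add("id", LongType)
      .add("value", DoubleType)

    val stream = spark.readStream
      .format("csv")
      .option("header", "true")
      .option("maxFilesPerTrigger", "20")  // files picked up per micro-batch
      .schema(schema)
      .load("/data/landing")               // hypothetical source directory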

Re: How to Scale Streaming Application to Multiple Workers

2020-10-16 Thread Artemis User
That's exactly what my question was: whether Spark can do a parallel read, not data-frame-driven parallel query or processing. Our ML query is very simple, but the data ingestion part seems to be the bottleneck.  Can someone confirm that Spark just can't do parallel reads?  If not, what would be...
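One way to check how parallel the ingest actually is (a diagnostic sketch, not a fix; it reuses the stream DataFrame from the sketch above): log the partition count of each micro-batch. A count of 1 would mean every batch arrives as a single partition:

    import org.apache.spark.sql.DataFrame

    val query = stream.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // Rough proxy for read parallelism within a micro-batch
        println(s"batch $batchId -> ${batch.rdd.getNumPartitions} partitions")
      }
      .start()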

Re: How to Scale Streaming Application to Multiple Workers

2020-10-16 Thread Lalwani, Jayesh
Once you are talking about ML, you aren't talking about "simple" transformations. Spark is a good platform to do ML on. You can easily configure Spark to read your data on one node and then run the ML transformations in parallel.
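A sketch of that split, assuming a previously fitted org.apache.spark.ml.PipelineModel named model (and reusing the stream DataFrame from earlier): the file read may land on one node, but repartitioning before the ML stage lets the transform fan out across executors:

    // 'model' is an already-fitted PipelineModel (assumed); 48 is an arbitrary width
    val scored = model.transform(stream.repartition(48))

    val mlQuery = scored.writeStream
      .format("parquet")
      .option("path", "/data/scored")               // hypothetical sink
      .option("checkpointLocation", "/chk/scored")  // hypothetical checkpoint dir
      .start()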

Re: How to Scale Streaming Application to Multiple Workers

2020-10-16 Thread Artemis User
We can't use AWS since the target production environment has to be on-prem. The reason we chose Spark is its ML libraries.  Lambda would be a great model for stream processing from a functional programming perspective.  Not sure how well it can be integrated with Spark ML or other ML libraries.

Re: How to Scale Streaming Application to Multiple Workers

2020-10-16 Thread Lalwani, Jayesh
With a file-based source, Spark is going to make maximum use of memory before it tries scaling to more nodes. Parallelization adds overhead. This overhead is negligible if your data is several gigs or above. If your entire data can fit into the memory of one node, then it's better to process everything...
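Two knobs that influence how file data is split into partitions, as a tuning sketch; whether they help here depends on the overhead tradeoff just described, and they apply to file scans generally rather than to streaming specifically:

    // Smaller splits -> more tasks -> more potential parallelism,
    // but also more of the per-task overhead mentioned above.
    spark.conf.set("spark.sql.files.maxPartitionBytes", "8MB")  // default 128MB
    spark.conf.set("spark.sql.files.openCostInBytes", "1MB")    // default 4MB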

Re: How to Scale Streaming Application to Multiple Workers

2020-10-16 Thread Artemis User
Thank you all for the responses.  Basically we were dealing with a file source (not Kafka, therefore no topics involved) and dumping CSV files (about 1000 lines, 300KB per file) at a pretty high speed (10-15 files/second) one at a time to the stream source directory.  We have a Spark 3.0.1 cluster...
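For scale, the quoted figures work out to roughly:

    300 KB/file x 10-15 files/s  ~  3-4.5 MB/s  ~  180-270 MB/min

which is well under the "several gigs" threshold mentioned earlier, consistent with Spark keeping the work on a single node.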

Disabling locality for dynamic allocation on Yarn

2020-10-16 Thread Kimahriman
Is there a way to disable locality-based container allocation for YARN dynamic allocation? The issue we're running into is that we have several long-running structured streaming jobs, all using dynamic allocation so they free up resources in between batches. However, if we have a lot of data...
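One workaround sometimes tried (an assumption, not a confirmed answer to the container-allocation question) is zeroing out locality wait. Note this governs task scheduling, not which hosts YARN containers are requested on, so it may only partially address the behavior described above:

    import org.apache.spark.SparkConf

    // Set before the SparkContext starts; relaxes task-level locality only
    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.locality.wait", "0s")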