Re: How to Scale Streaming Application to Multiple Workers

2020-10-16 Thread Artemis User
Apparently the number set in maxFilesPerTrigger doesn't have any effect on scaling at all.  Again, if all file reading is done by a single node, then Spark streaming isn't really designed for real-time processing at all, because that single node becomes a bottleneck...
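For reference, a minimal sketch of how maxFilesPerTrigger is typically set on a file source; the path, schema, and trigger value below are illustrative assumptions, not taken from this thread:

    // Minimal sketch of a file-source stream with maxFilesPerTrigger set.
    // The directory path, schema, and value 10 are placeholders.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.appName("file-stream-sketch").getOrCreate()

    val schema = new StructType()
      .add("id", StringType)
      .add("value", DoubleType)

    val fileStream = spark.readStream
      .format("csv")
      .schema(schema)                      // file sources require an explicit schema
      .option("maxFilesPerTrigger", 10)    // caps how many new files each micro-batch consumes
      .load("/data/incoming")              // hypothetical landing directory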

Re: How to Scale Streaming Application to Multiple Workers

2020-10-16 Thread Artemis User
That was exactly my question: whether Spark can do a parallel read, not data-frame-driven parallel query or processing, because our ML query is very simple, but the data ingestion part seems to be the bottleneck.  Can someone confirm that Spark just can't do a parallel read?  If not, what would b

Re: How to Scale Streaming Application to Multiple Workers

2020-10-16 Thread Lalwani, Jayesh
Once you are talking about ML, you aren't talking about "simple" transformations. Spark is a good platform to do ML on. You can easily configure Spark to read your data on one node, and then run the ML transformations in parallel.
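As a rough illustration of that pattern, assuming a SparkSession named spark is already in scope; the input path and column names f1, f2, f3 are assumptions, not from the thread:

    // Sketch: read the data wherever the source allows, then repartition so the
    // ML transformation runs across all executors. Columns are hypothetical.
    import org.apache.spark.ml.feature.VectorAssembler

    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")          // so numeric columns come back as doubles
      .csv("/data/incoming")                  // hypothetical input path

    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2", "f3"))  // placeholder numeric feature columns
      .setOutputCol("features")

    // transform() is applied per partition, so spreading the partitions spreads the work.
    val featured = assembler.transform(raw.repartition(spark.sparkContext.defaultParallelism))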

Re: How to Scale Streaming Application to Multiple Workers

2020-10-16 Thread Artemis User
We can't use AWS since the target production environment has to be on-prem. The reason we chose Spark is its ML libraries.  Lambda would be a great model for stream processing from a functional programming perspective.  Not sure how well it can be integrated with Spark ML or other ML libraries.

Re: How to Scale Streaming Application to Multiple Workers

2020-10-16 Thread Lalwani, Jayesh
With a file-based source, Spark is going to make maximum use of memory before it tries to scale to more nodes. Parallelization adds overhead. This overhead is negligible if your data is several gigs or above. If your entire data can fit into the memory of one node, then it's better to process ever

Re: How to Scale Streaming Application to Multiple Workers

2020-10-16 Thread Artemis User
Thank you all for the responses.  Basically we were dealing with a file source (not Kafka, therefore no topics involved) and dumping CSV files (about 1000 lines, 300 KB per file) at a pretty high speed (10 - 15 files/second), one at a time, into the stream source directory.  We have a Spark 3.0.1 clu

Re: How to Scale Streaming Application to Multiple Workers

2020-10-15 Thread muru
For file streaming in Structured Streaming, you can try setting "maxFilesPerTrigger" per batch. foreachBatch is an action; the output is written to various sinks. Are you doing any post-transformation in foreachBatch?
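For reference, a minimal sketch of a foreachBatch sink with a post-transformation; the schema, directories, aggregation, and output format are all illustrative assumptions:

    // Sketch: per-micro-batch post-transformation inside foreachBatch.
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.appName("foreach-batch-sketch").getOrCreate()

    val schema = new StructType().add("sensor", StringType).add("reading", DoubleType)

    val csvStream = spark.readStream
      .format("csv")
      .schema(schema)
      .option("maxFilesPerTrigger", 10)
      .load("/data/stream-source")              // hypothetical source directory

    // An explicitly typed function value keeps the Scala overload of foreachBatch unambiguous.
    val processBatch: (DataFrame, Long) => Unit = (batch, batchId) => {
      val counts = batch.groupBy("sensor").count()                          // hypothetical post-transformation
      counts.write.mode("append").parquet(s"/data/output/batch_id=$batchId") // hypothetical sink
    }

    val query = csvStream.writeStream
      .foreachBatch(processBatch)
      .option("checkpointLocation", "/data/checkpoints/foreach-batch")  // recommended for restartability
      .start()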

Re: How to Scale Streaming Application to Multiple Workers

2020-10-15 Thread Mich Talebzadeh
Hi, This in general depends on how many topics you want to process at the same time and whether this is done on-premise running Spark in cluster mode. Have you looked at the Spark GUI to see if one worker (one JVM) is adequate for the task? Also, how are these small files read and processed? Is it th

Re: How to Scale Streaming Application to Multiple Workers

2020-10-15 Thread Artemis User
Thanks for the input.  What I am interested in is how to have multiple workers read and process the small files in parallel, and certainly one file per worker at a time.  Partitioning the data frame doesn't make sense since the data frame is small already.

Re: How to Scale Streaming Application to Multiple Workers

2020-10-15 Thread Lalwani, Jayesh
Parallelism of streaming depends on the input source. If you are getting one small file per micro-batch, then Spark will read it in one worker. You can always repartition your data frame after reading it to increase the parallelism.
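For reference, a rough sketch of that repartition-after-read suggestion, assuming a SparkSession named spark is already in scope; the path, schema, and partition count are assumptions:

    // Sketch: increase downstream parallelism for a small-file source by
    // repartitioning the streaming DataFrame right after it is read.
    import org.apache.spark.sql.types._

    val schema = new StructType().add("sensor", StringType).add("reading", DoubleType)

    val parallelStream = spark.readStream
      .format("csv")
      .schema(schema)
      .load("/data/stream-source")    // hypothetical directory the small files land in
      .repartition(8)                 // spread each micro-batch across more tasks/workers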