Hi,

This in general depends on how many topics you want to process at the same time and whether this is done on-premise, running Spark in cluster mode.
Have you looked at the Spark GUI to see whether one worker (one JVM) is adequate for the task? Also, how are these small files read and processed? Is it the same data micro-batched?

Spark Streaming does not process one event at a time, which is in general what people call "streaming". Instead it processes groups of events; each group is a micro-batch that gets processed at the same time. What parameters (BatchInterval, WindowLength, SlidingInterval) are you using?

Parallelism helps when you have reasonably large data and your cores are running on different sections of the data in parallel. Roughly how much data do you have in each CSV file?

HTH,

Mich

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.


On Thu, 15 Oct 2020 at 20:02, Artemis User <arte...@dtechspace.com> wrote:

> Thanks for the input. What I am interested in is how to have multiple
> workers read and process the small files in parallel, certainly with one
> file per worker at a time. Partitioning the data frame doesn't make sense
> since the data frame is small already.
>
> On 10/15/20 9:14 AM, Lalwani, Jayesh wrote:
> > Parallelism of streaming depends on the input source. If you are getting
> > one small file per micro-batch, then Spark will read it in one worker.
> > You can always repartition your data frame after reading it to increase
> > the parallelism.
> >
> > On 10/14/20, 11:26 PM, "Artemis User" <arte...@dtechspace.com> wrote:
> >
> >     Hi,
> >
> >     We have a streaming application that reads micro-batch CSV files and
> >     involves a foreachBatch call. Each micro-batch can be processed
> >     independently. I noticed that only one worker node is being utilized.
> >     Is there any way, or any explicit method, to distribute the batch
> >     workload to multiple workers? I would think Spark would execute the
> >     foreachBatch method on different workers, since each batch can be
> >     treated as atomic.
> >
> >     Thanks!
> >
> >     ND
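
For a concrete starting point, here is a minimal sketch (Scala, Structured Streaming) of the repartition-inside-foreachBatch approach Jayesh suggests above. The application name, schema, input and output paths, checkpoint directory and partition count are placeholders for illustration only, not details taken from the thread:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

object SmallFileStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("small-file-stream")              // placeholder app name
      .getOrCreate()

    // Streaming file sources require an explicit schema.
    val schema = new StructType()
      .add("id", LongType)
      .add("value", DoubleType)

    // Each new CSV file dropped into the directory becomes (part of) a micro-batch.
    val csvStream: DataFrame = spark.readStream
      .schema(schema)
      .csv("/data/incoming")                     // placeholder input directory

    val query = csvStream.writeStream
      .option("checkpointLocation", "/data/chk") // placeholder checkpoint directory
      .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
        // A single small file typically arrives as a single partition, so the whole
        // batch is read and processed on one executor. Repartitioning spreads the
        // rows, and hence the downstream work, across the cluster.
        val spread = batchDF.repartition(8)      // choose a count to match available cores
        spread.write
          .mode("append")
          .parquet(s"/data/out/batch_$batchId")  // placeholder sink
      }
      .start()

    query.awaitTermination()
  }
}

Whether the repartition actually buys anything comes back to Mich's question about volume: if each micro-batch carries only a handful of rows, the shuffle overhead will outweigh the gain, and it only pays off once each batch has enough data to keep several cores busy.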