Hi,

This in general depends on how many topics you want to process at the same time and whether this is done on-premise, running Spark in cluster mode.
Have you looked at the Spark GUI to see whether one worker (one JVM) is adequate for the task? Also, how are these small files read and processed? Is it the same data micro-batched?

Spark Streaming does not process one event at a time, which is in general what people call "streaming". Instead it processes groups of events; each group is a micro-batch that gets processed at the same time. What parameters (BatchInterval, WindowLength, SlidingInterval) are you using?

Parallelism helps when you have reasonably large data and your cores are running on different sections of the data in parallel. Roughly how much data do you have in each CSV file?

HTH,

Mich

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.


On Thu, 15 Oct 2020 at 20:02, Artemis User <arte...@dtechspace.com> wrote:

> Thanks for the input. What I am interested in is how to have multiple
> workers read and process the small files in parallel, certainly with one
> file per worker at a time. Partitioning the data frame doesn't make sense
> since the data frame is small already.
>
> On 10/15/20 9:14 AM, Lalwani, Jayesh wrote:
> > Parallelism of streaming depends on the input source. If you are getting
> > one small file per micro-batch, then Spark will read it in one worker.
> > You can always repartition your data frame after reading it to increase
> > the parallelism.
> >
> > On 10/14/20, 11:26 PM, "Artemis User" <arte...@dtechspace.com> wrote:
> >
> >     Hi,
> >
> >     We have a streaming application that reads micro-batch CSV files and
> >     involves a foreachBatch call. Each micro-batch can be processed
> >     independently. I noticed that only one worker node is being utilized.
> >     Is there any way, or any explicit method, to distribute the batch
> >     workload to multiple workers? I would think Spark would execute the
> >     foreachBatch method on different workers, since each batch can be
> >     treated as atomic.
> >
> >     Thanks!
> >
> >     ND
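
For a concrete starting point, here is a minimal sketch (Scala, Structured Streaming) of the repartition-inside-foreachBatch approach Jayesh suggests above. The application name, schema, input and output paths, checkpoint directory and partition count are placeholders for illustration only, not details taken from the thread:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

object SmallFileStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("small-file-stream")              // placeholder app name
      .getOrCreate()

    // Streaming file sources require an explicit schema.
    val schema = new StructType()
      .add("id", LongType)
      .add("value", DoubleType)

    // Each new CSV file dropped into the directory becomes (part of) a micro-batch.
    val csvStream: DataFrame = spark.readStream
      .schema(schema)
      .csv("/data/incoming")                     // placeholder input directory

    val query = csvStream.writeStream
      .option("checkpointLocation", "/data/chk") // placeholder checkpoint directory
      .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
        // A single small file typically arrives as a single partition, so the whole
        // batch is read and processed on one executor. Repartitioning spreads the
        // rows, and hence the downstream work, across the cluster.
        val spread = batchDF.repartition(8)      // choose a count to match available cores
        spread.write
          .mode("append")
          .parquet(s"/data/out/batch_$batchId")  // placeholder sink
      }
      .start()

    query.awaitTermination()
  }
}

Whether the repartition actually buys anything comes back to Mich's question about volume: if each micro-batch carries only a handful of rows, the shuffle overhead will outweigh the gain, and it only pays off once each batch has enough data to keep several cores busy.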