Re: How to Scale Streaming Application to Multiple Workers

Artemis User Fri, 16 Oct 2020 12:52:21 -0700

We can't use AWS since the target production has to be on-prem. Thereason we choose Spark is because of its ML libraries. Lambda would bea great model for stream processing from a functional programmingperspective. Not sure how well can it be integrated with Spark ML orother ML libraries. Any suggestions would be highly appreciated..

ND


On 10/16/20 2:49 PM, Lalwani, Jayesh wrote:

With a file based source, Spark is going to take maximum use of memorybefore it tries to scaling to more nodes. Parallelization addsoverhead. This overhead is negligible if your data is several gigs orabove. If your entire data can fit into memory of one node, then it’sbetter to process everything in one node. Forcing Spark to parallelizeprocessing that can be done in a single node will reduce throughput.
You are right, though. Spark is overkill for a simple transformationfor a 300KB file. A lot of people implement simple transformationsusing serverless AWS Lambda. Spark’s power comes in when you arejoining streaming sources and/or joining streaming sources with batchsources. It’s not that Spark can’t do simple transformations. It’sperfectly capable of doing it. It make sense to implement simpletransformations in Spark if you have a data pipeline that isimplemented in Spark, and this ingestion is one of many other thingsthat you do with Spark. But, if your entire pipeline consists ofingestion of small files, then you might be better off with simplersolutions.
*From: *Artemis User <arte...@dtechspace.com>
*Date: *Friday, October 16, 2020 at 2:19 PM
*Cc: *user <user@spark.apache.org>
*Subject: *RE: [EXTERNAL] How to Scale Streaming Application toMultiple Workers
*CAUTION*: This email originated from outside of the organization. Donot click links or open attachments unless you can confirm the senderand know the content is safe.
Thank you all for the responses. Basically we were dealing with filesource (not Kafka, therefore no topics involved) and dumping csv files(about 1000 lines, 300KB per file) at a pretty high speed (10 - 15files/second) one at a time to the stream source directory. We have aSpark 3.0.1. cluster configured with 4 workers, each one is allocatedwith 4 cores. We tried numerous options, including setting thespark.streaming.dynamicAllocation.enabled parameter to true, andsetting the maxFilesPerTrigger to 1, but were unable to scale the#cores*#workers >4.
What I am trying to understand is that what makes spark to allocatejobs to more workers? Is it based on the size of the data frame,batch sizes or trigger intervals? Looks like the Spark masterscheduler doesn't consider the number of input files waiting to beprocessed, only consider the data size (i.e. the size of data frames)that has been read or already imported, before allocating newworkers. If that that case, then Spark really missed the point andwasn't really designed for real-time streaming applications. I couldwrite my own stream processor that would distribute the load based onthe number of input files, given the fact, that each batch query isatomic/independent from each other..
Thanks in advance for your comment/input.

ND

On 10/15/20 7:13 PM, muru wrote:

    File streaming in SS, you can try setting "maxFilesPerTrigger" per
    batch. The forEachBatch is an action, the output is written to
    various sinks. Are you doing any post transformation in forEachBatch?

    On Thu, Oct 15, 2020 at 1:24 PM Mich Talebzadeh
    <mich.talebza...@gmail.com <mailto:mich.talebza...@gmail.com>> wrote:

        Hi,

        This in general depends on how many topics you want to process
        at the same time and whether this is done on-premise running
        Spark in cluster mode.

        Have you looked at Spark GUI to see if one worker (one JVM) is
        adequate for the task?

        Also how these small files are read and processed. Is it the
        same data microbatched? Spark streaming does not process one
        event at a time which is in general I think what people call
        "Streaming." It instead processes groups of events. Each group
        is a "MicroBatch" that gets processed at the same time.


        What parameters (BatchInterval, WindowsLength,SlidingInterval)
        are you using?

        Parallelism helps when you have reasonably large data and your
        cores are running on different sections of data in parallel. 
        Roughly how much do you have in every CSV file

        HTH,

        Mich

        LinkedIn
        
///https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw/

        *Disclaimer:* Use it at your own risk.Any and all
        responsibility for any loss, damage or destruction of data or
        any other property which may arise from relying on this
        email's technical content is explicitly disclaimed. The author
        will in no case be liable for any monetary damages arising
        from such loss, damage or destruction.

        On Thu, 15 Oct 2020 at 20:02, Artemis User
        <arte...@dtechspace.com <mailto:arte...@dtechspace.com>> wrote:

            Thanks for the input.  What I am interested is how to have
            multiple
            workers to read and process the small files in parallel,
            and certainly
            one file per worker at a time.  Partitioning data frame
            doesn't make
            sense since the data frame is small already.

            On 10/15/20 9:14 AM, Lalwani, Jayesh wrote:
            > Parallelism of streaming depends on the input source. If
            you are getting one small file per microbatch, then Spark
            will read it in one worker. You can always repartition
            your data frame after reading it to increase the parallelism.
            >
            > On 10/14/20, 11:26 PM, "Artemis User"
            <arte...@dtechspace.com <mailto:arte...@dtechspace.com>>
            wrote:
            >
            >      CAUTION: This email originated from outside of the
            organization. Do not click links or open attachments
            unless you can confirm the sender and know the content is
            safe.
            >
            >
            >
            >      Hi,
            >
            >      We have a streaming application that read
            microbatch csv files and
            >      involves the foreachBatch call.  Each microbatch
            can be processed
            >      independently.  I noticed that only one worker node
            is being utilized.
            >      Is there anyway or any explicit method to
            distribute the batch work load
            >      to multiple workers?  I would think Spark would
            execute foreachBatch
            >      method on different workers since each batch can be
            treated as atomic?
            >
            >      Thanks!
            >
            >      ND
            >
            >
            >
            
---------------------------------------------------------------------
            >      To unsubscribe e-mail:
            user-unsubscr...@spark.apache.org
            <mailto:user-unsubscr...@spark.apache.org>
            >
            >

            
---------------------------------------------------------------------
            To unsubscribe e-mail: user-unsubscr...@spark.apache.org
            <mailto:user-unsubscr...@spark.apache.org>

Re: How to Scale Streaming Application to Multiple Workers

Reply via email to