Apparently the value set in maxFilesPerTrigger doesn't have any effect on scaling at all.  Again, if all file reading is done by a single node, then Spark streaming isn't really suited for real-time processing, because that single node becomes a bottleneck...

On 10/16/20 3:47 PM, muru wrote:
You should set maxFilesPerTrigger to more than 1 if you want to process many files per batch; otherwise Spark will process one file at a time. Since the file size is 300KB and you have 4 cores per worker, you should set maxFilesPerTrigger = 4 or more (1 core per file).
Try it out and let me know if it helps.
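
For what it's worth, a minimal PySpark sketch of that suggestion (the schema, paths, app name and the value 16 are placeholders, not taken from this thread):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("csv-stream-sketch").getOrCreate()

    # File sources need an explicit schema; this one is a placeholder.
    schema = StructType([
        StructField("id", StringType()),
        StructField("value", DoubleType()),
    ])

    # Pick up to 16 new files per micro-batch (e.g. 4 workers x 4 cores)
    # instead of a single file per trigger.
    df = (spark.readStream
          .format("csv")
          .option("header", "true")
          .option("maxFilesPerTrigger", 16)
          .schema(schema)
          .load("/path/to/input/dir"))

    query = (df.writeStream
             .format("console")
             .option("checkpointLocation", "/path/to/checkpoint")
             .start())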

On Fri, Oct 16, 2020 at 10:37 AM Artemis User <arte...@dtechspace.com> wrote:

    Thank you all for the responses.  Basically we were dealing with a
    file source (not Kafka, therefore no topics involved) and dumping
    csv files (about 1000 lines, 300KB per file) at a pretty high rate
    (10 - 15 files/second), one at a time, into the stream source
    directory.  We have a Spark 3.0.1 cluster configured with 4
    workers, each allocated 4 cores.  We tried numerous options,
    including setting the spark.streaming.dynamicAllocation.enabled
    parameter to true and setting maxFilesPerTrigger to 1, but were
    unable to get #cores * #workers to scale above 4.
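
    For reference, a rough sketch of that configuration as it would be
    set in code (the app name is a placeholder; only the values named
    above are from our setup):

        from pyspark.sql import SparkSession

        spark = (SparkSession.builder
                 .appName("csv-stream")                                 # placeholder name
                 .config("spark.executor.cores", "4")                   # 4 cores per worker
                 .config("spark.streaming.dynamicAllocation.enabled", "true")
                 .getOrCreate())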

    What I am trying to understand is what makes Spark allocate jobs
    to more workers.  Is it based on the size of the data frame, the
    batch size, or the trigger interval?  It looks like the Spark
    master scheduler doesn't consider the number of input files
    waiting to be processed, only the size of the data (i.e. the size
    of the data frames) that has already been read or imported, before
    allocating new workers.  If that is the case, then Spark really
    missed the point and wasn't designed for real-time streaming
    applications.  I could write my own stream processor that
    distributes the load based on the number of input files, given
    that each batch query is atomic and independent of the others...

    Thanks in advance for your comment/input.

    ND

    On 10/15/20 7:13 PM, muru wrote:
    For file streaming in Structured Streaming, you can try setting
    "maxFilesPerTrigger" per batch.  foreachBatch is an action; its
    output is written to various sinks.  Are you doing any
    post-transformation in foreachBatch?
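
    (For illustration, a minimal PySpark sketch of a foreachBatch sink
    that repartitions each micro-batch before writing; the function
    name, partition count and paths are placeholders, and df is
    assumed to be the streaming DataFrame returned by spark.readStream
    in the earlier sketch:)

        def process_batch(batch_df, batch_id):
            # Spread the micro-batch across more cores before any
            # further transformation or write.
            out = batch_df.repartition(16)
            out.write.mode("append").parquet("/path/to/output")

        query = (df.writeStream
                 .foreachBatch(process_batch)
                 .option("checkpointLocation", "/path/to/checkpoint")
                 .start())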

    On Thu, Oct 15, 2020 at 1:24 PM Mich Talebzadeh
    <mich.talebza...@gmail.com> wrote:

        Hi,

        This in general depends on how many topics you want to
        process at the same time and whether this is done on-premises,
        running Spark in cluster mode.

        Have you looked at Spark GUI to see if one worker (one JVM)
        is adequate for the task?

        Also, how are these small files read and processed?  Is it the
        same data micro-batched?  Spark streaming does not process one
        event at a time, which is in general what people call
        "streaming."  It instead processes groups of events; each
        group is a "micro-batch" that gets processed at the same time.
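
        (As a minimal illustration of micro-batching, a fixed
        processing-time trigger in PySpark; the interval and sink are
        placeholders, and df is assumed to be a streaming DataFrame:)

            query = (df.writeStream
                     .trigger(processingTime="10 seconds")  # a new micro-batch at most every 10s
                     .format("console")
                     .start())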


        What parameters (BatchInterval, WindowLength, SlidingInterval)
        are you using?


        Parallelism helps when you have reasonably large data and
        your cores are working on different sections of the data in
        parallel.  Roughly how much data do you have in each CSV file?


        HTH,


        Mich


        LinkedIn
        https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



        *Disclaimer:* Use it at your own risk. Any and all
        responsibility for any loss, damage or destruction of data or
        any other property which may arise from relying on this
        email's technical content is explicitly disclaimed. The
        author will in no case be liable for any monetary damages
        arising from such loss, damage or destruction.



        On Thu, 15 Oct 2020 at 20:02, Artemis User
        <arte...@dtechspace.com> wrote:

            Thanks for the input.  What I am interested is how to
            have multiple
            workers to read and process the small files in parallel,
            and certainly
            one file per worker at a time.  Partitioning data frame
            doesn't make
            sense since the data frame is small already.

            On 10/15/20 9:14 AM, Lalwani, Jayesh wrote:
            > Parallelism of streaming depends on the input source.  If
            > you are getting one small file per microbatch, then Spark
            > will read it in one worker.  You can always repartition
            > your data frame after reading it to increase the
            > parallelism.
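
            (A minimal PySpark sketch of that repartition suggestion;
            the partition count is a placeholder, and df is assumed to
            be the streaming DataFrame returned by spark.readStream:)

                repartitioned = df.repartition(16)   # spread rows across 16 tasks
                query = (repartitioned.writeStream
                         .format("console")
                         .start())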
            >
            > On 10/14/20, 11:26 PM, "Artemis User"
            > <arte...@dtechspace.com> wrote:
            >
            >
            >      Hi,
            >
            >      We have a streaming application that reads
            >      microbatch csv files and involves a foreachBatch
            >      call.  Each microbatch can be processed
            >      independently.  I noticed that only one worker node
            >      is being utilized.  Is there any way, or any
            >      explicit method, to distribute the batch workload
            >      to multiple workers?  I would think Spark would
            >      execute the foreachBatch method on different
            >      workers, since each batch can be treated as atomic?
            >
            >      Thanks!
            >
            >      ND
            >

            
