Ah, sorry for spamming, I found the answer in the documentation. Thank you for the clarification!
Best regards,
Aki Riisiö

On Thu, 27 Jan 2022 at 10:39, Aki Riisiö <aki.rii...@gmail.com> wrote:

> Hello.
>
> Thank you for the reply again. I just checked how many tasks are spawned
> when we read the data from S3, and in the latest run this was a little
> over 23000. What determines the number of tasks during the read? Is it
> directly related to the number of files to be read?
>
> Thank you.
>
> On Tue, 25 Jan 2022 at 17:35, Sean Owen <sro...@gmail.com> wrote:
>
>> Yes, you will end up with 80 partitions, and if you write the result,
>> you end up with 80 files. If you don't have at least 80 partitions,
>> there is no point in having 80 cores. You will probably see 56 of them
>> idle even under load. The partitionBy might end up causing the whole job
>> to have more partitions anyway. I would settle this by actually watching
>> how many tasks the streaming job spawns. Is it 1, 24, more?
>>
>> On Tue, Jan 25, 2022 at 7:57 AM Aki Riisiö <aki.rii...@gmail.com> wrote:
>>
>>> Thank you for the reply.
>>> The stream is partitioned by year/month/day/hour, and we read the data
>>> once a day, so we are reading 24 partitions.
>>>
>>> "A crude rule of thumb is to have 2-3x as many tasks as cores" - thank
>>> you very much, I will set this as the default. Will this change,
>>> however, if we also partition the data by year/month/day/hour? If I set
>>> df.repartition(80).write ... partitionBy("year", "month", "day",
>>> "hour"), will this cause each hour to have 80 output files?
>>>
>>> The output data in a "normal" run is very small, so a large repartition
>>> value would result in a large number of very small files.
>>> I am not sure how Glue autoscales itself, but I definitely need to look
>>> that up a bit more.
>>>
>>> One of our jobs actually has a requirement to produce only one output
>>> file; is repartition(1) the only way to achieve that? As I understand
>>> it, this is a major performance issue.
>>>
>>> Thank you!
>>>
>>> On Tue, 25 Jan 2022 at 15:29, Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> How many partitions does the stream have? With 80 cores, you need at
>>>> least 80 tasks to even take advantage of them, so if it's fewer than
>>>> 80, at least .repartition(80). A crude rule of thumb is to have 2-3x as
>>>> many tasks as cores, to help even out differences in task size by more
>>>> finely distributing the work. You might even go for more. I'd watch the
>>>> task length, and as long as the tasks aren't completing in a few
>>>> seconds or less, you probably don't have too many.
>>>>
>>>> This is also a good reason to use autoscaling, so that when not busy
>>>> you can (for example) scale down to 1 executor, but under load scale up
>>>> to 10 or 20 machines if needed. That is also a good reason to
>>>> repartition more, so that it's possible to take advantage of more
>>>> parallelism when needed.
>>>>
>>>> On Tue, Jan 25, 2022 at 7:07 AM Aki Riisiö <aki.rii...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello.
>>>>>
>>>>> We have a very simple AWS Glue job running with Spark: get some events
>>>>> from a Kafka stream, do minor transformations, and write to S3.
>>>>>
>>>>> Recently, there was a change in the Kafka topic which suddenly
>>>>> increased our data size tenfold, and at the same time we were testing
>>>>> different repartition values in df.repartition(n).write ...
>>>>> At the time Kafka started sending the increased volume of data, we
>>>>> didn't actually have a repartition value set in our write.
>>>>> Suddenly, our Glue job (or rather the save at
>>>>> NativeMethodAccessorImpl.java:0) jumped from 2h to 10h. Here are some
>>>>> details of the save stage from the Spark UI:
>>>>> - Only 5 executors, each of which can run 16 tasks in parallel
>>>>> - 10500 tasks (job is still running...), with a median duration of
>>>>>   2.6 min and a median GC time of 2 s
>>>>> - Input size per executor is 9 GB and output is 4.5 GB
>>>>> - Executor memory is 20 GB
>>>>>
>>>>> My question: now that we're trying to find a proper value for
>>>>> repartition, what would be the optimal value here? Our data volume was
>>>>> not expected to go this high, but there are times when it might. As
>>>>> this job is running in AWS Glue, should I also consider setting the
>>>>> executor count, cores, and memory manually? I think Glue actually sets
>>>>> those based on the Glue job configuration. Yes, this is probably not
>>>>> your concern, but still, worth a shot :)
>>>>>
>>>>> Thank you!
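
For reference, a minimal PySpark sketch of the write pattern discussed in the
thread, under assumed inputs (the bucket names, the Parquet format, and the app
name are placeholders; the value 80 comes from the thread). It illustrates why a
plain repartition(80) followed by partitionBy can leave up to 80 files under
every hour directory, and one common way to get fewer files per hour.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("repartition-sketch").getOrCreate()

    # Hypothetical source; in the real job the DataFrame comes from the
    # Kafka -> S3 pipeline described above.
    df = spark.read.parquet("s3://example-bucket/input/")

    # repartition(80) hash-shuffles rows into 80 partitions regardless of the
    # output layout, so each year/month/day/hour directory can end up with up
    # to 80 files (one per task that happens to hold rows for that hour).
    (df.repartition(80)
       .write
       .mode("overwrite")
       .partitionBy("year", "month", "day", "hour")
       .parquet("s3://example-bucket/output/"))

    # Repartitioning by the partition columns instead keeps each hour's rows
    # together, which yields far fewer (larger) files per hour, at the cost of
    # less evenly sized tasks.
    (df.repartition("year", "month", "day", "hour")
       .write
       .mode("overwrite")
       .partitionBy("year", "month", "day", "hour")
       .parquet("s3://example-bucket/output-compacted/"))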
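
Separately, a sketch (same placeholder paths) covering two other points raised
in the thread: checking how many partitions, and hence tasks, a read produces,
and the requirement to emit a single output file.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-count-sketch").getOrCreate()

    df = spark.read.parquet("s3://example-bucket/input/")

    # For a file source, the number of partitions is roughly the number of
    # tasks the read stage will spawn; it tracks the number of file splits,
    # not the number of year/month/day/hour directories.
    print(df.rdd.getNumPartitions())

    # For the job that must emit exactly one file: coalesce(1) narrows the
    # existing partitions without a full shuffle, but the write still runs as
    # a single task, so it is only reasonable while the output stays small.
    # If there is no shuffle upstream, coalesce(1) can also drag the preceding
    # stage down to one task; repartition(1) forces a shuffle and keeps the
    # upstream work parallel.
    (df.coalesce(1)
       .write
       .mode("overwrite")
       .parquet("s3://example-bucket/single-file-output/"))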