Ah, sorry for spamming, I found the answer in the documentation. Thank you
for the clarification!

Best regards, Aki Riisiö

On Thu, 27 Jan 2022 at 10:39, Aki Riisiö <aki.rii...@gmail.com> wrote:

> Hello.
>
> Thank you for the reply again. I just checked how many tasks are spawned
> when we read the data from S3, and in the latest run this was a little over
> 23000. What determines the number of tasks during the read? Does it
> correspond directly to the number of files being read?
>
> Thank you.
>
> On Tue, 25 Jan 2022 at 17:35, Sean Owen <sro...@gmail.com> wrote:
>
>> Yes, you will end up with 80 partitions, and if you write the result, you
>> end up with 80 files. If you don't have at least 80 partitions, there is no
>> point in having 80 cores; you will probably see 56 of them idle even under
>> load. The partitionBy might end up causing the whole job to have more
>> partitions anyway. I would settle this by actually watching how many tasks
>> the streaming job spawns. Is it 1, 24, more?
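>>
>> For a batch DataFrame you can also print the partition count just before
>> the write and compare it with the task count in the UI (a minimal sketch;
>> df stands in for whatever DataFrame you are saving):
>>
>>     # number of partitions here = number of tasks the write stage will run
>>     print(df.rdd.getNumPartitions())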
>>
>> On Tue, Jan 25, 2022 at 7:57 AM Aki Riisiö <aki.rii...@gmail.com> wrote:
>>
>>> Thank you for the reply.
>>> The stream is partitioned by year/month/day/hour, and we read the data
>>> once a day, so we are reading 24 partitions.
>>>
>>> " A crude rule of thumb is to have 2-3x as many tasks as cores" thank
>>> you very much, I will set this as default. Will this however change, if we
>>> also partition the data by year/month/day/hour? If I set:
>>> df.repartition(80),write ... partitionBy("year", "month", "day",
>>> "hour"), will this cause each hour to have 80 output files?
>>>
>>> The output data in a "normal" run is very small, so a high repartition count
>>> would result in a large number of very small files.
>>> I am not sure how Glue autoscales itself, but I definitely need to look
>>> into that a bit more.
>>>
>>> One of our jobs actually has a requirement to produce only one output file,
>>> so is repartition(1) the only way to achieve that? As I understand it,
>>> this is a major performance issue.
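>>>
>>> That job's write would then be roughly (a sketch; the path is a placeholder):
>>>
>>>     # everything is shuffled into one partition, so the write runs as a single task
>>>     df.repartition(1).write.parquet("s3://<bucket>/<single-file-output>/")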
>>>
>>> Thank you!
>>>
>>>
>>> On Tue, 25 Jan 2022 at 15:29, Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> How many partitions does the stream have? With 80 cores, you need at
>>>> least 80 tasks to even take advantage of them, so if it's less than 80, at
>>>> least .repartition(80). A crude rule of thumb is to have 2-3x as many tasks
>>>> as cores, to help even out differences in task size by more finely
>>>> distributing the work. You might even go for more. I'd watch the task
>>>> length, and as long as the tasks aren't completing in a few seconds or
>>>> less, you probably don't have too many.
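>>>>
>>>> For example, something like this (a rough sketch, using your 80 cores):
>>>>
>>>>     total_cores = 80                       # e.g. 5 executors x 16 cores
>>>>     df = df.repartition(total_cores * 2)   # 2-3x cores; here 160 tasks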
>>>>
>>>> This is also a good reason to use autoscaling, so that when not busy
>>>> you can (for example) scale down to 1 executor, but under load, scale up to
>>>> 10 or 20 machines if needed. That is also a good reason to repartition
>>>> more, so that it's possible to take advantage of more parallelism when
>>>> needed.
>>>>
>>>> On Tue, Jan 25, 2022 at 7:07 AM Aki Riisiö <aki.rii...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello.
>>>>>
>>>>> We have a very simple AWS Glue job running with Spark: get some events
>>>>> from a Kafka stream, do minor transformations, and write to S3.
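>>>>>
>>>>> In very simplified form the job looks roughly like this (a sketch only;
>>>>> broker, topic, path, and the n in repartition(n) are placeholders, and the
>>>>> real job runs with Glue's own job setup and transformations):
>>>>>
>>>>>     events = (spark.read.format("kafka")
>>>>>               .option("kafka.bootstrap.servers", "<brokers>")
>>>>>               .option("subscribe", "<topic>")
>>>>>               .load())
>>>>>     # "minor transformations": decode the Kafka value into our event columns
>>>>>     transformed = events.selectExpr("CAST(value AS STRING) AS event")
>>>>>     transformed.repartition(n).write.parquet("s3://<bucket>/<prefix>/")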
>>>>>
>>>>> Recently, there was a change in the Kafka topic which suddenly increased
>>>>> our data size 10x, and at the same time we were testing different
>>>>> repartition values in df.repartition(n).write ...
>>>>> At the time Kafka started sending the increased volume of data, we didn't
>>>>> actually have the repartition value set in our write.
>>>>> Suddenly, our Glue job (or save at NativeMethodAccessorImpl.java:0)
>>>>> jumped from 2h to 10h. Here are some details of the save stage from the
>>>>> Spark UI:
>>>>> - Only 5 executors, each of which can run 16 tasks in parallel
>>>>> - 10500 tasks (job is still running...) with median duration = 2.6 min
>>>>> and median GC time = 2 s
>>>>> - Input size per executor is 9 GB and output is 4.5 GB
>>>>> - Executor memory is 20 GB
>>>>>
>>>>> My question: now that we're trying to find a proper value for
>>>>> repartition, what would be the optimal value here? Our data volume was not
>>>>> expected to go this high, but there are times when it might. As this job
>>>>> is running in AWS Glue, should I also consider setting the number of
>>>>> executors, cores, and memory manually? I think Glue actually sets those
>>>>> based on the Glue job configuration. Yes, this is probably not your
>>>>> concern, but still, worth a shot :)
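>>>>>
>>>>> For reference, outside of Glue these would be the usual knobs (a sketch
>>>>> with illustrative values; in Glue they are normally derived from the
>>>>> worker type and number of workers in the job configuration):
>>>>>
>>>>>     from pyspark.sql import SparkSession
>>>>>
>>>>>     # standard Spark settings for executor count, cores, and memory
>>>>>     spark = (SparkSession.builder
>>>>>              .config("spark.executor.instances", "10")
>>>>>              .config("spark.executor.cores", "4")
>>>>>              .config("spark.executor.memory", "20g")
>>>>>              .getOrCreate())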
>>>>>
>>>>> Thank you!
>>>>>
>>>>>
