Ah... makes sense, thank you. I tried sortWithinPartitions before and
replaced it with sort. That was a mistake.
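
For reference, a minimal sketch of the sortWithinPartitions variant. It assumes a local SparkSession, made-up sample data, and a hypothetical output path under /tmp; only the column name and the repartition(5) call come from the code quoted below. Since sortWithinPartitions sorts each partition locally and does not shuffle, the 5 partitions from repartition should be kept, giving at most 5 part files.

```scala
import org.apache.spark.sql.SparkSession

object FiveFilesSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: local session, for illustration only.
    val spark = SparkSession.builder()
      .appName("five-files-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data with a heavily repeated string column.
    val dataset = (1 to 1000)
      .map(i => (i, s"repeated_value_${i % 10}"))
      .toDF("id", "long_repeated_string_in_this_column")

    // Hypothetical output location.
    val outputPath = "/tmp/five-files-sketch"

    dataset
      .repartition(5)                                               // shuffle into 5 partitions
      .sortWithinPartitions("long_repeated_string_in_this_column")  // local sort, no further shuffle
      .write
      .mode("overwrite")
      .parquet(outputPath)

    spark.stop()
  }
}
```

Note that this orders rows only inside each file, not across files, which is enough if the goal is just better compression of the repeated strings.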

On Thu, 25 Feb 2021 at 15:25, Pietro Gentile <
pietro.gentile89.develo...@gmail.com> wrote:

> Hi,
>
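> It is because *repartition* is called before *sort*. The global sort
> shuffles the data again, into up to spark.sql.shuffle.partitions
> partitions (200 by default), so the 5 partitions from repartition are
> lost. If you reverse them you'll see 5 output files.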
>
> Regards,
> Pietro
>
> On Wed, 24 Feb 2021 at 16:43, Ivan Petrov <capacyt...@gmail.com>
> wrote:
>
>> Hi, I'm trying to control the size and/or count of Spark output files.
>>
>> Here is my code. I expect to get 5 files, but I get dozens of small files.
>> Why?
>>
>> dataset
>>   .repartition(5)
>>   // should be better compressed with snappy
>>   .sort("long_repeated_string_in_this_column")
>>   .write
>>   .parquet(outputPath)
>>
>
