Ah, that makes sense, thank you. I tried sortWithinPartitions before and replaced it with sort. That was a mistake.
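
For anyone who finds this thread later, here is a minimal sketch of the variants discussed. As far as I understand it, sort does its own range-partitioning shuffle, and the number of resulting partitions comes from spark.sql.shuffle.partitions (200 by default, possibly fewer with adaptive execution), not from the earlier repartition(5). The sample data, the object name and the /tmp output paths are made up for illustration; only the column name comes from my original code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object OutputFileCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("output-file-count-sketch") // hypothetical app name
      .master("local[*]")                  // assumption: local run for testing
      .getOrCreate()
    import spark.implicits._

    val sortColumn = "long_repeated_string_in_this_column"

    // Hypothetical sample data: 10000 rows with 50 distinct, repeated strings.
    val dataset: DataFrame = (1 to 10000)
      .map(i => (i, s"value_${i % 50}"))
      .toDF("id", sortColumn)

    // Original attempt: sort reshuffles the data after repartition(5),
    // so the partition count is no longer the 5 requested earlier.
    val sortedAfterRepartition = dataset.repartition(5).sort(sortColumn)
    println(s"repartition then sort: ${sortedAfterRepartition.rdd.getNumPartitions} partitions")

    // Variant 1: keep the 5 partitions and sort each one locally (no extra shuffle).
    val sortedWithin = dataset.repartition(5).sortWithinPartitions(sortColumn)
    println(s"repartition then sortWithinPartitions: ${sortedWithin.rdd.getNumPartitions} partitions")
    sortedWithin.write.mode("overwrite").parquet("/tmp/out_sort_within") // hypothetical path

    // Variant 2 (Pietro's suggestion): sort first, then repartition(5).
    // This gives 5 partitions, but the shuffle done by repartition is not
    // guaranteed to keep rows in sorted order.
    val repartitionedAfterSort = dataset.sort(sortColumn).repartition(5)
    println(s"sort then repartition: ${repartitionedAfterSort.rdd.getNumPartitions} partitions")
    repartitionedAfterSort.write.mode("overwrite").parquet("/tmp/out_sort_then_repartition") // hypothetical path

    spark.stop()
  }
}
```

Spark normally writes one file per non-empty partition, so both variants should give 5 files here. Since my goal is better snappy compression of the repeated strings, variant 1 looks like the better fit for me, because each file stays sorted on that column.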

On Thu, 25 Feb 2021 at 15:25, Pietro Gentile <pietro.gentile89.develo...@gmail.com> wrote:

> Hi,
>
> It is because of *repartition* before the *sort* method invocation. If
> you reverse them you'll see 5 output files.
>
> Regards,
> Pietro
>
> On Wed, 24 Feb 2021 at 16:43, Ivan Petrov <capacyt...@gmail.com>
> wrote:
>
>> Hi, I'm trying to control the size and/or count of Spark output.
>>
>> Here is my code. I expect to get 5 files but I get dozens of small files.
>> Why?
>>
>> dataset
>>   .repartition(5)
>>   .sort("long_repeated_string_in_this_column") // should be better compressed with snappy
>>   .write
>>   .parquet(outputPath)
>>