Hey everyone,
I have a Hive table backed by a large number of small Parquet files. I
create a DataFrame from it to do some processing, but because there are so
many input splits/files, the job launches far more tasks than I want.
What I am after is the behavior Hive provides: combining small input
splits into larger ones by specifying a maximum split size. Is this
currently possible with Spark?
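
To make the behavior I am after concrete, here is a minimal sketch in plain
Python. It is not Hive or Spark code: the function name and parameters are
hypothetical, and it ignores things a real implementation would consider,
such as data locality and splitting a single oversized file.

```python
def combine_splits(file_sizes, max_split_bytes):
    """Greedily pack consecutive files into splits of at most max_split_bytes.

    file_sizes: sizes of the small input files, in bytes.
    Returns a list of splits, each a list of the file sizes it contains.
    A single file larger than max_split_bytes gets a split of its own.
    """
    if max_split_bytes <= 0:
        raise ValueError("max_split_bytes must be positive")

    splits = []
    current, current_bytes = [], 0
    for size in file_sizes:
        # Close the current split if adding this file would exceed the cap.
        if current and current_bytes + size > max_split_bytes:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        splits.append(current)
    return splits
```

With this, the number of tasks would track the total data size divided by
the maximum split size rather than the number of files.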

While exploring whether coalesce could help, I hit another issue. With
coalesce I can only control the number of output files, not their sizes.
Since the total input size can vary significantly in my case, a fixed
partition count does not work: with a large input, each output can get
very large. I looked for a way to get the total input size from an RDD so
that I could derive the partition count heuristically, but I could not
find one.
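
The heuristic I have in mind would look roughly like the sketch below,
assuming the total input size in bytes were available (which is the part I
have not been able to get). The function name, its parameters, and the
128 MB target are illustrative assumptions, not Spark APIs or settings.

```python
def partition_count(total_input_bytes, target_partition_bytes, min_partitions=1):
    """Pick a partition count so each partition holds about target_partition_bytes."""
    if target_partition_bytes <= 0:
        raise ValueError("target_partition_bytes must be positive")
    if total_input_bytes < 0:
        raise ValueError("total_input_bytes must be non-negative")
    # Integer ceiling division, so a partial partition still counts as one.
    needed = (total_input_bytes + target_partition_bytes - 1) // target_partition_bytes
    return max(min_partitions, needed)


# Hypothetical example: 10 GB of input with a 128 MB target per partition.
n = partition_count(10 * 1024**3, 128 * 1024**2)
```

The result would then be passed to coalesce in place of a fixed count, so
the number of partitions scales with the input size.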

Any help is appreciated.

Thanks,

Nezih
