Hi,

I would like to briefly present my use case and gather suggestions from the
community. I am developing a Spark job which reads from and writes to Hive
heavily. Usually, I use 200 executors with 12g memory each and a parallelism
level of 600. The main run of the application consists of the following
phases: read from HDFS, persist, small and simple aggregations, write to
HDFS. These steps are repeated a certain number of times.
When I write to Hive, I aim to have partitions of approximately 50-70 MB,
so before writing I repartition the output into roughly 15 parts (depending
on the data size). The writing phase takes around 1.5 minutes, which means
that for 1.5 minutes only 15 of the 600 possible active tasks are running
in parallel. This looks like a big waste of resources. How would you solve
this problem?
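Concretely, the write step looks roughly like the following; the table name
and the 60 MB target are placeholders, and the size estimate comes from
outside this snippet:

import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch of the write step; the table name and 60 MB target are placeholders.
// estimatedSizeBytes would come from whatever size estimate is available.
def writeToHive(aggregated: DataFrame, estimatedSizeBytes: Long): Unit = {
  val targetPartitionBytes = 60L * 1024 * 1024                    // aim for 50-70 MB files
  val numParts = math.max(1, (estimatedSizeBytes / targetPartitionBytes).toInt)
  aggregated
    .repartition(numParts)                                        // ~15 in my case
    .write
    .mode(SaveMode.Overwrite)
    .saveAsTable("my_db.my_output_table")                         // placeholder Hive table
}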

I am experimenting with the FAIR scheduler and job pools, but it does not
seem to improve things much; for some reason, I cannot get more than 4 jobs
running in parallel. I am still investigating this, and I may provide more
details about it afterwards.
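What I am trying looks roughly like this; the pool name, table names and the
surrounding setup are placeholders, and it assumes the session is built with
spark.scheduler.mode=FAIR plus an allocation file that defines the pool:

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

// Sketch of submitting several write jobs concurrently through a FAIR pool.
// Assumes spark.scheduler.mode=FAIR and an allocation file defining a pool
// called "writes"; names are placeholders.
def writeAllConcurrently(spark: SparkSession, tables: Map[String, DataFrame]): Unit = {
  val futures = tables.map { case (tableName, df) =>
    Future {
      // The pool is a thread-local property, so it has to be set inside the
      // thread that triggers the action.
      spark.sparkContext.setLocalProperty("spark.scheduler.pool", "writes")
      df.repartition(15)
        .write
        .mode(SaveMode.Overwrite)
        .saveAsTable(tableName)
    }
  }
  // Note: the global ExecutionContext is sized to the number of driver cores,
  // which caps how many jobs this sketch can have in flight at once.
  futures.foreach(f => Await.result(f, Duration.Inf))
}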

I would like to know if this use case is normal, what you would do, and
whether in your opinion I am doing something wrong.

Thanks,
Alessandro Liparoti