Hi, I would like to briefly present my use case and gather suggestions from the community. I am developing a Spark job that reads from and writes to Hive heavily. I usually run it with 200 executors of 12g memory each and a parallelism level of 600. The main run of the application consists of a few phases: read from HDFS, persist, small and simple aggregations, write to HDFS. These steps are repeated a certain number of times.

When I write to Hive, I aim for output partitions of approximately 50-70 MB, so before writing I repartition the data into roughly 15 partitions (depending on the data size). The writing phase takes around 1.5 minutes, which means that for those 1.5 minutes only 15 of the 600 possible active tasks are running in parallel. This looks like a big waste of resources. How would you solve this problem?
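To make the write phase concrete, it looks roughly like the sketch below. The table name, the exact write call, and the save mode are simplified placeholders (the real code computes the partition count from the data size):

```scala
import org.apache.spark.sql.DataFrame

// Rough sketch of the write phase described above; names and the write API
// are placeholders, and targetPartitions is really derived from the data size.
def writePhase(df: DataFrame): Unit = {
  val targetPartitions = 15 // aiming for ~50-70 MB per output file

  df.repartition(targetPartitions) // only 15 tasks are active during this stage
    .write
    .mode("overwrite")             // mode is an assumption, could be append
    .saveAsTable("my_db.my_table") // real code may use insertInto or a Hive-specific writer
}
```

While that stage runs, the remaining ~585 task slots sit idle, which is the waste I would like to avoid.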
I am also experimenting with the FAIR scheduler and job pools, but so far it does not seem to improve things much; for some reason, I cannot get more than 4 jobs running in parallel. I am still investigating this option and may follow up with more details later; I have pasted a rough sketch of what I am trying at the bottom of this mail.

I would like to know whether this use case is normal, what you would do, and whether in your opinion I am doing something wrong.

Thanks,
*Alessandro Liparoti*
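P.S. This is roughly the kind of FAIR-scheduler setup I am experimenting with. The pool name, table names, and the way the writes are spawned from separate threads are simplified assumptions, not my exact code:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Enable the FAIR scheduler; a custom fairscheduler.xml can be pointed to via
// spark.scheduler.allocation.file if pools need specific weights/minShares.
val spark = SparkSession.builder()
  .appName("hive-writer")
  .config("spark.scheduler.mode", "FAIR")
  .getOrCreate()

// Each write is submitted from its own thread so the resulting jobs can overlap;
// the local property assigns the jobs triggered by this thread to a pool.
def writeInPool(df: DataFrame, table: String, pool: String): Future[Unit] = Future {
  spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool)
  df.repartition(15).write.mode("overwrite").saveAsTable(table)
}

// Example usage with two independent outputs (df1/df2 are placeholders):
// val writes = Seq(writeInPool(df1, "my_db.t1", "pool1"),
//                  writeInPool(df2, "my_db.t2", "pool2"))
// writes.foreach(f => Await.result(f, Duration.Inf))
```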