Hi folks, I have an ETL pipeline that drops a file every half hour. When
Spark reads these files, I end up with 315K tasks for a dataframe reading a
few days' worth of data.

I know that with a regular Spark job I can use coalesce to get down to a
lower number of tasks. Is there a way to tell HiveThriftServer2 to coalesce?
I have a line in hive-conf that says to use CombinedInputFormat, but I'm not
sure it's working.
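For reference, here is the kind of setting I've been looking at. These are Spark SQL's small-file packing knobs (my assumption is that they apply in the Thrift Server JVM the same as in a regular job, since they only affect file-based data sources; they may not cover Hive SerDe tables, and the values below are just illustrative):

```
# spark-defaults.conf for the Thrift Server JVM (values are illustrative)
spark.sql.files.maxPartitionBytes   134217728   # pack small files into ~128 MB read partitions
spark.sql.files.openCostInBytes     4194304     # per-file open cost; higher biases toward fewer, larger partitions
```

If those applied, a few days of half-hourly files would be grouped into far fewer read partitions instead of one task per file, but I haven't confirmed whether the Thrift Server honors them for my tables.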

(Obviously having fewer, larger files would be better, but I don't control
the file generation side of this.)

Tips much appreciated
