Hi folks, I have an ETL pipeline that drops a file every half hour. When Spark reads these files, I end up with 315K tasks for a dataframe reading a few days' worth of data.
I know that with a regular Spark job I can use coalesce to get down to a lower number of tasks. Is there a way to tell HiveThriftServer2 to coalesce? I have a line in hive-conf that says to use CombineHiveInputFormat, but I'm not sure it's working. (Obviously having fewer, larger files would be better, but I don't control the file-generation side of this.) Tips much appreciated.
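For reference, the hive-conf entries I'm talking about look roughly like this. This is just a sketch: the split-size value is an arbitrary example, not something I've verified works through the Thrift server.

```xml
<!-- hive-site.xml: combine many small input files into fewer splits -->
<property>
  <name>hive.input.format</name>
  <value>org.apache.hadoop.hive.ql.io.CombineHiveInputFormat</value>
</property>
<property>
  <!-- upper bound per combined split; 256 MB here is an example value to tune -->
  <name>mapred.max.split.size</name>
  <value>268435456</value>
</property>
```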