If I understand correctly, you have a source location where files are dropped and never removed? If that is the case, you may want to keep track of which files your program has already processed and read only the "new" files.
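A rough sketch of that bookkeeping (untested; the directory and ledger paths are hypothetical, and I'm assuming Parquet files and an existing SparkSession named `spark`):

import java.nio.file.{Files, Paths, StandardOpenOption}
import scala.collection.JavaConverters._
import scala.io.Source

// Hypothetical locations; adjust to your layout.
val dropDir    = "/data/etl/drops"
val ledgerPath = "/data/etl/processed-files.txt"

// Names of files already handled on previous runs.
val seen: Set[String] =
  if (Files.exists(Paths.get(ledgerPath)))
    Source.fromFile(ledgerPath).getLines().toSet
  else Set.empty

// Only the files we have not seen before.
val newFiles = Files.list(Paths.get(dropDir)).iterator().asScala
  .map(_.toString)
  .filterNot(seen)
  .toList

if (newFiles.nonEmpty) {
  // Read just the new files.
  val df = spark.read.parquet(newFiles: _*)
  // ... process df ...

  // Append the new names to the ledger so the next run skips them.
  Files.write(
    Paths.get(ledgerPath),
    newFiles.mkString("", "\n", "\n").getBytes,
    StandardOpenOption.CREATE, StandardOpenOption.APPEND)
}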
On 3 Aug 2016 22:03, "Yana Kadiyska" <yana.kadiy...@gmail.com> wrote:

> Hi folks, I have an ETL pipeline that drops a file every 1/2 hour. When
> Spark reads these files, I end up with 315K tasks for a dataframe reading
> a few days' worth of data.
>
> I know that with a regular Spark job I can use coalesce to come down to a
> lower number of tasks. Is there a way to tell HiveThriftServer2 to
> coalesce? I have a line in hive-conf that says to use CombineInputFormat,
> but I'm not sure it's working.
>
> (Obviously, having fewer large files is better, but I don't control the
> file-generation side of this.)
>
> Tips much appreciated
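For the regular-job case mentioned above, a minimal sketch of the coalesce approach (the paths are hypothetical, Parquet is an assumption, and `spark` is an existing SparkSession):

// Reading many small half-hourly files yields roughly one task per split.
val df = spark.read.parquet("/data/etl/drops/*")

// coalesce(n) merges the many small read partitions into n larger ones
// without a shuffle; pick n from total data volume, not file count.
df.coalesce(64)
  .write.mode("overwrite")
  .parquet("/data/etl/compacted")

On Spark 2.x, the settings spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes also influence how many small files get packed into a single read partition, and they should apply to the Thrift server too since it runs on the same SQL engine, but only if the table goes through Spark's native Parquet reader rather than the Hive SerDe path. I have not verified this against the 315K-task case.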