Hi, I am using Spark Streaming to write data back into Hive with the below code snippet:
eventHubsWindowedStream.map(x => EventContent(new String(x)))
  .foreachRDD(rdd => {
    val sparkSession = SparkSession.builder.enableHiveSupport.getOrCreate
    import sparkSession.implicits._
    rdd.toDS.write.mode(org.apache.spark.sql.SaveMode.Append).insertInto(hiveTableName)
  })

The Hive table is partitioned by year, month, and day, so we end up with less data on some days, which in turn results in many small files inside Hive. Since the data is written as small files, reads from Impala/Hive take a big performance hit. Is there a way to merge files while inserting data into Hive?

It would also be really helpful if anyone could suggest a better way to design this. We cannot use HBase/Kudu in the current scenario due to space issues with our clusters.

Thanks,
Asmath
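
P.S. For reference, one variation I have been experimenting with (just a rough sketch, with a placeholder partition count I picked arbitrarily) is coalescing each micro-batch before the insert, so that every batch writes fewer, larger files:

    eventHubsWindowedStream.map(x => EventContent(new String(x)))
      .foreachRDD(rdd => {
        val sparkSession = SparkSession.builder.enableHiveSupport.getOrCreate
        import sparkSession.implicits._
        // numOutputFiles is an assumed value I chose; it would need tuning per batch volume
        val numOutputFiles = 4
        // coalesce reduces the number of partitions, and therefore output files, per micro-batch
        rdd.toDS.coalesce(numOutputFiles)
          .write.mode(org.apache.spark.sql.SaveMode.Append)
          .insertInto(hiveTableName)
      })

This still does not merge files across batches or across days, so I am not sure it is the right long-term design.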