Hi,

I am using Spark Streaming to write data back into Hive with the below code
snippet:


eventHubsWindowedStream.map(x => EventContent(new String(x)))
  .foreachRDD { rdd =>
    val sparkSession = SparkSession.builder.enableHiveSupport.getOrCreate()
    import sparkSession.implicits._

    rdd.toDS.write.mode(org.apache.spark.sql.SaveMode.Append)
      .insertInto(hiveTableName)
  }

The Hive table is partitioned by year, month, and day, so for some days we end
up with less data, which in turn produces many small files inside Hive. Since
the data is written as small files, there is a significant performance penalty
on Impala/Hive when reading it. Is there a way to merge files while inserting
data into Hive?
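
One workaround I have been considering (just a sketch, reusing the same
EventContent case class and hiveTableName from the snippet above) is to
coalesce each micro-batch down to a small, fixed number of partitions before
the insert, so every batch writes at most that many files into the target Hive
partition:

eventHubsWindowedStream.map(x => EventContent(new String(x)))
  .foreachRDD { rdd =>
    val sparkSession = SparkSession.builder.enableHiveSupport.getOrCreate()
    import sparkSession.implicits._

    // Coalesce to a small number of partitions (4 is only an illustrative
    // value) so each micro-batch produces at most that many output files.
    rdd.toDS
      .coalesce(4)
      .write
      .mode(org.apache.spark.sql.SaveMode.Append)
      .insertInto(hiveTableName)
  }

The coalesce value is a trade-off between write parallelism and file count,
and a separate periodic compaction job on the table would be another option,
but I am not sure either is the right approach here.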

It would also be really helpful if anyone can suggest how to design this in a
better way. We cannot use HBase/Kudu in this scenario due to space issues with
our clusters.

Thanks,

Asmath
