Hi, I am using Spark Streaming to write data back into Hive with the below code snippet:
eventHubsWindowedStream.map(x => EventContent(new String(x)))
  .foreachRDD(rdd => {
    val sparkSession = SparkSession.builder.enableHiveSupport.getOrCreate
    import sparkSession.implicits._
    rdd.toDS.write.mode(org.apache.spark.sql.SaveMode.Append).insertInto(hiveTableName)
  })

The Hive table is partitioned by year, month, and day, so we end up with less data on some days, which in turn results in many small files inside Hive. Since the data is written as small files, reads from Impala/Hive take a big performance hit. Is there a way to merge files while inserting data into Hive?

It would also be really helpful if anyone could suggest a better way to design this. We cannot use HBase/Kudu in the current scenario due to space issues with our clusters.

Thanks,
Asmath
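
P.S. For reference, one variation I have been experimenting with (just a rough sketch, with a placeholder partition count I picked arbitrarily) is coalescing each micro-batch before the insert, so that every batch writes fewer, larger files:

    eventHubsWindowedStream.map(x => EventContent(new String(x)))
      .foreachRDD(rdd => {
        val sparkSession = SparkSession.builder.enableHiveSupport.getOrCreate
        import sparkSession.implicits._
        // numOutputFiles is an assumed value I chose; it would need tuning per batch volume
        val numOutputFiles = 4
        // coalesce reduces the number of partitions, and therefore output files, per micro-batch
        rdd.toDS.coalesce(numOutputFiles)
          .write.mode(org.apache.spark.sql.SaveMode.Append)
          .insertInto(hiveTableName)
      })

This still does not merge files across batches or across days, so I am not sure it is the right long-term design.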