Hello Asmath,

We had a similar challenge recently.

When you write back to Hive you are creating files on HDFS, and the number of 
files depends on your batch window. If you increase your batch window, say from 
1 minute to 5 minutes, you will end up creating 5x fewer files.
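
In Spark Streaming the batch interval is fixed when the StreamingContext is 
created, so that is where the change would go. A minimal sketch (the app name 
is just a placeholder):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}

    val conf = new SparkConf().setAppName("EventHubsToHive")
    // One micro-batch (and hence one round of output files) every 5 minutes
    // instead of every minute.
    val ssc = new StreamingContext(conf, Minutes(5))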

The other factor is your partitioning: each partition of the micro-batch is 
written out as its own file. For instance, if your Spark application is working 
on 5 partitions, you can repartition (or coalesce) down to 1 before the write, 
which again reduces the number of files by 5x.
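
Applied to a snippet like yours, that could look roughly like the following 
(coalesce avoids the full shuffle that repartition would trigger; the stream, 
case class and table name are taken from your code below):

    import org.apache.spark.sql.{SaveMode, SparkSession}

    eventHubsWindowedStream.map(x => EventContent(new String(x)))
      .foreachRDD { rdd =>
        val sparkSession = SparkSession.builder.enableHiveSupport.getOrCreate
        import sparkSession.implicits._

        // Collapse the micro-batch to a single partition so each batch
        // writes one file per Hive partition instead of one per task.
        rdd.toDS
          .coalesce(1)
          .write
          .mode(SaveMode.Append)
          .insertInto(hiveTableName)
      }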

You can also write to a staging table that holds the small files, and once a 
decent amount of data has accumulated, compact it into large files and load 
those into your final Hive table; a sketch follows.
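
A rough sketch of the compaction step, assuming a hypothetical staging table 
events_staging and final table events_final with the same schema and 
year/month/day partitioning:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder.enableHiveSupport.getOrCreate

    // Pick up one day's worth of accumulated small files from staging.
    val oneDay = spark.table("events_staging")
      .where("year = 2017 AND month = 10 AND day = 29")

    // Rewrite it as a handful of large files in the final table.
    oneDay.coalesce(4)
      .write
      .mode(SaveMode.Append)
      .insertInto("events_final")

    // Afterwards drop the compacted staging partition, e.g.
    // spark.sql("ALTER TABLE events_staging DROP PARTITION (year=2017, month=10, day=29)")

Run something like this on a schedule (daily, for example) and point your 
Impala/Hive queries only at the final table.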

Hope this helps.

Regards
Shiv


> On Oct 29, 2017, at 11:03 AM, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> 
> wrote:
> 
> Hi,
> 
> I am using spark streaming to write data back into hive with the below code 
> snippet
> 
> 
> eventHubsWindowedStream.map(x => EventContent(new String(x)))
>   .foreachRDD(rdd => {
>     val sparkSession = SparkSession.builder.enableHiveSupport.getOrCreate
>     import sparkSession.implicits._
>     rdd.toDS.write.mode(org.apache.spark.sql.SaveMode.Append).insertInto(hiveTableName)
>   })
> 
> 
> The Hive table is partitioned by year, month, day, so we end up getting less 
> data for some days, which in turn results in smaller files inside Hive. Since 
> the data is being written as small files, there is a big performance hit on 
> Impala/Hive when reading it. Is there a way to merge files while inserting 
> data into Hive?
> 
> It would also be really helpful if anyone could provide suggestions on how to 
> design this in a better way. We cannot use HBase/Kudu in this scenario due to 
> space issues with our clusters.
> 
> Thanks,
> 
> Asmath
