Hello Team, 

 

I am new to the Spark environment. I have converted a Hive query to Spark Scala, and now I am loading data and doing performance testing. Below are the details on loading 3 weeks of data. The cluster-level average small-file size is set to 128 MB.



1. The new temp table I am loading data into is ORC formatted, as the current
Hive table is stored as ORC.

2. Each partition folder of the Hive table is 200 MB.

3. I am using repartition(1) in the Spark code so that it creates one 200 MB
part file in each partition folder (to avoid the small-file issue). With this,
the job completes in 23 to 26 mins. A minimal sketch of both write paths
follows this list.

4. If I don't use repartition(), the job completes in 12 to 13 mins. But the
problem with this approach is that it creates 800 part files (each <128 MB)
in each partition folder.
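
To make the comparison concrete, here is a minimal sketch of what the two
write paths look like in my code. The source table (db.source_table), target
table (db.temp_table), and partition column (event_date) are hypothetical
placeholders, not the real names in my job:

import org.apache.spark.sql.SparkSession

// Minimal sketch; table and column names below are placeholders.
val spark = SparkSession.builder()
  .appName("hive-to-spark-load")
  .enableHiveSupport()
  .getOrCreate()

// In the real job, df comes from the converted Hive query;
// a plain table read stands in for it here.
val df = spark.table("db.source_table")

// Point 3: repartition(1) pulls all output through a single task, so
// dynamic partitioning writes one ~200 MB part file into each
// partition folder. Job completes in 23-26 mins.
df.repartition(1)
  .write
  .mode("overwrite")
  .format("orc")
  .partitionBy("event_date")
  .saveAsTable("db.temp_table")

// Point 4: default parallelism, no repartition. Job completes in
// 12-13 mins, but each partition folder ends up with ~800 part files,
// all under 128 MB.
df.write
  .mode("overwrite")
  .format("orc")
  .partitionBy("event_date")
  .saveAsTable("db.temp_table")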

 

I am not quite sure how to reduce the processing time and avoid creating small
files at the same time. Could anyone please help me with this situation?






