Hello Team,

I am new to the Spark environment. I have converted a Hive query to Spark Scala, and I am now loading data and doing performance testing. Below are the details for loading 3 weeks of data. The cluster-level average small-file size is set to 128 MB.

1. The new temp table I am loading into is ORC formatted, since the current Hive table is stored as ORC.
2. Each partition folder of the Hive table is about 200 MB.
3. I am using repartition(1) in the Spark code so that it creates one 200 MB part file in each partition folder (to avoid the small-file issue). With this, the job completes in 23 to 26 minutes.
4. If I don't use repartition(), the job completes in 12 to 13 minutes. The problem with this approach is that it creates 800 part files (each smaller than 128 MB) in every partition folder. A sketch of both write variants is shown below.

I am not sure how to reduce processing time and avoid creating small files at the same time. Could anyone please help me with this situation?
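For reference, here is a minimal sketch of the two write variants described in points 3 and 4 above. The session setup, database/table names, and the partition column (load_date) are hypothetical placeholders, not the actual job; assume the table is written with partitionBy over the Hive partition column.

    import org.apache.spark.sql.SparkSession

    object LoadThreeWeeks {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("HiveToSparkLoad")
          .enableHiveSupport()
          .getOrCreate()

        // Hypothetical query converted from Hive; table and column
        // names are placeholders.
        val df = spark.sql(
          "SELECT * FROM source_db.source_table WHERE load_date >= '2020-01-01'")

        // Variant from point 3: repartition(1) collapses the data into a
        // single Spark partition, so one writer task emits a single
        // ~200 MB file per partition directory. Job takes 23-26 min
        // because all rows funnel through that one task.
        df.repartition(1)
          .write
          .mode("overwrite")
          .format("orc")
          .partitionBy("load_date")
          .saveAsTable("temp_db.temp_table")

        // Variant from point 4: no repartition. The ~800 parallel tasks
        // finish in 12-13 min, but each task writes its own part file
        // into every partition directory it touches, producing 800
        // files (< 128 MB each) per folder.
        // df.write
        //   .mode("overwrite")
        //   .format("orc")
        //   .partitionBy("load_date")
        //   .saveAsTable("temp_db.temp_table")
      }
    }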
I am new to Spark environment. I have converted Hive query to Spark Scala. Now I am loading data and doing performance testing. Below are details on loading 3 weeks data. Cluster level small file avg size is set to 128 MB. 1. New temp table where I am loading data is ORC formatted as current Hive table is ORC stored. 2. Hive table each partition folder size is 200 MB. 3. I am using repartition(1) in spark code so that it will create one 200MB part file in each partition folder(to avoid small file issue). With this job is completing in 23 to 26 mins. 4. If I don't use repartition(), job is completing in 12 to 13 mins. But problem with this approach is, it is creating 800 part files (size <128MB) in each partition folder. I am quite not sure on how to reduce processing time and not create small files at the same time. Could anyone please help me in this situation. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ --------------------------------------------------------------------- To unsubscribe e-mail: user-unsubscr...@spark.apache.org