[ https://issues.apache.org/jira/browse/SPARK-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16380126#comment-16380126 ]
Kevin Zhang commented on SPARK-14974:
-------------------------------------

I encountered the same problem as [~ussraf] in Spark 2.2 and 2.3, and I'm not quite sure how to fix it. Is there any plan to reopen this issue?

> spark sql job create too many files in HDFS when doing insert overwrite hive table
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-14974
>                 URL: https://issues.apache.org/jira/browse/SPARK-14974
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.5.2
>            Reporter: zenglinxi
>            Priority: Minor
>
> Recently, we have often run into problems when using Spark SQL to insert data into a partitioned table (e.g. insert overwrite table $output_table partition(dt) select xxx from tmp_table).
> After the Spark job starts running on YARN, the application can create far too many files (e.g. 2,000,000 or even 10,000,000), which puts HDFS under enormous pressure.
> We found that the number of files created by the Spark job depends on the number of partitions of the Hive table being inserted into and the number of Spark SQL shuffle partitions:
> files_num = hive_table_partitions_num * spark_sql_partitions_num
> We often set spark_sql_partitions_num (spark.sql.shuffle.partitions) >= 1000, and hive_table_partitions_num is normally small, but it can exceed 2000 when we accidentally use the wrong field as the partition field, which makes files_num >= 1000 * 2000 = 2,000,000.
> Hive has a configuration parameter, hive.exec.max.dynamic.partitions.pernode, that limits the maximum number of dynamic partitions each mapper/reducer may create, but this parameter does not take effect when we use HiveContext.
> Reducing spark_sql_partitions_num (spark.sql.shuffle.partitions) makes files_num smaller, but it also reduces concurrency.
> Can we add a configuration parameter that limits the maximum number of files each task is allowed to create, or limit spark_sql_partitions_num without affecting concurrency?
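
As an illustrative workaround (not something proposed in this ticket), the file count can often be brought down to roughly one file per Hive partition by shuffling rows on the dynamic partition column before the write, so that each distinct partition value is handled by a single task. A minimal sketch in Spark SQL, assuming the query shape from the description above; output_table, tmp_table, the selected columns, and the dt partition field are placeholders:

    -- Redistribute rows by the dynamic partition column so each distinct dt
    -- value lands on one task; that task writes one file per dt instead of
    -- every task writing a file for every dt it happens to see.
    INSERT OVERWRITE TABLE output_table PARTITION (dt)
    SELECT col_a, col_b, dt   -- placeholder columns; dt must be the last column
    FROM tmp_table
    DISTRIBUTE BY dt;

The trade-off is that all rows for a given dt are written by one task, so a heavily skewed partition becomes a single large file and a slow task; upstream stages keep their spark.sql.shuffle.partitions-way parallelism, and only the final write stage is keyed by dt.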