Hi Spark users and developers,

Has anyone run into an issue where a Spark SQL job that produces a very large number of files (over 1 million) hangs in the refresh method? I'm using Spark 1.5.1. I can see that all of the parquet files have been produced, but the driver is then doing something very intensive (it saturates all of its CPUs). Does this mean Spark SQL cannot be used to produce over 1 million files in a single job?
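For reference, the write is shaped roughly like this (a minimal sketch, not my exact job; the app name, paths, and the repartition used to force the file count are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ManyFilesWrite {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("many-files-write"))
    val sqlContext = new SQLContext(sc)

    // Placeholder input; the real job derives its DataFrame from existing data.
    val df = sqlContext.read.parquet("/path/to/input")

    // Each partition is written as at least one parquet file, so this write
    // produces over a million output files. The write tasks all finish and
    // the files appear on the filesystem, but the driver then pegs its CPUs
    // inside ParquetRelation.refresh() (see the stack trace below).
    df.repartition(1200000).write.parquet("/path/to/output")
  }
}

Below is the driver stack trace captured while the job hangs: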
Thread 528: (state = BLOCKED)
 - java.util.Arrays.copyOf(char[], int) @bci=1, line=2367 (Compiled frame)
 - java.lang.AbstractStringBuilder.expandCapacity(int) @bci=43, line=130 (Compiled frame)
 - java.lang.AbstractStringBuilder.ensureCapacityInternal(int) @bci=12, line=114 (Compiled frame)
 - java.lang.AbstractStringBuilder.append(java.lang.String) @bci=19, line=415 (Compiled frame)
 - java.lang.StringBuilder.append(java.lang.String) @bci=2, line=132 (Compiled frame)
 - org.apache.hadoop.fs.Path.toString() @bci=128, line=384 (Compiled frame)
 - org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache$$anonfun$listLeafFiles$1.apply(org.apache.hadoop.fs.FileStatus) @bci=4, line=447 (Compiled frame)
 - org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache$$anonfun$listLeafFiles$1.apply(java.lang.Object) @bci=5, line=447 (Compiled frame)
 - scala.collection.TraversableLike$$anonfun$map$1.apply(java.lang.Object) @bci=9, line=244 (Compiled frame)
 - scala.collection.TraversableLike$$anonfun$map$1.apply(java.lang.Object) @bci=2, line=244 (Compiled frame)
 - scala.collection.IndexedSeqOptimized$class.foreach(scala.collection.IndexedSeqOptimized, scala.Function1) @bci=22, line=33 (Compiled frame)
 - scala.collection.mutable.ArrayOps$ofRef.foreach(scala.Function1) @bci=2, line=108 (Compiled frame)
 - scala.collection.TraversableLike$class.map(scala.collection.TraversableLike, scala.Function1, scala.collection.generic.CanBuildFrom) @bci=17, line=244 (Compiled frame)
 - scala.collection.mutable.ArrayOps$ofRef.map(scala.Function1, scala.collection.generic.CanBuildFrom) @bci=3, line=108 (Interpreted frame)
 - org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.listLeafFiles(java.lang.String[]) @bci=279, line=447 (Interpreted frame)
 - org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.refresh() @bci=8, line=453 (Interpreted frame)
 - org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$fileStatusCache$lzycompute() @bci=26, line=465 (Interpreted frame)
 - org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$fileStatusCache() @bci=12, line=463 (Interpreted frame)
 - org.apache.spark.sql.sources.HadoopFsRelation.refresh() @bci=1, line=540 (Interpreted frame)
 - org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.refresh() @bci=1, line=204 (Interpreted frame)
 - org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp() @bci=392, line=152 (Interpreted frame)
 - org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply() @bci=1, line=108 (Interpreted frame)
 - org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply() @bci=1, line=108 (Interpreted frame)
 - org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(org.apache.spark.sql.SQLContext, org.apache.spark.sql.SQLContext$QueryExecution, scala.Function0) @bci=96, line=56 (Interpreted frame)
 - org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(org.apache.spark.sql.SQLContext) @bci=718, line=108 (Interpreted frame)
 - org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute() @bci=20, line=57 (Interpreted frame)
 - org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult() @bci=15, line=57 (Interpreted frame)
 - org.apache.spark.sql.execution.ExecutedCommand.doExecute() @bci=12, line=69 (Interpreted frame)
 - org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply() @bci=11, line=140 (Interpreted frame)
 - org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply() @bci=1, line=138 (Interpreted frame)
 - org.apache.spark.rdd.RDDOperationScope$.withScope(org.apache.spark.SparkContext, java.lang.String, boolean, boolean, scala.Function0) @bci=131, line=147 (Interpreted frame)
 - org.apache.spark.sql.execution.SparkPlan.execute() @bci=189, line=138 (Interpreted frame)
 - org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute() @bci=21, line=933 (Interpreted frame)
 - org.apache.spark.sql.SQLContext$QueryExecution.toRdd() @bci=13, line=933 (Interpreted frame)
 - org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(org.apache.spark.sql.SQLContext, java.lang.String, java.lang.String[], org.apache.spark.sql.SaveMode, scala.collection.immutable.Map, org.apache.spark.sql.DataFrame) @bci=293, line=197 (Interpreted frame)
 - org.apache.spark.sql.DataFrameWriter.save() @bci=64, line=146 (Interpreted frame)
 - org.apache.spark.sql.DataFrameWriter.save(java.lang.String) @bci=24, line=137 (Interpreted frame)
 - org.apache.spark.sql.DataFrameWriter.parquet(java.lang.String) @bci=8, line=304 (Interpreted frame)

Best Regards,
Jerry