Anbu Cheeralan created SPARK-18917:
--------------------------------------
Summary: Dataframe - Time out issues / taking a long time in append mode on object stores
Key: SPARK-18917
URL: https://issues.apache.org/jira/browse/SPARK-18917
Project: Spark
Issue Type: Improvement
Components: EC2, SQL, YARN
Affects Versions: 2.0.2
Reporter: Anbu Cheeralan
Priority: Minor
Fix For: 2.1.0, 2.1.1
When using DataFrame write in append mode on object stores (S3 / Google
Storage), writes take a very long time or hit read timeouts. This is because
dataframe.write lists all leaf folders in the target directory. If there are
many subfolders due to partitioning, this listing takes forever.
In org.apache.spark.sql.execution.datasources.DataSource.write(), the
following code causes a huge number of RPC calls when the file system is an
object store (S3, GS):
if (mode == SaveMode.Append) {
  val existingPartitionColumns = Try {
    resolveRelation()
      .asInstanceOf[HadoopFsRelation]
      .location
      .partitionSpec()
      .partitionColumns
      .fieldNames
      .toSeq
  }.getOrElse(Seq.empty[String])
  // ... partition-column match check follows
}
There should be a flag to skip the partition match check in append mode. I can
work on the patch.
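A minimal sketch of what such a flag could look like. The flag name
"spark.sql.append.skipPartitionCheck" is hypothetical (it is not an existing
Spark configuration), and the expensive relation resolution is stubbed out
here so the logic is self-contained:

```scala
import scala.util.Try

object AppendModeSketch {
  // Stand-in for the expensive
  // resolveRelation().asInstanceOf[HadoopFsRelation].location.partitionSpec()
  // call, which triggers a leaf-folder listing (many RPCs on an object store).
  def listPartitionColumns(): Seq[String] =
    Try(Seq("year", "month", "day")).getOrElse(Seq.empty[String])

  // When the (hypothetical) skip flag is set, return an empty column list
  // without touching the file system; otherwise do the listing as today.
  def existingPartitionColumns(skipPartitionCheck: Boolean): Seq[String] =
    if (skipPartitionCheck) Seq.empty[String]
    else listPartitionColumns()
}
```

With the flag on, the append path would no longer enumerate the target
directory, at the cost of skipping the safety check that the DataFrame's
partition columns match the existing layout.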
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]