Anbu Cheeralan created SPARK-18917:
--------------------------------------

             Summary: Dataframe - Time Out Issues / Taking long time in append 
mode on object stores
                 Key: SPARK-18917
                 URL: https://issues.apache.org/jira/browse/SPARK-18917
             Project: Spark
          Issue Type: Improvement
          Components: EC2, SQL, YARN
    Affects Versions: 2.0.2
            Reporter: Anbu Cheeralan
            Priority: Minor
             Fix For: 2.1.0, 2.1.1


When a DataFrame is written in append mode to an object store (S3 / Google 
Storage), the write takes a long time or fails with a read timeout. This 
is because dataframe.write lists all leaf folders in the target directory. If 
partitioning has produced a large number of subfolders, this listing takes a 
very long time.
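
To illustrate why the cost grows with the number of partitions, here is a toy model in plain Scala. It does not use Spark or Hadoop; Dir, listLeafDirs and buildTable are names invented for this sketch, and it assumes that every directory visited costs exactly one remote list call. Real file system clients may batch or parallelize listings, so treat the count as a rough picture rather than a measurement.

```scala
// Toy model of a partitioned table layout on an object store.
// Assumption of this sketch: listing any directory costs one remote call.
final case class Dir(name: String, children: Seq[Dir] = Seq.empty)

object LeafListingCost {
  /** Recursively visits every directory; returns (leaf directory names, list calls made). */
  def listLeafDirs(root: Dir): (Seq[String], Int) = {
    if (root.children.isEmpty) {
      // A leaf partition directory is still listed once to discover its files.
      (Seq(root.name), 1)
    } else {
      val results = root.children.map(listLeafDirs)
      (results.flatMap(_._1), 1 + results.map(_._2).sum)
    }
  }

  /** Builds a hypothetical table partitioned by date and then by hour. */
  def buildTable(days: Int, hoursPerDay: Int): Dir =
    Dir("table", (1 to days).map { d =>
      Dir(s"date=$d", (0 until hoursPerDay).map(h => Dir(s"date=$d/hour=$h")))
    })

  def main(args: Array[String]): Unit = {
    val (leaves, calls) = listLeafDirs(buildTable(days = 365, hoursPerDay = 24))
    println(s"leaf partitions: ${leaves.size}, list calls: $calls")
  }
}
```

Under these assumptions, one year of hourly partitions gives 8760 leaf partitions and 9126 list calls, each of which is a network round trip on an object store.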

In org.apache.spark.sql.execution.datasources.DataSource.write(), the 
following code causes a huge number of RPC calls when the file system is an 
object store (S3, GS):

    if (mode == SaveMode.Append) {
      val existingPartitionColumns = Try {
        resolveRelation()
          .asInstanceOf[HadoopFsRelation]
          .location
          .partitionSpec()
          .partitionColumns
          .fieldNames
          .toSeq
      }.getOrElse(Seq.empty[String])
      // ... (excerpt ends here)
    }

There should be a flag to skip the partition match check in append mode. I can 
work on a patch.
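
A minimal sketch of what such a flag could look like, written in plain Scala without Spark so it can be run on its own. The option key, the object name and the method signature are all hypothetical and do not exist in Spark. The sketch also assumes that the check simply compares the requested partition columns with the existing ones. The point it tries to show is that the existing columns are passed by name, so the expensive listing is never triggered when the flag is set.

```scala
import scala.util.Try

object AppendCheckSketch {
  // Hypothetical option key, invented for this sketch; not a real Spark setting.
  val SkipCheckKey = "hypothetical.append.skipPartitionCheck"

  /**
   * Compares the requested partition columns with the existing ones unless
   * the hypothetical flag is set. `existingColumns` is a by-name parameter,
   * so the costly lookup only runs when the check is actually performed.
   */
  def checkPartitionColumns(
      options: Map[String, String],
      requested: Seq[String],
      existingColumns: => Seq[String]): Unit = {
    val skip = options.get(SkipCheckKey).exists(_.equalsIgnoreCase("true"))
    if (!skip) {
      // Mirrors the Try { ... }.getOrElse(Seq.empty[String]) pattern in the excerpt.
      val existing = Try(existingColumns).getOrElse(Seq.empty[String])
      if (existing.nonEmpty && existing != requested) {
        throw new IllegalArgumentException(
          s"Requested partitioning ${requested.mkString(", ")} does not match " +
            s"existing partitioning ${existing.mkString(", ")}")
      }
    }
  }
}
```

With the flag set to "true" the lookup is never evaluated. Without it, the behaviour is the same as before: a mismatch raises an error, and a failed lookup is treated as "no existing partitions".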



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
