Hello,

I have a piece of code that looks roughly like this:

df = spark.read.parquet("s3://bucket/data.parquet/name=A",
                        "s3://bucket/data.parquet/name=B")
df_out = df.xxxx     # Do stuff to transform df
df_out.write.partitionBy("name").parquet("s3://bucket/data.parquet")
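
For completeness, here is a self-contained sketch of what I mean (not my exact 
code: the basePath option is shown so that the name partition column is still 
available for the partitionBy, and mode("append") is only there so the sketch 
runs against an already-existing directory):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read only the partitions of interest; basePath keeps partition discovery
# rooted at the table directory so the "name" column is still inferred.
df = (spark.read
      .option("basePath", "s3://bucket/data.parquet")
      .parquet("s3://bucket/data.parquet/name=A",
               "s3://bucket/data.parquet/name=B"))

df_out = df  # placeholder for the actual transformations

# Write back into the same partitioned layout.
(df_out.write
     .mode("append")
     .partitionBy("name")
     .parquet("s3://bucket/data.parquet"))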

I specify explicit paths when reading to avoid a very long globbing phase, as 
there are many such partitions and I am only interested in a few. I observe 
that whenever Spark does the write, I see the actual write to S3 happen 
(through the output committer), but then the Spark driver spends a lot of time 
listing `s3://bucket/data.parquet/name=A`, `s3://bucket/data.parquet/name=B`, 
and so on (in practice I have more prefixes, and this leads to parallel 
listings using Spark tasks, not just on the driver).
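
For reference, my understanding is that the switch from driver-side listing to 
a distributed listing job is controlled by the following settings (defaults 
shown as I understand them; they only seem to change how the listing runs, not 
whether it runs at all):

# Listing is done on the driver up to this many paths, then via a Spark job.
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
# Upper bound on the parallelism of that listing job.
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.parallelism", "10000")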

I have seen at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L187
that after the write the FileIndex is refreshed, which I suspect is what 
triggers this listing.

However, I am not exactly sure how that whole FileIndex mechanism works: is it 
the FileIndex of the DataFrame that was read initially? And how can I avoid 
this refresh? It is not useful in my case, since I terminate the job right 
after writing.
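
To make the question concrete, what I am after is something along these lines 
(the option name is purely hypothetical, I am not aware of any such knob -- 
that is exactly what I am asking):

# Hypothetical option, for illustration only -- I don't think this exists:
(df_out.write
     .option("skipFileIndexRefresh", "true")   # hypothetical, does not exist
     .partitionBy("name")
     .parquet("s3://bucket/data.parquet"))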

I've been looking for some documentation on the phenomenon but couldn't find 
anything.

Thanks for the help!
