Hello,

On Spark 2.4.4, I am using DataFrameReader#csv to read about 300,000 files on 
S3, and I've noticed that the Driver spends about an hour before it even begins 
loading the data. You can see the gap in the timestamps below: the 
InMemoryFileIndex listing only starts at 8:54, more than an hour after the 
previous log line at 7:45:
19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
...
19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories in 
parallel under: [300K files...]

I believe the issue comes from DataSource#checkAndGlobPathIfNecessary [0], 
specifically its call to FileSystem#exists. Unlike bulkListLeafFiles, the 
existence check here runs sequentially inside a single-threaded flatMap, and 
each FileSystem#exists call is a blocking network round trip when the files are 
stored on S3.
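
For reference, the pattern I am describing looks roughly like the following 
simplified sketch (not the actual Spark code; the method name is just 
illustrative). The point is that each exists() call blocks on a round trip to 
S3 and they are issued one after another on the Driver:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

def resolvePathsSequentially(paths: Seq[String], hadoopConf: Configuration): Seq[Path] = {
  paths.flatMap { pathString =>
    val path = new Path(pathString)
    val fs = path.getFileSystem(hadoopConf)
    // Each exists() is a blocking round trip to S3, and the flatMap issues
    // them strictly one at a time, so ~300K paths add up to roughly an hour.
    if (!fs.exists(path)) {
      throw new IllegalArgumentException(s"Path does not exist: $path")
    }
    Seq(path)
  }
}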

I believe there is a fairly straightforward opportunity for improvement here: 
parallelize the existence checks, perhaps with a configurable number of 
threads. A rough sketch of what I have in mind is below. If that seems 
reasonable, I would like to create a JIRA ticket and submit a patch. Please let 
me know!
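
The sketch (names and the error type are placeholders, not a final patch): fan 
the exists() calls out over a fixed-size thread pool whose size would come from 
a new configuration entry.

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

def checkPathsExistInParallel(
    paths: Seq[Path],
    hadoopConf: Configuration,
    numThreads: Int): Unit = {
  val pool = Executors.newFixedThreadPool(numThreads)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    val checks = paths.map { path =>
      Future {
        val fs = path.getFileSystem(hadoopConf)
        // Still a blocking round trip per path, but now up to numThreads
        // of them are in flight at once instead of strictly one at a time.
        if (!fs.exists(path)) {
          throw new IllegalArgumentException(s"Path does not exist: $path")
        }
      }
    }
    Await.result(Future.sequence(checks), Duration.Inf)
  } finally {
    pool.shutdown()
  }
}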

Cheers,

Arwin

[0] 
https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L557
