I think the problem is calling globStatus to expand all 300K files. This is a general problem for object stores and huge numbers of files. Steve L. may have better thoughts on real solutions. But you might consider, if possible, running many CSV-read jobs in parallel, each querying a subset of the files, and unioning the results. At least that way you parallelize the listing and reading from the object store.
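Something along these lines is what I mean (an untested sketch, not a drop-in solution; the prefix layout and the group count of 16 are made up, so adjust to however your files are actually keyed):

  import org.apache.spark.sql.{DataFrame, SparkSession}

  val spark = SparkSession.builder.appName("ParallelCsvRead").getOrCreate()

  // Assumed: the 300K files fall under partition-like prefixes, so the
  // glob for each subset stays small.
  val prefixes: Seq[String] =
    (0 until 16).map(i => f"s3://my-bucket/logs/day=$i%02d/*.csv")

  // Kick off the reads from a parallel collection so the driver-side file
  // listing for each subset overlaps with the others, then union them.
  val parts: Seq[DataFrame] = prefixes.par.map(p => spark.read.csv(p)).seq
  val all: DataFrame = parts.reduce(_ union _)

Each read still pays its own listing cost, but at least the subsets are listed concurrently instead of as one 300K-file glob.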
I think it's hard to optimize this case from the Spark side, as it's not clear up front how big a glob like s3://foo/* is going to be. It would probably take reimplementing some logic to expand the glob incrementally, or something along those lines. Or maybe I am overlooking optimizations that have gone into Spark 3.

On Fri, Sep 6, 2019 at 7:09 AM Arwin Tio <arwin....@hotmail.com> wrote:
>
> Hello,
>
> On Spark 2.4.4, I am using DataFrameReader#csv to read about 300000 files
> on S3, and I've noticed that it takes about an hour for it to load the
> data on the Driver. You can see the timestamp difference when the log
> from InMemoryFileIndex occurs, from 7:45 to 8:54:
>
> 19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
> 19/09/06 07:44:42 INFO SparkContext: Submitted application: LoglineParquetGenerator
> ...
> 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
> 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories in parallel under: [300K files...]
>
> I believe that the issue comes from DataSource#checkAndGlobPathIfNecessary
> [0], specifically from when it calls FileSystem#exists. Unlike
> bulkListLeafFiles, the existence check here happens in a single-threaded
> flatMap, which means a blocking network call per path if your files are
> stored on S3.
>
> I believe there is a fairly straightforward opportunity for improvement
> here: parallelize the existence check, perhaps with a configurable number
> of threads. If that seems reasonable, I would like to create a JIRA
> ticket and submit a patch. Please let me know!
>
> Cheers,
>
> Arwin
>
> [0] https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L557
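For anyone curious what the parallelized existence check proposed above could look like, here is a rough sketch (hypothetical, not the actual Spark patch; the method name checkPathsExist and the numThreads parameter are made up for illustration) that runs the blocking FileSystem#exists calls on a fixed-size thread pool:

  import java.util.concurrent.Executors
  import scala.concurrent.{Await, ExecutionContext, Future}
  import scala.concurrent.duration.Duration
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.Path

  // Check that every input path exists, issuing the blocking
  // FileSystem#exists calls concurrently instead of one at a time.
  def checkPathsExist(paths: Seq[Path],
                      hadoopConf: Configuration,
                      numThreads: Int): Unit = {
    val pool = Executors.newFixedThreadPool(numThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      val checks = paths.map { path =>
        Future {
          // getFileSystem returns a cached handle per scheme/authority,
          // so creating it inside the Future is cheap after the first call
          val fs = path.getFileSystem(hadoopConf)
          if (!fs.exists(path)) {
            throw new java.io.FileNotFoundException(s"Path does not exist: $path")
          }
        }
      }
      // Block until all checks finish; propagates the first failure, if any
      Await.result(Future.sequence(checks), Duration.Inf)
    } finally {
      pool.shutdown()
    }
  }

With S3 round trips on the order of tens of milliseconds each, doing these checks concurrently rather than sequentially is where the proposed win comes from.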