Sent: September 7, 2019 9:22 AM
To: Arwin Tio
Cc: Sean Owen ; dev@spark.apache.org
Subject: Re: DataFrameReader bottleneck in
DataSource#checkAndGlobPathIfNecessary when reading S3 files
On Fri, Sep 6, 2019 at 10:56 PM Arwin Tio <arwin@hotmail.com> wrote:
I think the prob
Hi Steve,
I filed a JIRA and opened a PR for this issue:
https://issues.apache.org/jira/browse/SPARK-29089
https://github.com/apache/spark/pull/25899
Please let me know what you think.
Cheers,
Arwin
From: Steve Loughran
Sent: September 7, 2019 9:22 AM
To: Arwin Tio
mber 6, 2019 4:15 PM
To: Sean Owen
Cc: Arwin Tio ; dev@spark.apache.org
Subject: Re: DataFrameReader bottleneck in
DataSource#checkAndGlobPathIfNecessary when reading S3 files
On Fri, Sep 6, 2019 at 2:50 PM Sean Owen <sro...@gmail.com> wrote:
I think the problem is calling glo
Hello,
On Spark 2.4.4, I am using DataFrameReader#csv to read about 30 files on
S3, and I've noticed that it takes about an hour to load the data on the
driver. You can see the gap in the log timestamps around the
InMemoryFileIndex message, from 7:45 to 8:54:
19/09/06 07:44:42 INFO S