Re: DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files

2019-12-09 Thread Arwin Tio
Sent: September 7, 2019 9:22 AM To: Arwin Tio Cc: Sean Owen; dev@spark.apache.org Subject: Re: DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files On Fri, Sep 6, 2019 at 10:56 PM Arwin Tio <arwin@hotmail.com> wrote: I think the prob

Re: DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files

2019-09-23 Thread Arwin Tio
Hi Steve, I filed a JIRA and opened a PR for this issue: https://issues.apache.org/jira/browse/SPARK-29089 and https://github.com/apache/spark/pull/25899. Please let me know what you think. Cheers, Arwin From: Steve Loughran Sent: September 7, 2019 9:22 AM To: Arwin Tio

Re: DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files

2019-09-06 Thread Arwin Tio
mber 6, 2019 4:15 PM To: Sean Owen Cc: Arwin Tio; dev@spark.apache.org Subject: Re: DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files On Fri, Sep 6, 2019 at 2:50 PM Sean Owen <sro...@gmail.com> wrote: I think the problem is calling glo

DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files

2019-09-06 Thread Arwin Tio
Hello, On Spark 2.4.4, I am using DataFrameReader#csv to read about 30 files on S3, and I've noticed that it takes about an hour to load the data on the Driver. You can see the gap in the log timestamps, which go from 7:45 to 8:54 when the InMemoryFileIndex message appears: 19/09/06 07:44:42 INFO S