Hi,
I created https://issues.apache.org/jira/browse/SPARK-21056 and proposed an
implementation here: https://github.com/apache/spark/pull/18269
I'll try to address cloud-fan's comment ASAP.
Any input is welcome.
Regards,
Bertrand
On Thu, Jun 15, 2017 at 1:27 AM, Mike Wheeler wrote:
I might have a similar problem:
in the spark-shell:
val data = spark.read.parquet("...")
After hitting enter, it takes more than 30 seconds for the "read" to
complete and return control to the command line. I am running Spark 2.1.1,
but I have also tested on 2.0.2 and encountered the same issue.
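For what it's worth, a minimal way to time the call in the spark-shell (the path below is a placeholder, not the actual dataset):

val t0 = System.nanoTime()
val data = spark.read.parquet("/path/to/dataset")  // placeholder path; file listing and schema resolution happen here
println(s"spark.read.parquet took ${(System.nanoTime() - t0) / 1e9} seconds")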
thanks,
Hi Bertrand,
I encourage you to create a ticket for this and submit a PR if you have time.
Please add me as a listener, and I'll try to contribute/review.
Michael
> On Jun 6, 2017, at 5:18 AM, Bertrand Bossy wrote:
Hi,
Since moving to Spark 2.1 from 2.0, we have been experiencing a performance
regression when reading a large, partitioned parquet dataset:
we observe many (hundreds of) very short jobs executing before the job that
actually reads the data starts. I looked into this issue and pinned it down to
PartitioningAwareFileIndex.
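For illustration only (this is a sketch, not the fix proposed in the PR above, and it assumes the short jobs come from parallel file listing during partition discovery): Spark 2.x only launches listing jobs once the number of paths exceeds spark.sql.sources.parallelPartitionDiscovery.threshold, so raising that threshold keeps the listing on the driver:

// Sketch only (not the fix from the PR above): the threshold controls when
// Spark launches jobs to list files in parallel; raising it keeps the listing
// on the driver, which can avoid many short listing jobs.
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "8192")
val df = spark.read.parquet("/path/to/partitioned/dataset")  // placeholder path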