Hello,
I'm using Spark on YARN in cluster mode.
Is there a way to avoid copying the directory /etc/hadoop/conf to the
machine where I run spark-submit?
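For context, the launch currently looks roughly like the sketch below; the jar,
class, and paths are placeholders, and the SparkLauncher wrapper just stands in
for a plain spark-submit call:

import java.util.{HashMap => JHashMap}
import org.apache.spark.launcher.SparkLauncher

object LaunchJob {
  def main(args: Array[String]): Unit = {
    // Environment for the child spark-submit process; this copied conf
    // directory is exactly what I'd like to stop shipping to the edge machine.
    val env = new JHashMap[String, String]()
    env.put("HADOOP_CONF_DIR", "/etc/hadoop/conf")

    val proc = new SparkLauncher(env)
      .setMaster("yarn")
      .setDeployMode("cluster")
      .setAppResource("/path/to/my-job.jar") // placeholder
      .setMainClass("com.example.MyJob")     // placeholder
      .launch()

    proc.waitFor()
  }
}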
Regards,
Yann.
Hello,
A Spark job on EMR reads parquet files located in an S3 bucket.
I use this option: spark.hadoop.fs.s3a.experimental.input.fadvise=random
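For reference, the option is set when the session is built, roughly as below;
the bucket and prefix are placeholders, only the fadvise setting matches the
real job:

import org.apache.spark.sql.SparkSession

// Minimal sketch of the read path; only the fadvise setting reflects the
// real job, the bucket and prefix are placeholders.
val spark = SparkSession.builder()
  .appName("emr-s3a-read")
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
  .getOrCreate()

val df = spark.read.parquet("s3a://my-bucket/some/prefix/")
df.count() // forces the parquet footers and row groups to actually be read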
When the EC2 instances and the bucket are in the same region, performance
is roughly the same, but when they are not, performance drops sharply (job
duration…
Hello,
I'm using Spark 2.3.1.
I have a job that reads 5,000 small parquet files from S3.
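The job boils down to something like this sketch (the prefix is a placeholder
and the per-partition work is simplified):

import org.apache.spark.sql.SparkSession

// Simplified sketch; the real per-partition logic does more than count rows.
val spark = SparkSession.builder().appName("small-parquet-files").getOrCreate()

// The ~5,000 small parquet files live under this (placeholder) prefix.
val df = spark.read.parquet("s3a://my-bucket/small-files/")

val perPartitionCounts = df.rdd
  .mapPartitions(rows => Iterator(rows.length))
  .collect()

// One element per task/partition -- this is where I see 278 instead of 5000.
println(s"partitions: ${perPartitionCounts.length}")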
When I do a mapPartitions followed by a collect, only *278* tasks are used
(I would have expected 5000). Does Spark group small files? If so, what
is the threshold for grouping? Is it configurable? Any l…