How to avoid copying Hadoop conf to submit on YARN

2019-01-29 Thread Yann Moisan
Hello, I'm using Spark on YARN in cluster mode. Is there a way to avoid copying the directory /etc/hadoop/conf to the machine where I run spark-submit? Regards, Yann.
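
A minimal sketch, not from the original message: Hadoop settings can be passed through Spark configuration with the spark.hadoop.* prefix instead of being read from a local /etc/hadoop/conf copy. The hostnames and ports below are placeholders, and whether this fully replaces HADOOP_CONF_DIR for YARN submission depends on the Spark version and setup.

// Sketch: pass Hadoop properties via SparkConf (spark.hadoop.* is copied
// into the Hadoop Configuration). Hostnames/ports are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("no-local-hadoop-conf")
  .config("spark.hadoop.fs.defaultFS", "hdfs://namenode.example.com:8020")
  .config("spark.hadoop.yarn.resourcemanager.address", "rm.example.com:8032")
  .getOrCreate()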

[Spark SQL] [Spark 2.4.0] Performance regression when reading parquet files from S3

2018-11-14 Thread Yann Moisan
Hello, A Spark job on EMR reads parquet files located in an S3 bucket. I use this option: spark.hadoop.fs.s3a.experimental.input.fadvise=random. When the EC2 instances and the bucket are in the same region, performance is about the same, but when they are not, performance drops sharply (job durati
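
A minimal sketch, not from the original message, of how the S3A fadvise option mentioned above might be set before reading parquet from S3. The bucket and path are placeholders.

// Sketch: set the S3A fadvise policy via spark.hadoop.* and read parquet.
// Random fadvise favours seek-heavy columnar reads (Parquet footers/pages);
// it does not remove the per-request latency added by cross-region access.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-from-s3")
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
  .getOrCreate()

val df = spark.read.parquet("s3a://my-bucket/path/to/parquet/")
df.count()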

[Spark SQL] Does Spark group small files

2018-11-13 Thread Yann Moisan
Hello, I'm using Spark 2.3.1. I have a job that reads 5,000 small parquet files from S3. When I do a mapPartitions followed by a collect, only *278* tasks are used (I would have expected 5,000). Does Spark group small files? If yes, what is the threshold for grouping? Is it configurable? Any l
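
A minimal sketch, not from the original message, of the settings that govern this grouping: the file-based data sources pack small files into read partitions according to spark.sql.files.maxPartitionBytes (default 128 MB) and spark.sql.files.openCostInBytes (default 4 MB). The path below is a placeholder.

// Sketch: tune the file-packing thresholds; the resulting partition count is
// roughly total input size / maxPartitionBytes (plus the per-file open cost),
// not one partition per file.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("small-file-grouping")
  .config("spark.sql.files.maxPartitionBytes", "134217728") // 128 MB per read partition
  .config("spark.sql.files.openCostInBytes", "4194304")     // 4 MB charged per opened file
  .getOrCreate()

val df = spark.read.parquet("s3a://my-bucket/many-small-files/")
println(df.rdd.getNumPartitions)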