Hi there! I'm trying to read a large .csv file (14GB) into a dataframe from S3 via the spark-csv package. I want to load this data in parallel using all 20 executors that I have, but by default only 3 executors are being used (they downloaded 5GB/5GB/4GB respectively).
Here is my script (I'm using pyspark):

lol_file = sqlCtx.load(source="com.databricks.spark.csv", header="false", path=lol_file_path)

I have tried adding the option flags 1) minSplits=120 and 2) minPartitions=120, but neither worked. I tried reading the source code, but I'm a noob at Scala and could not figure out how the options are being used :(

Thank you for reading, and any help is much appreciated!

Guang
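
P.S. For completeness, here is a minimal sketch of what I'm running end to end (the S3 path, app name, and SparkContext setup below are placeholders, not my exact setup):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="load-big-csv")   # placeholder app name
sqlCtx = SQLContext(sc)

# placeholder path -- the real file is a single ~14GB csv on S3
lol_file_path = "s3n://my-bucket/lol_file.csv"

# load via the spark-csv package; header="false" because the file has no header row
lol_file = sqlCtx.load(source="com.databricks.spark.csv",
                       header="false",
                       path=lol_file_path)

# variants I tried -- neither changed how many executors do the download:
# lol_file = sqlCtx.load(source="com.databricks.spark.csv", header="false",
#                        path=lol_file_path, minSplits=120)
# lol_file = sqlCtx.load(source="com.databricks.spark.csv", header="false",
#                        path=lol_file_path, minPartitions=120)

# just to show how I'm checking how many partitions the load produced
print(lol_file.rdd.getNumPartitions())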