Hi there!
I'm trying to read a large .csv file (14 GB) from S3 into a DataFrame using the
spark-csv package. I want to load this data in parallel across all 20
executors that I have, but by default only 3 executors are used
(they downloaded 5 GB, 5 GB and 4 GB respectively).
Here is my script (I'm using PySpark):
    lol_file = sqlCtx.load(source="com.databricks.spark.csv",
                           header="false",
                           path=lol_file_path)
I have tried adding the option flags 1) minSplits=120 and 2) minPartitions=120,
but neither worked. I tried reading the source code, but I'm new to Scala and
could not figure out how the options are used :(
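
For reference, here is roughly the split I am aiming for with 120 partitions. This is only a back-of-the-envelope sketch in plain Python, not Spark code: the helper name split_plan is made up for illustration, and I am assuming that "14 GB" means 14 GiB and that partitions would be spread evenly over the executors.

```python
import math


def split_plan(file_bytes, partitions, executors):
    """Return (bytes per partition, partitions per executor) for an even split.

    Hypothetical helper for illustration only; it does not call Spark.
    """
    bytes_per_partition = math.ceil(file_bytes / partitions)
    partitions_per_executor = partitions / executors
    return bytes_per_partition, partitions_per_executor


GIB = 1024 ** 3
MIB = 1024 ** 2

# Assumption: the 14 GB file is 14 GiB; the real size may differ slightly.
file_bytes = 14 * GIB

per_part, per_exec = split_plan(file_bytes, partitions=120, executors=20)
print("MiB per partition: %.1f" % (per_part / MIB))
print("partitions per executor: %.1f" % per_exec)
```

So each partition would be a bit under 120 MiB, with several partitions per executor, instead of the three multi-GB chunks I get now.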
Thank you for reading and any help is much appreciated!
Guang