sqlCtx.load a single big csv file from s3 in parallel

gy8 Thu, 04 Jun 2015 15:56:28 -0700

Hi there!

I'm trying to read a large .csv file (14 GB) into a DataFrame from S3 via the
spark-csv package. I want to load this data in parallel, using all 20
executors that I have; however, by default only 3 executors are being used
(they downloaded 5 GB, 5 GB, and 4 GB respectively).


Here is my script (I'm using PySpark):

lol_file = sqlCtx.load(source="com.databricks.spark.csv",
                       header="false",
                       path=lol_file_path)

I have tried adding the option flags 1) minSplits=120 and 2) minPartitions=120,
but neither worked. I tried reading the source code, but I'm a noob at Scala
and could not figure out how the options are being used :(
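
For reference, the arithmetic behind these numbers can be sketched in plain Python. This is only a rough illustration, not a description of how spark-csv or the S3 filesystem actually computes splits. The helper names (num_splits, target_partitions), the 5 GB split size (chosen only because it reproduces the 5/5/4 pattern above), the 128 MB alternative, and the factor of 6 tasks per executor (chosen only because 20 * 6 gives the 120 I tried) are all assumptions.

```python
def num_splits(file_size_bytes, split_size_bytes):
    """How many input splits a splittable file yields at a given split size.

    Uses integer ceiling division so the last, smaller chunk still counts.
    """
    return max(1, -(-file_size_bytes // split_size_bytes))


def target_partitions(num_executors, tasks_per_executor=6):
    """Hypothetical rule of thumb: a few tasks per executor."""
    return num_executors * tasks_per_executor


GB = 1024 ** 3
MB = 1024 ** 2
file_size = 14 * GB

# An assumed 5 GB split size reproduces the three downloads I observed.
print(num_splits(file_size, 5 * GB))     # 3

# An assumed 128 MB split size would give far more read tasks.
print(num_splits(file_size, 128 * MB))   # 112

# 20 executors with the assumed 6 tasks each.
print(target_partitions(20))             # 120

# With the DataFrame in hand, redistributing it would look something like:
#   lol_file = lol_file.repartition(target_partitions(20))
# (assumes a Spark version where DataFrame.repartition is available)
```

Note that repartition shuffles the data after it has been read, so it would spread the later stages across all 20 executors but would not by itself make the initial S3 download more parallel; that part depends on how many input splits the read produces.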

Thank you for reading and any help is much appreciated!

Guang



