Are you reading the file from your driver (main / master) program? Is your file in a distributed file system like HDFS, available to all your nodes?
It might be due to the laziness of transformations:
http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations

"Transformations" are lazy and aren't applied until an "action" needs them (this happened to me with reads too, some time ago).

You can try calling .first() on your RDD once in a while to force Spark to load it on your cluster (though it might not be the cleanest way to do it).

Sebastián Ramírez
Algorithm Designer, Senseta <http://www.senseta.com>

On Tue, Dec 9, 2014 at 1:59 PM, Gautham <gautham.a...@gmail.com> wrote:
> I am having an issue with pyspark launched in ec2 (using spark-ec2) with 5
> r3.4xlarge machines where each has 32 threads and 240GB of RAM. When I do
> sc.textFile to load data from a number of gz files, it does not progress as
> fast as expected. When I log in to a child node and run top, I see only 4
> threads at 100% CPU. All remaining 28 cores were idle. This is not an issue
> when processing the strings after loading, when all the cores are used to
> process the data.
>
> Please help me with this? What setting can be changed to get the CPU usage
> back up to full?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-sc-textFile-uses-only-4-out-of-32-threads-per-node-tp20595.html
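
To illustrate the laziness mentioned in the reply above, here is a small plain-Python sketch. It is only a toy model, not Spark's API or implementation: the LazyDataset class, read_lines and the sample lines are all made up for the example. The idea it mimics is that calls like sc.textFile(...) and rdd.map(...) only record what should be done, while an action such as .first() or .count() is what actually triggers the reading.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark code)."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument callable returning an iterable
        self._steps = tuple(steps)   # functions queued by map(), not yet applied

    def map(self, fn):
        # "Transformation": queue fn and return a new dataset; nothing is read.
        return LazyDataset(self._source, self._steps + (fn,))

    def _iterate(self):
        # Reads the source and applies the queued functions, record by record.
        for record in self._source():
            for fn in self._steps:
                record = fn(record)
            yield record

    def first(self):
        # "Action": pulls just enough input to produce one record.
        return next(self._iterate())

    def count(self):
        # "Action": consumes the whole input.
        return sum(1 for _ in self._iterate())


reads = []  # every line actually read from the pretend file ends up here


def read_lines():
    # Pretend file reader; the three lines are invented sample data.
    for line in ["alpha", "beta", "gamma"]:
        reads.append(line)
        yield line


lines = LazyDataset(read_lines)      # plays the role of sc.textFile(...)
upper = lines.map(str.upper)         # plays the role of rdd.map(...)
reads_before_action = len(reads)     # nothing has been read at this point

first_record = upper.first()         # forces the first read
reads_after_first = len(reads)

total = upper.count()                # forces a full pass over the input
reads_after_count = len(reads)
```

Note that in this toy model count() reads the input again from the start, because nothing was kept from the earlier first() call. Real Spark also recomputes an RDD for each action unless it has been persisted, for example with .cache().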