Re: pyspark sc.textFile uses only 4 out of 32 threads per node

Sebastián Ramírez Tue, 16 Dec 2014 14:26:21 -0800

Are you reading the file from your driver (main / master) program?

Is your file in a distributed file system like HDFS, available to all your nodes?


It might be due to the laziness of transformations:
http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations

"Transformations" are lazy and aren't applied until an "action" needs
them (this happened to me with file reads too, some time ago).
You can try calling .first() on your RDD once in a while to force Spark
to load the RDD on your cluster (though it might not be the cleanest way
to do it).
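
To make the laziness visible, here is a minimal sketch. It assumes you
already have a SparkContext (the pyspark shell provides one as `sc`); the
helper name load_and_materialize and the path are placeholders of mine, not
anything from your setup. It uses .count() instead of .first() because, as
far as I know, .first() only computes as much as it needs to return one
element, while .count() has to go through every partition.

```python
import time


def load_and_materialize(sc, path):
    """Define a lazy read of `path`, then run an action so it really loads.

    `sc` is an existing SparkContext; `path` is a placeholder for your own
    input location. Both the helper and its name are hypothetical.
    """
    start = time.time()
    lines = sc.textFile(path)  # transformation: records the plan, reads nothing yet
    lines.cache()              # keep the data in memory once it has been computed
    define_secs = time.time() - start

    start = time.time()
    n = lines.count()          # action: this is where the files are actually read
    load_secs = time.time() - start

    print("define: %.3fs, load: %.3fs, lines: %d, partitions: %d"
          % (define_secs, load_secs, n, lines.getNumPartitions()))
    return lines, n
```

In the pyspark shell you would call it as load_and_materialize(sc, "your/input/path").
The "define" time should be close to zero and the "load" time should be
where the reading cost shows up. The partition count it prints may also be
worth a look when checking how many tasks can run in parallel.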


*Sebastián Ramírez*
Algorithm Designer

On Tue, Dec 9, 2014 at 1:59 PM, Gautham <gautham.a...@gmail.com> wrote:
>
> I am having an issue with pyspark launched on EC2 (using spark-ec2) with 5
> r3.4xlarge machines, each with 32 threads and 240GB of RAM. When I use
> sc.textFile to load data from a number of gz files, it does not progress as
> fast as expected. When I log in to a child node and run top, I see only 4
> threads at 100% CPU. The remaining 28 are idle. This is not an issue
> when processing the strings after loading, when all the cores are used to
> process the data.
>
> Could you please help me with this? What setting can be changed to get the
> CPU usage back up to full?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-sc-textFile-uses-only-4-out-of-32-threads-per-node-tp20595.html
