Gautham,

How many gz files do you have? The likely reason is that gzip is a 
non-splittable compression format, so MapReduce cannot divide a file for 
parallel processing. A single gz file can only be processed by a single 
mapper, so the CPU threads can't be fully utilized.
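
For example, something like the following PySpark sketch illustrates the 
effect (the path and partition count below are just placeholders, not from 
your setup):

    # Each .gz file is unsplittable, so sc.textFile creates one
    # partition, and hence one task/thread, per file.
    rdd = sc.textFile("s3n://my-bucket/logs/*.gz")  # hypothetical path
    print(rdd.getNumPartitions())  # roughly one partition per gz file

    # Repartitioning after the load spreads the decompressed records
    # across all cores. Decompression itself still runs one thread per
    # file, but everything downstream can use the whole cluster.
    rdd = rdd.repartition(160)  # e.g. 5 nodes x 32 threads

If the files are few and large, splitting the data into many smaller gz 
files, or using a splittable format such as bzip2 or plain text, would let 
more tasks run in parallel during the load itself.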

-----Original Message-----
From: Gautham [mailto:gautham.a...@gmail.com] 
Sent: Wednesday, December 10, 2014 3:00 AM
To: u...@spark.incubator.apache.org
Subject: pyspark sc.textFile uses only 4 out of 32 threads per node

I am having an issue with pyspark launched on EC2 (using spark-ec2) with 5 
r3.4xlarge machines, each with 32 threads and 240 GB of RAM. When I use 
sc.textFile to load data from a number of gz files, it does not progress as 
fast as expected. When I log in to a worker node and run top, I see only 4 
threads at 100% CPU; the remaining 28 cores are idle. This is not an issue 
when processing the strings after loading, when all the cores are used to 
process the data.

Can anyone help me with this? What setting can I change to bring CPU usage 
back up to full?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-sc-textFile-uses-only-4-out-of-32-threads-per-node-tp20595.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
