Gautham, how many gz files do you have? The likely reason is that gzip is a non-splittable compression format, so a .gz file cannot be split for parallel processing by MapReduce. A single gz file can only be read by a single mapper, which means the CPU threads can't be fully utilized during the load stage.
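If the file count is small, the load stage will never use more cores than there are files. A rough sketch of how you could confirm this and spread the data out for the later stages (the path and partition counts below are made up for illustration, not taken from your setup):

    from pyspark import SparkContext

    sc = SparkContext(appName="gzip-load-check")

    # Gzip is not splittable, so textFile creates exactly one partition
    # per .gz file; a minPartitions hint cannot split non-splittable input.
    lines = sc.textFile("s3n://your-bucket/data/*.gz")
    print(lines.getNumPartitions())  # equals the number of .gz files

    # Decompression itself stays one task per file, but repartitioning
    # right after the load lets every downstream stage use all the cores
    # (5 nodes x 32 threads = 160; 2-3x that is a common starting point).
    lines = lines.repartition(320)

To speed up the decompression stage itself, the options are to produce more, smaller .gz files (ideally at least as many as you have cores) or to switch to a splittable format such as bzip2 or plain uncompressed text.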
-----Original Message-----
From: Gautham [mailto:gautham.a...@gmail.com]
Sent: Wednesday, December 10, 2014 3:00 AM
To: u...@spark.incubator.apache.org
Subject: pyspark sc.textFile uses only 4 out of 32 threads per node

I am having an issue with pyspark launched on EC2 (using spark-ec2) with 5 r3.4xlarge machines, each with 32 threads and 240GB of RAM. When I use sc.textFile to load data from a number of gz files, it does not progress as fast as expected. When I log in to a child node and run top, I see only 4 threads at 100% CPU; the remaining 28 cores are idle. This is not an issue when processing the strings after loading, when all the cores are used. Can you help me with this? What setting can be changed to bring CPU usage back up to full?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-sc-textFile-uses-only-4-out-of-32-threads-per-node-tp20595.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org