Hi,

I am trying to run TPC-H queries with the SparkSQL 1.1.0 CLI on EC2, using 1 r3.4xlarge master and 20 r3.4xlarge slave machines (each machine has 16 vCPUs and 122 GB of memory). The TPC-H scale factor I am using is 1000 (i.e. 1000 GB of total data).

When I try to run TPC-H query 5, the query hangs for a long time mid-query. I've increased several timeouts to large values such as 600 seconds in order to prevent block manager and connection ACK timeouts. I see that the CPU is being used even during the long pauses (not just one core, but several cores).

Query:

select n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
from customer c join
  ( select n_name, l_extendedprice, l_discount, s_nationkey, o_custkey
    from orders o join
      ( select n_name, l_extendedprice, l_discount, l_orderkey, s_nationkey
        from lineitem l join
          ( select n_name, s_suppkey, s_nationkey
            from supplier s join
              ( select n_name, n_nationkey
                from nation n join region r
                on n.n_regionkey = r.r_regionkey and r.r_name = 'ASIA'
              ) n1
            on s.s_nationkey = n1.n_nationkey
          ) s1
        on l.l_suppkey = s1.s_suppkey
      ) l1
    on l1.l_orderkey = o.o_orderkey
      and o.o_orderdate >= '1994-01-01' and o.o_orderdate < '1995-01-01'
  ) o1
on c.c_nationkey = o1.s_nationkey and c.c_custkey = o1.o_custkey
group by n_name
order by revenue desc;

Below is an excerpt of the errors in the worker node log after the timeout:

14/09/23 14:21:25 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/09/23 14:21:25 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 5 non-empty blocks out of 320 blocks
14/09/23 14:21:25 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 5 remote fetches in 1 ms
14/09/23 14:32:12 WARN executor.Executor: Told to re-register on heartbeat
14/09/23 14:32:50 INFO storage.BlockManager: BlockManager re-registering with master
14/09/23 14:32:50 INFO storage.BlockManagerMaster: Trying to register BlockManager
14/09/23 14:32:50 INFO storage.BlockManagerMaster: Registered BlockManager
14/09/23 14:32:50 WARN network.ConnectionManager: Could not find reference for received ack Message 338974
14/09/23 14:32:50 INFO storage.BlockManager: Reporting 507 blocks to the master.
14/09/23 14:32:50 ERROR storage.BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId(ip-10-45-47-24.ec2.internal,49905)
java.io.IOException: sendMessageReliably failed because ack was not received within 600 sec
    at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:854)
    at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:852)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:852)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)
14/09/23 14:33:06 ERROR storage.BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId(ip-10-239-184-234.ec2.internal,50538)
java.io.IOException: sendMessageReliably failed because ack was not received within 600 sec
    at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:854)
    at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:852)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:852)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)

I have also attached a file listing the configuration parameters I am using.

Does anybody have any ideas why there is such a big pause? Also, are there any parameters I can tune to reduce this pause? I am seeing similar behaviour on several other queries, where there are long pauses of 200-300 s before the query starts making progress on the master. Some of the queries complete while others do not.

Any help would be appreciated.
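
In case it helps with diagnosis, the length of each pause can be read off the log timestamps. The following is only a minimal sketch, assuming every log entry starts with the `yy/MM/dd HH:mm:ss` prefix seen in the excerpt; the helper name `find_gaps` and the 60-second threshold are hypothetical choices for illustration and are not part of Spark.

```python
from datetime import datetime

def find_gaps(lines, threshold_s=60):
    """Return (gap_seconds, previous_line, next_line) for consecutive
    timestamped lines that are more than threshold_s seconds apart."""
    gaps = []
    prev_ts, prev_line = None, None
    for line in lines:
        try:
            # The first 17 characters hold the timestamp, e.g. "14/09/23 14:21:25".
            ts = datetime.strptime(line[:17], "%y/%m/%d %H:%M:%S")
        except ValueError:
            continue  # stack-trace or continuation line without a timestamp
        if prev_ts is not None:
            delta = (ts - prev_ts).total_seconds()
            if delta > threshold_s:
                gaps.append((delta, prev_line, line))
        prev_ts, prev_line = ts, line
    return gaps

# Three lines copied from the worker log excerpt.
log = [
    "14/09/23 14:21:25 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 5 remote fetches in 1 ms",
    "14/09/23 14:32:12 WARN executor.Executor: Told to re-register on heartbeat",
    "14/09/23 14:32:50 INFO storage.BlockManager: BlockManager re-registering with master",
]

for seconds, before, after in find_gaps(log):
    print(f"{seconds:.0f}s pause after: {before}")
```

On these lines the sketch reports a single pause of 647 seconds, between the start of the remote fetches and the heartbeat re-registration warning, which is slightly longer than the 600-second ack timeout.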

Regards,
Samay

spark-defaults.conf <http://apache-spark-user-list.1001560.n3.nabble.com/file/n14902/spark-defaults.conf>