Hi,

I am trying to run TPC-H queries with the SparkSQL 1.1.0 CLI on EC2, using
1 r3.4xlarge master and 20 r3.4xlarge slave machines (each machine has
16 vCPUs and 122GB of memory). The TPC-H scale factor I am using is 1000
(i.e. 1000GB of total data).

When I try to run TPC-H query 5, the query hangs for a long time mid-query.
I've increased several timeouts to large values such as 600 seconds in order
to prevent block manager and connection ACK timeouts. The CPU is still in use
during the long pauses (several cores, not just one).
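
As a sketch of the kind of timeout settings meant above, the snippet below
renders two entries in spark-defaults.conf format. The property names are the
Spark 1.x connection ack-wait and block manager slave timeout settings as I
understand them. The 600 second values and the helper name render_conf are
illustrative assumptions; the values actually used are in the attached
configuration file.

```python
# Illustrative timeout settings (assumed values, not copied from the attached file).
timeouts = {
    "spark.core.connection.ack.wait.timeout": "600",       # seconds
    "spark.storage.blockManagerSlaveTimeoutMs": "600000",  # milliseconds
}


def render_conf(settings):
    """Render settings as whitespace-separated spark-defaults.conf lines."""
    return "\n".join(f"{key} {value}" for key, value in settings.items())


print(render_conf(timeouts))
```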

Query:
select
    n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
from
    customer c join
    ( select n_name, l_extendedprice, l_discount, s_nationkey, o_custkey from
      orders o join
      ( select n_name, l_extendedprice, l_discount, l_orderkey, s_nationkey from
        lineitem l join
        ( select n_name, s_suppkey, s_nationkey from supplier s join
          ( select n_name, n_nationkey
            from nation n join region r
            on n.n_regionkey = r.r_regionkey and r.r_name = 'ASIA'
          ) n1 on s.s_nationkey = n1.n_nationkey
        ) s1 on l.l_suppkey = s1.s_suppkey
      ) l1 on l1.l_orderkey = o.o_orderkey and o.o_orderdate >= '1994-01-01'
        and o.o_orderdate < '1995-01-01'
    ) o1
    on c.c_nationkey = o1.s_nationkey and c.c_custkey = o1.o_custkey
group by n_name
order by revenue desc;

Below is an excerpt from the worker node log after the timeout.

14/09/23 14:21:25 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/09/23 14:21:25 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 5 non-empty blocks out of 320 blocks
14/09/23 14:21:25 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 5 remote fetches in 1 ms
14/09/23 14:32:12 WARN executor.Executor: Told to re-register on heartbeat
14/09/23 14:32:50 INFO storage.BlockManager: BlockManager re-registering with master
14/09/23 14:32:50 INFO storage.BlockManagerMaster: Trying to register BlockManager
14/09/23 14:32:50 INFO storage.BlockManagerMaster: Registered BlockManager
14/09/23 14:32:50 WARN network.ConnectionManager: Could not find reference for received ack Message 338974
14/09/23 14:32:50 INFO storage.BlockManager: Reporting 507 blocks to the master.
14/09/23 14:32:50 ERROR storage.BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId(ip-10-45-47-24.ec2.internal,49905)
java.io.IOException: sendMessageReliably failed because ack was not received within 600 sec
    at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:854)
    at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:852)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:852)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)
14/09/23 14:33:06 ERROR storage.BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId(ip-10-239-184-234.ec2.internal,50538)
java.io.IOException: sendMessageReliably failed because ack was not received within 600 sec
    at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:854)
    at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:852)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:852)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)
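
To put a number on the pause, the gap between the last "Started 5 remote
fetches" message and the heartbeat warning that follows it can be computed
from the two timestamps in the excerpt above. This is a minimal Python sketch:
the helper name pause_seconds is hypothetical, and it assumes every log line
starts with the 17-character "yy/MM/dd HH:mm:ss" prefix shown in the log.

```python
from datetime import datetime

# Timestamp prefix used by the log lines, e.g. "14/09/23 14:21:25".
LOG_TS_FORMAT = "%y/%m/%d %H:%M:%S"


def pause_seconds(earlier_line: str, later_line: str) -> float:
    """Return the seconds elapsed between the timestamps of two log lines."""
    def ts(line: str) -> datetime:
        # The first 17 characters hold the date and time.
        return datetime.strptime(line[:17], LOG_TS_FORMAT)

    return (ts(later_line) - ts(earlier_line)).total_seconds()


last_fetch = ("14/09/23 14:21:25 INFO storage.BlockFetcherIterator"
              "$BasicBlockFetcherIterator: Started 5 remote fetches in 1 ms")
heartbeat = ("14/09/23 14:32:12 WARN executor.Executor: "
             "Told to re-register on heartbeat")

print(pause_seconds(last_fetch, heartbeat))  # 647.0
```

By this measure the worker logged nothing for about 647 seconds, slightly
longer than the 600 second ack timeout.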

I have also attached a file listing the configuration parameters I am using.

Does anybody have any idea why there is such a long pause? Also, are there
any parameters I can tune to reduce it?

I am seeing similar behaviour on several other queries where there are long
pauses of 200-300s before the query starts making progress on the master.
Some of the queries complete while the others do not. Any help would be
appreciated.

Regards,
Samay

spark-defaults.conf
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n14902/spark-defaults.conf>