Hi,

I am trying to run TPC-H queries with the SparkSQL 1.1.0 CLI on EC2, using 1 r3.4xlarge master + 20 r3.4xlarge slave machines (each machine has 16 vCPUs and 122 GB of memory). The TPC-H scale factor I am using is 1000 (i.e. 1000 GB of total data).

When I try to run TPC-H query 3, i.e.

    select l_orderkey, sum(l_extendedprice*(1-l_discount)) as revenue, o_orderdate, o_shippriority
    from customer c
    join orders o on c.c_mktsegment = 'BUILDING' and c.c_custkey = o.o_custkey
    join lineitem l on l.l_orderkey = o.o_orderkey
    where o_orderdate < '1995-03-15' and l_shipdate > '1995-03-15'
    group by l_orderkey, o_orderdate, o_shippriority
    order by revenue desc, o_orderdate
    limit 10;

I get the following output on the master node after a very long pause:

    14/09/22 16:55:57 INFO scheduler.TaskSetManager: Finished task 197.0 in stage 17.0 (TID 23821) in 346 ms on ip-10-45-25-51.ec2.internal (239/320)
    14/09/22 16:55:57 INFO scheduler.TaskSetManager: Finished task 235.0 in stage 17.0 (TID 23859) in 343 ms on ip-10-45-25-51.ec2.internal (240/320)
    14/09/22 16:59:28 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(ip-10-35-182-185.ec2.internal,35198)
    14/09/22 16:59:28 INFO network.ConnectionManager: Key not valid ? sun.nio.KEY
    14/09/22 16:59:28 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(ip-10-35-182-185.ec2.internal,35198)
    14/09/22 16:59:28 INFO network.ConnectionManager: key already cancelled ? sun.nio.KEY
    java.nio.channels.CancelledKeyException
        at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)

Then the executors start getting removed. Any ideas as to why this might be occurring? Any help will be appreciated.

*Notes that might be helpful:*

I noticed that there is always a very long pause (250-300 seconds) after 240 reduce tasks are executed. Also, sometimes I get the error after 245 or 250 reduce tasks, but the pause is always after 240 reduce tasks.

I could not see any relevant information in the worker node logs. These were the last lines:

    INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 19 remote fetches in 4 ms
    INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 19 remote fetches in 4 ms
    INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 19 remote fetches in 5 ms
    INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 19 remote fetches in 5 ms

Relevant configuration information: I am using cached compressed tables i.e. I have set spark.sql.inMemoryColumnarStorage.compressed=true and then I run the cache table command for all the tables. The other configuration parameters I have set are as follows:

    spark.executor.memory 117760m
    spark.executor.extraLibraryPath /root/ephemeral-hdfs/lib/native/
    spark.executor.extraClassPath /root/ephemeral-hdfs/conf
    spark.worker.timeout 600
    spark.serializer org.apache.spark.serializer.KryoSerializer
    spark.storage.memoryFraction 0.6
    spark.storage.blockManagerSlaveTimeoutMs 100000
    spark.shuffle.memoryFraction 0.3
    spark.shuffle.consolidateFiles true
    spark.shuffle.file.buffer.kb 512
    spark.akka.timeout 600
    spark.akka.framesize 512
    spark.akka.threads 8
    spark.core.connection.ack.wait.timeout 600
    spark.spark.sql.shuffle.partitions 320

Regards,
Samay
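
P.S. In case it helps anyone reproduce the timing, here is a rough sketch of how the pause can be located in the driver output. It only assumes that each log line starts with a "yy/MM/dd HH:mm:ss" timestamp, as in the excerpt above. The function name find_pauses and the 60-second threshold are arbitrary choices made for illustration and are not part of Spark.

```python
import re
from datetime import datetime

# Driver log lines in the excerpt start with a "yy/MM/dd HH:mm:ss" timestamp.
TIMESTAMP = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ")


def find_pauses(lines, threshold_s=60):
    """Return (gap_seconds, line_before, line_after) for every gap between
    consecutive timestamped lines that exceeds threshold_s (hypothetical helper)."""
    pauses = []
    prev_ts, prev_line = None, None
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # e.g. stack-trace lines, which carry no timestamp
        ts = datetime.strptime(match.group(1), "%y/%m/%d %H:%M:%S")
        if prev_ts is not None:
            gap = (ts - prev_ts).total_seconds()
            if gap > threshold_s:
                pauses.append((gap, prev_line, line))
        prev_ts, prev_line = ts, line
    return pauses


# Two consecutive lines copied from the master output above.
sample = [
    "14/09/22 16:55:57 INFO scheduler.TaskSetManager: Finished task 235.0 in "
    "stage 17.0 (TID 23859) in 343 ms on ip-10-45-25-51.ec2.internal (240/320)",
    "14/09/22 16:59:28 INFO network.ConnectionManager: Removing SendingConnection "
    "to ConnectionManagerId(ip-10-35-182-185.ec2.internal,35198)",
]

for gap, before, after in find_pauses(sample):
    print(f"{gap:.0f} s pause between:\n  {before}\n  {after}")
```

On the two lines copied from the excerpt this reports a single gap of 211 seconds, between the 240th finished task and the first ConnectionManager message. Running it over the full driver log should show whether the pause always falls at the same point.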