Thanks for your response. I have a couple of questions if you don’t mind.
1) Are you saying that if I were to bring all of the data into HDFS on the cluster I could avoid this problem? The compressed Parquet form of the data stored on S3 is only about 3 TB, so that could be an option.

2) This one is a little unrelated, but I haven't been able to get executors to run with more than one core when starting the thrift server or the Spark SQL shell, using either the --executor-cores flag or the spark.executor.cores parameter. Is that also a pre-existing issue that I'm not aware of? It's not too much of a problem, but it makes it a little hard to utilize all of my available CPU when I want to leave enough memory on the executors for larger reduce tasks.

Thanks,

Dillon Dukek
Software Engineer, Product Realization
Data Products & Intelligence
T-Mobile
Cell: 360-316-9309
Email: dillon.du...@t-mobile.com

From: Teng Qiu [mailto:teng...@gmail.com]
Sent: Tuesday, February 16, 2016 12:11 PM
To: Dukek, Dillon <dillon.du...@t-mobile.com>
Cc: user@spark.apache.org
Subject: Re: Spark SQL step with many tasks takes a long time to begin processing

I believe this is a known issue when using Spark/Hive with files on S3. The huge delay on the driver side is caused by partition listing and split computation, and it is more of an issue with Hive, since you are using the thrift server and the SQL queries run in HiveContext.

Qubole made some optimizations for this (https://www.qubole.com/blog/product/optimizing-s3-bulk-listings-for-performant-hive-queries/), but they are not open sourced.

We at Zalando are using Spark on AWS as well, and we made some changes to optimize insert performance for S3 files; you can find them here: https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando

Hope this gives you/the Spark devs some ideas for fixing this listing issue.

2016-02-16 19:59 GMT+01:00 Dukek, Dillon <dillon.du...@t-mobile.com>:

Hello,

I have been working on a project that allows a BI tool to query roughly 25 TB of application event data from 2015 using the thrift server and Spark SQL. In general, the jobs that are submitted have a step that launches a very large number of tasks, on the order of hundreds of thousands, equal to the number of files that need to be processed from S3. Once that step actually starts running tasks, all of the nodes participate and it finishes in a reasonable amount of time. However, after the logs say that the step was submitted with ~200,000 tasks, there is a delay. I believe the delay is caused by a shuffle that follows an earlier step which maps all of the files we are going to process based on the date range specified in the query, but I'm not sure, as I'm fairly new to Spark. I can see that the driver node is busy during this time, but all of the other nodes are inactive. I was wondering if there is a way to shorten the delay?

I'm using Spark 1.6 on EMR with YARN as the master, and Ganglia to monitor the nodes. I'm giving 15 GB to the driver and application master, and 3 GB per executor with one core each.

Thanks,

Dillon Dukek
Software Engineer, Product Realization
Data Products & Intelligence
T-Mobile
Email: dillon.du...@t-mobile.com
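
As a point of reference for the listing discussion above, this is the kind of setting one might experiment with to parallelize the driver-side file listing. It is only a sketch, assuming Spark 1.6 on YARN with Hadoop 2.x; the property values are illustrative, and nothing in this thread confirms that either setting helps for Hive tables read through the thrift server:

    # Illustrative only: pass listing-related settings when launching the thrift server.
    # mapreduce.input.fileinputformat.list-status.num-threads asks Hadoop's FileInputFormat
    # to list input paths with multiple threads instead of one.
    # The parallelPartitionDiscovery threshold applies to the Spark SQL data source
    # (non-Hive) read path, so it may not take effect for a HiveContext table scan.
    ./sbin/start-thriftserver.sh \
      --master yarn \
      --conf spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads=20 \
      --conf spark.sql.sources.parallelPartitionDiscovery.threshold=32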
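
And on the executor-cores question, for what it's worth, a minimal sketch of the two equivalent ways of requesting more than one core per executor when launching the thrift server (paths and values are illustrative; the spark-sql shell accepts the same spark-submit options):

    # Request 4 cores per executor via the spark-submit flag...
    ./sbin/start-thriftserver.sh --master yarn --executor-memory 3g --executor-cores 4

    # ...or via the equivalent configuration property
    ./sbin/start-thriftserver.sh --master yarn --executor-memory 3g --conf spark.executor.cores=4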