Hello,

I have been working on a project that allows a BI tool to query roughly 25 TB of application event data from 2015 through the Thrift server and Spark SQL. In general, the submitted jobs have a step that launches tasks on the order of hundreds of thousands, one per file that needs to be processed from S3. Once that step actually starts running tasks, all of the nodes participate and the work finishes in a reasonable amount of time. However, after the logs say the step was submitted with ~200,000 tasks, there is a delay before anything runs. I believe the delay is caused by a shuffle that happens after the earlier step that maps out all of the files we are going to process based on the date range specified in the query, but I'm not sure, as I'm fairly new to Spark. I can see that the driver node is busy during this time, while all of the other nodes sit idle. Is there a way to shorten the delay?

I'm using Spark 1.6 on EMR with YARN as the master and Ganglia to monitor the nodes. I'm giving 15 GB to the driver and application master, and 3 GB per executor with one core each.
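To give a rough idea of what the jobs look like, here is a simplified sketch of the kind of read and query we run (the bucket, path layout, and column names below are placeholders, not our real schema):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // Sketch only: paths and column names are placeholders.
    object EventQuerySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("event-query-sketch"))
        val sqlContext = new HiveContext(sc)

        // Each underlying file under this prefix becomes (at least) one task,
        // which is roughly where the ~200,000-task step comes from.
        val events = sqlContext.read.json("s3://my-bucket/app-events/2015/*/*")
        events.registerTempTable("events")

        // The BI tool issues queries like this through the Thrift server;
        // the date-range predicate determines which files get scanned.
        val result = sqlContext.sql(
          """SELECT event_type, COUNT(*) AS cnt
            |FROM events
            |WHERE event_date BETWEEN '2015-03-01' AND '2015-03-31'
            |GROUP BY event_type""".stripMargin)

        result.show()
      }
    }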
Thanks,

Dillon Dukek
Software Engineer, Product Realization
Data Products & Intelligence
T-Mobile
Email: dillon.du...@t-mobile.com