Hello,

I have been working on a project that lets a BI tool query roughly 25 TB of application event data from 2015 through the Thrift server and Spark SQL. The jobs that get submitted generally include a step that submits tasks on the order of hundreds of thousands, one per file that needs to be processed from S3. Once that step actually starts running tasks, all of the nodes join in and the work finishes in a reasonable amount of time. However, after the logs say the step was submitted with ~200,000 tasks, there is a long delay before any tasks start. I believe the delay is caused by a shuffle that happens after the preceding step, which maps all of the files we are going to process based on the date range specified in the query, but I'm not sure, as I'm fairly new to Spark. I can see that the driver node is busy during this time while all of the other nodes are idle. Is there a way to shorten this delay? I'm running Spark 1.6 on EMR with YARN as the master and Ganglia to monitor the nodes. I'm giving 15 GB to the driver and application master, and 3 GB per executor with one core each.
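For reference, here is roughly the shape of the job. This is only a sketch: the S3 path, table name, and column names (events, event_date, event_type) are placeholders rather than our real schema, and the resource flags in the comment are the ones I pass at launch. In Spark 1.6 this goes through a HiveContext rather than a SparkSession.

// Roughly how the job is launched (flags passed to spark-submit /
// start-thriftserver.sh rather than set in code):
//   --master yarn --driver-memory 15g --executor-memory 3g --executor-cores 1
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object EventRangeQuery {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("event-range-query"))
    val sqlContext = new HiveContext(sc) // Spark 1.6: HiveContext, not SparkSession

    // Placeholder path and schema -- one task ends up being created per
    // input file selected from S3.
    val events = sqlContext.read.parquet("s3://my-bucket/events/2015/")
    events.registerTempTable("events")

    // The BI tool issues queries like this through the Thrift server; the
    // date predicate determines which files (and how many tasks) are read.
    sqlContext.sql(
      """SELECT event_type, COUNT(*) AS cnt
        |FROM events
        |WHERE event_date BETWEEN '2015-01-01' AND '2015-03-31'
        |GROUP BY event_type""".stripMargin).show()
  }
}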


Thanks,

Dillon Dukek
Software Engineer,
Product Realization
Data Products & Intelligence
T-Mobile
Email: dillon.du...@t-mobile.com