It's great that the Spark scheduler does optimized DAG processing and only evaluates lazily when an action is performed or a shuffle dependency is encountered. Sometimes it goes even further past the shuffle dependency before executing anything: if there are map steps after the shuffle, it doesn't stop at the shuffle to execute anything, but keeps going through those map steps until it finds a reason (a Spark action) to execute. As a result, the work Spark is currently running can internally be a series of (map -> shuffle -> map -> map -> collect), while the Spark UI just shows it running a 'collect' stage. So if the job fails at that point, the Spark UI just says collect failed, but in fact the failure could be anywhere in that lazy chain of evaluation. Looking at executor logs gives some insight, but that's not always straightforward. Correct me if I am wrong here, but I think we need more visibility into what's happening underneath so we can easily troubleshoot as well as optimize our DAG.
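For what it's worth, here is a minimal sketch of the behaviour I mean (a hypothetical local job; the name LazyPipelineExample and the data are just for illustration): the narrow maps after the shuffle get pipelined into the final stage, so the UI attributes everything to the collect, while RDD.toDebugString at least shows the full lineage behind it.

```scala
import org.apache.spark.sql.SparkSession

object LazyPipelineExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LazyPipelineExample")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val data = sc.parallelize(1 to 1000)

    // Nothing runs yet: these are only recorded in the lineage.
    val pairs   = data.map(x => (x % 10, x))                 // narrow: pipelined
    val reduced = pairs.reduceByKey(_ + _)                   // shuffle dependency: stage boundary
    val scaled  = reduced.map { case (k, v) => (k, v * 2) }  // narrow: pipelined into the next stage
    val labeled = scaled.map { case (k, v) => s"$k -> $v" }  // still no execution

    // The action triggers the whole chain; the UI attributes the final stage
    // to this collect(), even though that stage also runs the post-shuffle maps.
    val result = labeled.collect()
    println(s"collected ${result.length} rows")

    // toDebugString prints the RDD lineage, which helps reconstruct the real
    // chain behind the stage the UI reports.
    println(labeled.toDebugString)

    spark.stop()
  }
}
```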
Thanks