For 1) yes, I believe changing to HDFS would solve the problem, but I would say changing storage is already a problem... :D

For 2) you use EMR, right? So Spark is running on YARN, I believe, and spark.executor.cores defaults to 1 there; if you want to change this, you may need to change the YARN settings as well. Rough sketches for both points below.
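For 1), if you do decide to pull the data into HDFS: on EMR you could use s3-dist-cp (or plain hadoop distcp) for the one-off copy. Just a sketch, the bucket and paths here are made up, adjust them to your layout:

    # one-off copy of the parquet data from S3 into HDFS on the cluster
    # (s3-dist-cp ships with EMR; src/dest paths below are placeholders)
    s3-dist-cp \
      --src s3://your-bucket/events/2015/parquet/ \
      --dest hdfs:///data/events/2015/parquet/

With ~3 TB compressed this is doable, but remember HDFS replication: with the default factor of 3 you need roughly 9 TB of raw disk on the core nodes.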
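For 2), what I mean in practice, just a sketch, the numbers are examples and not a recommendation:

    # ask for more cores per executor when starting the thrift server
    ./sbin/start-thriftserver.sh \
      --master yarn \
      --executor-memory 3g \
      --executor-cores 4

    # i believe the default capacity scheduler on YARN allocates only by
    # memory and reports 1 vcore per container; if your executors really
    # get only one core, try setting this in capacity-scheduler.xml and
    # restarting the resource manager:
    #
    #   yarn.scheduler.capacity.resource-calculator =
    #     org.apache.hadoop.yarn.util.resource.DominantResourceCalculator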
2016-02-16 23:10 GMT+01:00 Dukek, Dillon <dillon.du...@t-mobile.com>:

> Thanks for your response. I have a couple of questions if you don't mind.
>
> 1) Are you saying that if I were to bring all of the data into HDFS on
> the cluster, I could avoid this problem? The compressed parquet form of
> the data stored on S3 is only about 3 TB, so that could be an option.
>
> 2) This one is a little unrelated, but I haven't been able to get
> executors to run with more than one core when I'm starting the thrift
> server or the Spark SQL shell, using either the --executor-cores flag or
> the spark.executor.cores parameter. Is that also a pre-existing issue
> that I'm not aware of? It's not too much of a problem, but it makes it a
> little hard to utilize all of my available CPU when I want to leave
> enough memory on executors for any larger reduce tasks.
>
> Thanks,
>
> Dillon Dukek
> Software Engineer, Product Realization
> Data Products & Intelligence
> T-Mobile
> Cell: 360-316-9309
> Email: dillon.du...@t-mobile.com
>
> From: Teng Qiu [mailto:teng...@gmail.com]
> Sent: Tuesday, February 16, 2016 12:11 PM
> To: Dukek, Dillon <dillon.du...@t-mobile.com>
> Cc: user@spark.apache.org
> Subject: Re: Spark SQL step with many tasks takes a long time to begin
> processing
>
> I believe this is a known issue when using Spark/Hive with files on S3:
> this huge delay on the driver side is caused by partition listing and
> split computation, and it is more of a Hive issue, since you are using
> the thrift server and the SQL queries are running in HiveContext.
>
> Qubole made some optimizations for this:
> https://www.qubole.com/blog/product/optimizing-s3-bulk-listings-for-performant-hive-queries/
> but it is not open sourced.
>
> We, at Zalando, are using Spark on AWS as well, and we made some changes
> to optimize insert performance for S3 files; you can find the changes
> here:
> https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
>
> Hope this gives you/the Spark devs some ideas to fix this listing issue.
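Btw, one more thing you could try against the listing delay in the meantime: Hadoop can list the input directories with several threads instead of one, which sometimes helps with huge file counts on S3. A sketch, the thread count is just a starting point:

    # let FileInputFormat list input paths with multiple threads; the
    # spark.hadoop.* prefix passes the setting through to the hadoop conf
    ./sbin/start-thriftserver.sh \
      --conf spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads=32

If the table is a plain parquet datasource table rather than a Hive table, there is also spark.sql.sources.parallelPartitionDiscovery.threshold, but I am not sure how much it helps in 1.6.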
> 2016-02-16 19:59 GMT+01:00 Dukek, Dillon <dillon.du...@t-mobile.com>:
>
> Hello,
>
> I have been working on a project that allows a BI tool to query roughly
> 25 TB of application event data from 2015 using the thrift server and
> Spark SQL. In general, the submitted jobs have a step with a very large
> number of tasks, on the order of hundreds of thousands, equal to the
> number of files that need to be processed from S3. Once that step
> actually starts running tasks, all of the nodes begin executing them and
> it finishes in a reasonable amount of time. However, after the logs say
> that the step was submitted with ~200,000 tasks, there is a delay. I
> believe the delay is caused by a shuffle step that happens after the
> stage that maps all of the files we are going to process based on the
> date range specified in the query, but I'm not sure, as I'm fairly new
> to Spark. I can see that the driver node is processing during this time,
> but all of the other nodes are inactive. I was wondering if there is a
> way to shorten the delay? I'm using Spark 1.6 on EMR with YARN as the
> master, and Ganglia to monitor the nodes. I'm giving 15GB to the driver
> and application master, and 3GB per executor with one core each.
>
> Thanks,
>
> Dillon Dukek
> Software Engineer, Product Realization
> Data Products & Intelligence
> T-Mobile
> Email: dillon.du...@t-mobile.com