for 1) yes, i believe changing to hdfs will solve the problem, but i would
say changing the storage layer is already a problem in itself... :D
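
if you do decide to pull the data in, a rough sketch would be s3-dist-cp on
the EMR master node (the bucket and paths here are just placeholders, adjust
for your data layout):

  s3-dist-cp --src s3://your-bucket/events/parquet/ --dest hdfs:///data/events/parquet/

whether 3 TB fits comfortably of course depends on the hdfs capacity and
replication factor of your core nodes.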

2) you are using EMR, right? so spark is running on yarn i believe, and on
yarn spark.executor.cores defaults to 1. if you want to change this, you
may need to change the yarn settings as well.
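
as a sketch (the numbers below are just examples, adjust for your node sizes):
pass the cores explicitly when starting the thrift server, and check that yarn
is actually scheduling by cpu and not only by memory. with the default
DefaultResourceCalculator the capacity scheduler ignores vcores, so the
resourcemanager UI may show 1 vcore per container even if you requested more.

  # start the thrift server with explicit executor resources
  sbin/start-thriftserver.sh --master yarn \
    --num-executors 20 --executor-memory 3g --executor-cores 4

  # in capacity-scheduler.xml, make yarn account for cpu as well as memory
  yarn.scheduler.capacity.resource-calculator =
    org.apache.hadoop.yarn.util.resource.DominantResourceCalculator

also worth checking yarn.nodemanager.resource.cpu-vcores and
yarn.scheduler.maximum-allocation-vcores if containers still come up with
fewer cores than requested.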

2016-02-16 23:10 GMT+01:00 Dukek, Dillon <dillon.du...@t-mobile.com>:

> Thanks for your response. I have a couple of questions if you don’t mind.
>
>
>
> 1)      Are you saying that if I were to bring all of the data into hdfs
> on the cluster, I could avoid this problem? My compressed parquet form
> of the data that is stored on s3 is only about 3 TB, so that could be an
> option.
>
> 2)      This one is a little unrelated, but I haven’t been able to get
> executors to run with more than 1 core when I’m starting the thrift server
> or the spark sql shell using the --executor-cores flag or the
> spark.executor.cores parameter. Is that also a pre-existing issue that I’m
> not aware of? It’s not too much of a problem; however, it makes it a little
> hard to utilize all of my available cpu when I want to leave enough memory
> on executors for any larger reduce tasks.
>
>
>
> Thanks,
>
>
>
> Dillon Dukek
>
> Software Engineer,
>
> Product Realization
>
> Data Products & Intelligence
>
> T-Mobile
>
> Cell: 360-316-9309
>
> Email: dillon.du...@t-mobile.com
>
>
>
> *From:* Teng Qiu [mailto:teng...@gmail.com]
> *Sent:* Tuesday, February 16, 2016 12:11 PM
> *To:* Dukek, Dillon <dillon.du...@t-mobile.com>
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark SQL step with many tasks takes a long time to begin
> processing
>
>
>
> i believe this is a known issue when using spark/hive with files on s3.
> this huge delay on the driver side is caused by partition listing and split
> computation, and it is more of a hive issue: since you are using the
> thrift server, the sql queries are running in HiveContext.
>
>
>
> qubole made some optimizations for this:
> https://www.qubole.com/blog/product/optimizing-s3-bulk-listings-for-performant-hive-queries/
> but their changes are not open sourced.
>
>
>
> we at zalando are using spark on aws as well; we made some changes to
> optimize insert performance for s3 files. you can find the changes here:
> https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
>
>
>
> hope this gives you / the spark devs some ideas for fixing this listing issue.
>
>
>
>
>
> 2016-02-16 19:59 GMT+01:00 Dukek, Dillon <dillon.du...@t-mobile.com>:
>
> Hello,
>
>
>
> I have been working on a project that allows a BI tool to query roughly 25
> TB of application event data from 2015 using the thrift server and Spark
> SQL. In general, the jobs that are submitted have a step that launches many
> tasks, on the order of hundreds of thousands, equal to the number of files
> that need to be processed from s3. Once the step actually starts running
> tasks, all of the nodes begin executing them and the step finishes in a
> reasonable amount of time. However, after the logs say that the step was
> submitted with ~200,000 tasks, there is a delay. I believe that the delay is
> caused by a shuffle that happens after the preceding step, which maps all of
> the files that we are going to process based on the date range specified in
> the query, but I’m not sure, as I’m fairly new to Spark. I can see that the
> driver node is processing during this time, but all of the other nodes are
> inactive. I was wondering if there is a way to shorten the delay? I’m using
> Spark 1.6 on EMR with yarn as the master, and ganglia to monitor the nodes.
> I’m giving 15GB to the driver and application master, and 3GB per executor
> with one core each.
>
>
>
>
>
> Thanks,
>
>
>
> Dillon Dukek
>
> Software Engineer,
>
> Product Realization
>
> Data Products & Intelligence
>
> T-Mobile
>
> Email: dillon.du...@t-mobile.com
>
>
>
>
>
