Thanks for your response. I have a couple of questions if you don’t mind.


1) Are you saying that if I were to bring all of the data into HDFS on the
cluster, I could avoid this problem? The compressed Parquet form of the data
stored on S3 is only about 3 TB, so that could be an option (see the copy
sketch below, after question 2).

2) This one is a little unrelated, but I haven't been able to get executors to
run with more than 1 core when I'm starting the thrift server or the Spark SQL
shell, whether through the --executor-cores flag or the spark.executor.cores
parameter (sketched below with placeholder values). Is that also a known issue
that I'm not aware of? It's not too much of a problem, but it makes it hard to
utilize all of my available CPU when I want to leave enough memory on each
executor for larger reduce tasks.
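
On 1), here is roughly what I have in mind for the copy, just so we're talking
about the same thing (the bucket and paths below are placeholders, not our
actual layout):

    # copy the compressed Parquet data from S3 into HDFS on the EMR cluster
    # (s3-dist-cp ships with EMR; plain "hadoop distcp" is the generic alternative)
    s3-dist-cp --src s3://my-bucket/events/parquet/ --dest hdfs:///data/events/parquet/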
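
On 2), these are the two ways I have been trying to set executor cores when
starting the thrift server (the core and memory values here are just examples):

    # via the spark-submit style flag
    ./sbin/start-thriftserver.sh --master yarn \
      --executor-memory 3G \
      --executor-cores 4

    # or via the configuration property
    ./sbin/start-thriftserver.sh --master yarn \
      --executor-memory 3G \
      --conf spark.executor.cores=4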

Thanks,

Dillon Dukek
Software Engineer,
Product Realization
Data Products & Intelligence
T-Mobile
Cell: 360-316-9309
Email: dillon.du...@t-mobile.com

From: Teng Qiu [mailto:teng...@gmail.com]
Sent: Tuesday, February 16, 2016 12:11 PM
To: Dukek, Dillon <dillon.du...@t-mobile.com>
Cc: user@spark.apache.org
Subject: Re: Spark SQL step with many tasks takes a long time to begin 
processing

I believe this is a known issue when using Spark/Hive with files on S3: this
huge delay on the driver side is caused by partition listing and split
computation. It is more of a Hive issue, and since you are using the thrift
server, the SQL queries are running in HiveContext.
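
As a side note, here is a sketch of the kind of setting that is sometimes used
to parallelize the listing itself (the value is only an example, and whether it
helps with a HiveContext table on your Spark/Hadoop version is something you
would have to test):

    # let the Hadoop input format list input paths with multiple threads
    ./sbin/start-thriftserver.sh --master yarn \
      --conf spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads=20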

Qubole made some optimizations for this:
https://www.qubole.com/blog/product/optimizing-s3-bulk-listings-for-performant-hive-queries/
but they are not open sourced.

We at Zalando are using Spark on AWS as well, and we made some changes to
optimize insert performance for S3 files. You can find the changes here:
https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando

Hope this gives you and the Spark devs some ideas for fixing this listing issue.


2016-02-16 19:59 GMT+01:00 Dukek, Dillon <dillon.du...@t-mobile.com>:
Hello,

I have been working on a project that allows a BI tool to query roughly 25 TB
of application event data from 2015 using the thrift server and Spark SQL. In
general, the submitted jobs have a step that submits a very large number of
tasks, on the order of hundreds of thousands, equal to the number of files that
need to be processed from S3. Once the step actually starts running tasks, all
of the nodes begin executing them and the step finishes in a reasonable amount
of time. However, after the logs say that the step was submitted with ~200,000
tasks, there is a delay. I believe the delay is caused by a shuffle that
happens after the preceding step, which maps all of the files we are going to
process based on the date range specified in the query, but I'm not sure, as
I'm fairly new to Spark. I can see that the driver node is processing during
this time, but all of the other nodes are inactive. Is there a way to shorten
the delay? I'm using Spark 1.6 on EMR with YARN as the master and Ganglia to
monitor the nodes. I'm giving 15 GB to the driver and application master, and
3 GB per executor with one core each.
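
For reference, this is roughly how the thrift server is being launched (the
script path is the one from the Spark distribution; the memory and core values
are the ones described above):

    ./sbin/start-thriftserver.sh \
      --master yarn \
      --driver-memory 15G \
      --executor-memory 3G \
      --executor-cores 1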


Thanks,

Dillon Dukek
Software Engineer,
Product Realization
Data Products & Intelligence
T-Mobile
Email: dillon.du...@t-mobile.com

