Re: 10hrs of Scheduler Delay

Darren Govoni Fri, 22 Jan 2016 13:55:00 -0800

Thanks for the tip. I will try it. But this is the kind of thing Spark is 
supposed to figure out and handle, or at least not get stuck on forever.

Sent from my Verizon Wireless 4G LTE smartphone

-------- Original message --------
From: Muthu Jayakumar <bablo...@gmail.com> 
Date: 01/22/2016  3:50 PM  (GMT-05:00) 
To: Darren Govoni <dar...@ontrenet.com>, "Sanders, Isaac B" 
<sande...@rose-hulman.edu>, Ted Yu <yuzhih...@gmail.com> 
Cc: user@spark.apache.org 
Subject: Re: 10hrs of Scheduler Delay 

Does increasing the number of partitions help? You could try something like 3 
times what you currently have. Another trick I used was to partition the 
problem into multiple DataFrames, run them sequentially, persist each 
result, and then run a union on the results. 
Hope this helps. 
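Muthu's second trick (split the work, run the pieces one after another, persist each result, then take the union) can be illustrated without a cluster. This is a minimal pure-Python sketch, assuming the work is a nearest-neighbor-distance computation like the one discussed later in this thread; `nearest_distance` and `batched_nearest_distances` are hypothetical names, not part of Spark or spark_dbscan. Only the query points are split into batches. Each batch is still compared against the full point set, so the combined result stays exact, whereas splitting the points themselves would miss neighbors that land in a different piece.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def nearest_distance(query: Point, points: Sequence[Point], skip_index: int) -> float:
    """Distance from `query` to its nearest neighbor in `points`, ignoring itself."""
    best = math.inf
    for j, other in enumerate(points):
        if j == skip_index:
            continue  # do not compare a point with itself
        best = min(best, math.dist(query, other))
    return best


def batched_nearest_distances(points: Sequence[Point], num_batches: int) -> List[float]:
    """Process query points in sequential batches, then 'union' the per-batch results."""
    n = len(points)
    batch_size = max(1, math.ceil(n / num_batches))
    results: List[float] = []
    for start in range(0, n, batch_size):
        # Each batch is an independent unit of work. In Spark this would be one
        # DataFrame/RDD that is persisted before the next batch is started.
        batch = [
            nearest_distance(points[i], points, skip_index=i)
            for i in range(start, min(start + batch_size, n))
        ]
        results.extend(batch)  # the union step
    return results
```

In Spark terms, each batch would be a separate DataFrame or RDD that is persisted before the next one runs, and `extend` plays the role of the final union (`unionAll` on a Spark 1.4 DataFrame, `union` on an RDD). The benefit is that each stage is smaller, not that the total work shrinks.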

On Fri, Jan 22, 2016, 3:48 AM Darren Govoni <dar...@ontrenet.com> wrote:

Me too. I had to shrink my dataset to get it to work. For us, at least, Spark 
seems to have scaling issues.

-------- Original message --------
From: "Sanders, Isaac B" <sande...@rose-hulman.edu> 
Date: 01/21/2016  11:18 PM  (GMT-05:00) 
To: Ted Yu <yuzhih...@gmail.com> 
Cc: user@spark.apache.org 
Subject: Re: 10hrs of Scheduler Delay 

I have run the driver on a smaller dataset (k=2, n=5000) and it worked quickly 
and didn’t hang like this. This dataset is closer to k=10, n=4.4m, but I am 
using more resources on this one.
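The two runs differ by far more than "a bigger dataset" suggests if the work grows quadratically. A back-of-the-envelope sketch, assuming (1) that k is the number of dimensions per point, which the thread does not actually say, and (2) that nearest-neighbor distances are found by comparing every pair of points, which has not been checked against the linked source. `pairwise_work` is a hypothetical helper.

```python
def pairwise_work(n: int, k: int) -> int:
    """Coordinate operations to compare every pair of n points once.

    Assumes k is the number of dimensions per point (not stated in the thread)
    and an all-pairs comparison (not verified against spark_dbscan).
    """
    pairs = n * (n - 1) // 2
    return pairs * k


small = pairwise_work(5_000, 2)       # the run that finished quickly
large = pairwise_work(4_400_000, 10)  # the run that stalls
print(f"{large / small:,.0f}x more work")
```

Under those assumptions the larger run does roughly 3.9 million times as much distance work, so a fast run on the small set says little about the large one. If k means something else, drop the k factor; the n² term alone is still about 775,000 times larger.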

- Isaac

On Jan 21, 2016, at 11:06 PM, Ted Yu <yuzhih...@gmail.com> wrote:

You may have seen the following on the GitHub page:

Latest commit 50fdf0e  on Feb 22, 2015

That was 11 months ago.

Can you search for a similar algorithm that runs on Spark and is more recent?

If nothing is found, consider running the project's own tests to 
determine whether the delay is intrinsic.

Cheers

On Thu, Jan 21, 2016 at 7:46 PM, Sanders, Isaac B 
<sande...@rose-hulman.edu> wrote:

That thread seems to be moving; it oscillates between a few different traces… 
Maybe it is working. It seems odd that it would take that long.

This is 3rd party code, and after looking at some of it, I think it might not 
be as Spark-y as it could be.

I linked it below. I don’t know a lot about Spark, so it might be fine, but I 
have my suspicions.

https://github.com/alitouka/spark_dbscan/blob/master/src/src/main/scala/org/alitouka/spark/dbscan/exploratoryAnalysis/DistanceToNearestNeighborDriver.scala

- Isaac

On Jan 21, 2016, at 10:08 PM, Ted Yu <yuzhih...@gmail.com> wrote:

You may have noticed the following. Does this indicate prolonged computation in 
your code?

org.apache.commons.math3.util.MathArrays.distance(MathArrays.java:205)
org.apache.commons.math3.ml.distance.EuclideanDistance.compute(EuclideanDistance.java:34)
org.alitouka.spark.dbscan.spatial.DistanceCalculation$class.calculateDistance(DistanceCalculation.scala:15)
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver$.calculateDistance(DistanceToNearestNeighborDriver.scala:16)
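For reference, the innermost frame, `MathArrays.distance`, computes the ordinary Euclidean distance between two points. The Python below is an illustrative equivalent, not the commons-math source: each call is cheap and linear in the number of coordinates, so repeated traces ending here are consistent with a job making an enormous number of these calls rather than one that is blocked, though a handful of traces cannot prove which.

```python
import math
from typing import Sequence


def euclidean_distance(p1: Sequence[float], p2: Sequence[float]) -> float:
    """Square root of the summed squared coordinate differences."""
    if len(p1) != len(p2):
        # This sketch rejects mismatched dimensions explicitly.
        raise ValueError("points must have the same dimension")
    total = 0.0
    for a, b in zip(p1, p2):
        diff = a - b
        total += diff * diff  # one multiply-add per dimension: cost grows with k
    return math.sqrt(total)
```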

On Thu, Jan 21, 2016 at 5:13 PM, Sanders, Isaac B 
<sande...@rose-hulman.edu> wrote:

Hadoop is: HDP 2.3.2.0-2950

Here is a gist (pastebin) of my versions en masse and a stacktrace: 
https://gist.github.com/isaacsanders/2e59131758469097651b

Thanks

On Jan 21, 2016, at 7:44 PM, Ted Yu <yuzhih...@gmail.com> wrote:

Looks like you were running on YARN.

What Hadoop version are you using?

Can you capture a few stack traces of the AppMaster during the delay and 
pastebin them ?
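Taking several thread dumps a few seconds apart is what distinguishes a thread stuck in one place from one that keeps moving between frames. A minimal sketch, assuming the JDK's `jstack` tool is on the PATH of the node running the AppMaster and that you have already found that JVM's process id (for example with the JDK's `jps` tool). `capture_thread_dumps` is a hypothetical helper, not part of Spark or YARN.

```python
import subprocess
import time
from pathlib import Path
from typing import List


def capture_thread_dumps(pid: int, count: int = 5, interval_s: float = 10.0,
                         out_dir: str = ".", tool: str = "jstack") -> List[Path]:
    """Run `<tool> <pid>` `count` times, `interval_s` seconds apart, saving each dump."""
    paths: List[Path] = []
    for i in range(count):
        # check=True raises CalledProcessError if the tool fails (e.g. wrong pid)
        result = subprocess.run([tool, str(pid)], capture_output=True,
                                text=True, check=True)
        path = Path(out_dir) / f"dump-{pid}-{i}.txt"
        path.write_text(result.stdout)
        paths.append(path)
        if i < count - 1:
            time.sleep(interval_s)  # spacing the dumps shows whether threads progress
    return paths
```

Diffing consecutive files then shows which threads stayed on the same frames between dumps and which moved.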

Thanks

On Thu, Jan 21, 2016 at 8:08 AM, Sanders, Isaac B 
<sande...@rose-hulman.edu> wrote:

The Spark Version is 1.4.1

The logs are full of standard fare, nothing like an exception or even 
interesting [INFO] lines.

Here is the script I am using: 
https://gist.github.com/isaacsanders/660f480810fbc07d4df2

Thanks
Isaac

On Jan 21, 2016, at 11:03 AM, Ted Yu <yuzhih...@gmail.com> wrote:

Can you provide a bit more information ?

- the command line used to submit the Spark job
- the Spark version
- anything interesting from the driver / executor logs

Thanks 

On Thu, Jan 21, 2016 at 7:35 AM, Sanders, Isaac B 
<sande...@rose-hulman.edu> wrote:

Hey all,

I am a CS student in the United States working on my senior thesis.

My thesis uses Spark, and I am encountering some trouble.

I am using 
https://github.com/alitouka/spark_dbscan, and to determine parameters, I am 
using the utility class they supply, 
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver.

I am on a 10 node cluster with one machine with 8 cores and 32G of memory and 
nine machines with 6 cores and 16G of memory.
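The cluster described above has 1 × 8 + 9 × 6 = 62 cores. Spark's tuning guide recommends roughly 2 to 3 tasks per CPU core as a starting level of parallelism. A sketch of that arithmetic, assuming all 62 cores are available to executors (on YARN some are taken by the ApplicationMaster and other services, so treat the result as an upper bound); the helper names are hypothetical.

```python
from typing import Iterable, Tuple


def total_cores(nodes: Iterable[Tuple[int, int]]) -> int:
    """Sum cores over (machine_count, cores_per_machine) groups."""
    return sum(count * cores for count, cores in nodes)


def suggested_partitions(cores: int, tasks_per_core: int = 3) -> int:
    """Starting partition count from the 2-3 tasks-per-core rule of thumb."""
    return cores * tasks_per_core


# (machines, cores per machine) as described in the message above
cluster = [(1, 8), (9, 6)]
cores = total_cores(cluster)
print(cores, suggested_partitions(cores, 2), suggested_partitions(cores, 3))
# prints: 62 124 186
```

A value in that range could then be passed to `repartition` on the input RDD before the expensive stage; it only spreads the work more evenly and does not reduce the total amount of computation.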

I have 442M of data, which seems like it would be a joke, but the job stalls at 
the last stage.

It was stuck in Scheduler Delay for 10 hours overnight, and I have tried a 
number of things for the last couple days, but nothing seems to be helping.

I have tried:

- Increasing heap sizes and numbers of cores

- More/less executors with different amounts of resources.

- Kryo serialization

- FAIR Scheduling

It doesn’t seem like it should require this much. Any ideas?

- Isaac
