Hi Matei, I'm going to investigate from both the Mesos and Spark sides and will hopefully have a good long-term solution. In the meantime, having a workaround to start with is going to unblock folks.
Tim

On Mon, Aug 25, 2014 at 1:08 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> Anyway it would be good if someone from the Mesos side investigates this and
> proposes a solution. The 32 MB per task hack isn't completely foolproof
> either (e.g. people might allocate all the RAM to their executor and thus
> stop being able to launch tasks), so maybe we wait on a Mesos fix for this
> one.
>
> Matei
>
> On August 25, 2014 at 1:07:15 PM, Matei Zaharia (matei.zaha...@gmail.com) wrote:
>
> This is kind of weird then; it seems perhaps unrelated to this issue (or at
> least to the way I understood it). Is the problem maybe that Mesos saw 0 MB
> being freed and didn't re-offer the machine *even though there was more than
> 32 MB free overall*?
>
> Matei
>
> On August 25, 2014 at 12:59:59 PM, Cody Koeninger (c...@koeninger.org) wrote:
>
> I definitely saw a case where
>
> a. the only job running was a 256m shell
> b. I started a 2g job
> c. a little while later, the same user as in (a) started another 256m shell
>
> My job immediately stopped making progress. Once user (a) killed his shells,
> it started again.
>
> This is on nodes with ~15G of memory, on which we have successfully run 8G
> jobs.
>
> On Mon, Aug 25, 2014 at 2:02 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>
>> BTW it seems to me that even without that patch, you should be getting
>> tasks launched as long as you leave at least 32 MB of memory free on each
>> machine (that is, the sum of the executor memory sizes is not exactly the
>> same as the total size of the machine). Then Mesos will be able to re-offer
>> that machine whenever CPUs free up.
>>
>> Matei
>>
>> On August 25, 2014 at 5:05:56 AM, Gary Malouf (malouf.g...@gmail.com) wrote:
>>
>> We have not tried the workaround because there are other bugs in there
>> that affected our set-up, though it seems it would help.
>>
>> On Mon, Aug 25, 2014 at 12:54 AM, Timothy Chen <tnac...@gmail.com> wrote:
>>
>> > +1 to have the workaround in.
>> >
>> > I'll be investigating from the Mesos side too.
>> >
>> > Tim
>> >
>> > On Sun, Aug 24, 2014 at 9:52 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>> > >
>> > > Yeah, Mesos in coarse-grained mode probably wouldn't work here. It's
>> > > too bad that this happens in fine-grained mode -- it would be really
>> > > good to fix. I'll see if we can get the workaround in
>> > > https://github.com/apache/spark/pull/1860 into Spark 1.1. Incidentally,
>> > > have you tried that?
>> > >
>> > > Matei
>> > >
>> > > On August 23, 2014 at 4:30:27 PM, Gary Malouf (malouf.g...@gmail.com) wrote:
>> > >
>> > > Hi Matei,
>> > >
>> > > We have an analytics team that uses the cluster on a daily basis. They
>> > > use two types of 'run modes':
>> > >
>> > > 1) For running actual queries, they set spark.executor.memory to
>> > > something between 4 and 8 GB of RAM per worker.
>> > >
>> > > 2) A shell that takes a minimal amount of memory on workers (128 MB)
>> > > for prototyping out a larger query. This lets them avoid taking up RAM
>> > > on the cluster when they do not really need it.
>> > >
>> > > We see the deadlocks when there are a few shells running in either
>> > > mode. Given our usage patterns, coarse-grained mode would be a
>> > > challenge, as we would have to constantly remind people to kill their
>> > > shells as soon as their queries finish.
>> > >
>> > > Am I correct in viewing Mesos in coarse-grained mode as being similar
>> > > to Spark Standalone's CPU allocation behavior?
>> > >
>> > > On Sat, Aug 23, 2014 at 7:16 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>> > >
>> > > Hey Gary, just as a workaround, note that you can use Mesos in
>> > > coarse-grained mode by setting spark.mesos.coarse=true. Then it will
>> > > hold onto CPUs for the duration of the job.
>> > >
>> > > Matei
>> > >
>> > > On August 23, 2014 at 7:57:30 AM, Gary Malouf (malouf.g...@gmail.com) wrote:
>> > >
>> > > I just wanted to bring up a significant Mesos/Spark issue that makes
>> > > the combo difficult to use for teams larger than 4-5 people. It's
>> > > covered in https://issues.apache.org/jira/browse/MESOS-1688. My
>> > > understanding is that Spark's use of executors in fine-grained mode is
>> > > very different behavior from that of many of the other common
>> > > frameworks for Mesos.
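
For readers following the thread: below is a minimal sketch of the two mitigations discussed above, assuming Spark 1.x on Mesos. The Mesos master URL, app name, and exact memory values are illustrative, not taken from the thread; only spark.mesos.coarse and spark.executor.memory are the settings actually named above.

import org.apache.spark.{SparkConf, SparkContext}

// Coarse-grained mode, as Matei suggests above: executors hold their CPUs
// for the whole lifetime of the job instead of acquiring and releasing them
// per task as in fine-grained mode, sidestepping the re-offer deadlock.
val conf = new SparkConf()
  .setMaster("mesos://zk://zk1:2181/mesos") // illustrative Mesos master URL
  .setAppName("analytics-query")            // illustrative app name
  .set("spark.mesos.coarse", "true")
  // Interim advice from the thread: size executors below the machine's
  // total RAM so some memory (at least ~32 MB) stays free and Mesos can
  // re-offer the machine when CPUs free up; e.g. 8g on a ~15G node.
  .set("spark.executor.memory", "8g")

val sc = new SparkContext(conf)

The trade-off, raised by Gary above, is that a coarse-grained shell keeps its CPUs even while idle, so prototyping shells have to be killed promptly once their queries finish.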