> Why would you run the shuffle service on 10K nodes but Spark executors on just 100 nodes? wouldn't you also run that service just on the 100 nodes?
We have to start the service beforehand, out of band, and we don't know a
priori where the Spark executors will land. Those 100 executors could land
on any of the 10K nodes.

> What does plumbing it through HDFS buy you in comparison?

It drops the shuffle service requirement, which is HUGE. It means Spark can
completely vacate a machine when it's not in use, which is crucial for a
large, multi-tenant cluster. ShuffledRDDs can now read the map files from
HDFS rather than from the ancestor executors, which means we can shut
executors down immediately after the shuffle files are written.

> There's some additional overhead and if anything you lose some control
> over locality, in a context where I presume HDFS itself is storing data
> on much more than the 100 Spark nodes.

Write locality would be sacrificed, but the descendant executors were
already doing remote reads (they have to read from multiple ancestor
executors), so there's no additional cost in read locality. In fact, if we
take advantage of HDFS's favored-nodes feature, we could make it likely
that all map files for a given partition land on the same node, so the
descendant executor would never have to do a remote read (rough sketch of
the write path below). We'd effectively shift the remote I/O from the read
side to the write side, for theoretically no change in performance.

In summary:

Advantages:
- No shuffle service dependency (increased utilization, decreased
  management cost)
- Shut executors down immediately after shuffle files are written, rather
  than waiting for a timeout (increased utilization)
- HDFS is HA, so shuffle files survive a node failure, which isn't true
  for the shuffle service (decreased latency during failures)
- Potential ability to parallelize shuffle file reads if we write a new
  shuffle iterator (decreased latency)

Disadvantages:
- Increased write latency (though potentially not, if we implement it
  efficiently; see above)
- Would need some sort of GC on HDFS shuffle files
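To make the favored-nodes idea concrete, here's a rough, untested Scala
sketch of what the map-side write could look like. Everything here is
illustrative: the helper name, path layout, and datanode port are made up,
and it assumes we can predict (or pin) which host will read a given reduce
partition. The favored-nodes create() overload on DistributedFileSystem is,
if I'm reading the HDFS API right, the same hint HBase uses for locality:

  import java.net.InetSocketAddress

  import org.apache.hadoop.fs.{FSDataOutputStream, Path}
  import org.apache.hadoop.fs.permission.FsPermission
  import org.apache.hadoop.hdfs.DistributedFileSystem

  // Hypothetical map-side write: each map task writes its output for one
  // reduce partition to HDFS, hinting that the first replica should land
  // on the node we expect to read that partition.
  def createShuffleFile(
      fs: DistributedFileSystem,
      appId: String,
      shuffleId: Int,
      mapId: Int,
      reduceId: Int,
      readerHost: String): FSDataOutputStream = {
    // Made-up layout; real naming (and GC of these files) needs thought.
    val path = new Path(s"/spark-shuffle/$appId/$shuffleId/$reduceId/map_$mapId")
    // 50010 is the default datanode transfer port (dfs.datanode.address)
    // in Hadoop 2.x.
    val favoredNodes = Array(new InetSocketAddress(readerHost, 50010))
    fs.create(
      path,
      FsPermission.getFileDefault(),
      true,                            // overwrite
      fs.getConf.getInt("io.file.buffer.size", 4096),
      fs.getDefaultReplication(path),  // keep replication, keep the HA benefit
      fs.getDefaultBlockSize(path),
      null,                            // no Progressable
      favoredNodes)
  }

If every map task writes partition R's output with R's reader as the
favored node, the eventual read should be node-local, which is what shifts
the remote I/O from the read side to the write side. Favored nodes are only
a hint, though, so a remote read remains the fallback when the namenode
can't honor it.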
On Thu, Apr 28, 2016 at 1:36 AM, Sean Owen <so...@cloudera.com> wrote:

> Why would you run the shuffle service on 10K nodes but Spark executors
> on just 100 nodes? wouldn't you also run that service just on the 100
> nodes?
>
> What does plumbing it through HDFS buy you in comparison? There's some
> additional overhead and if anything you lose some control over
> locality, in a context where I presume HDFS itself is storing data on
> much more than the 100 Spark nodes.
>
> On Thu, Apr 28, 2016 at 1:34 AM, Michael Gummelt <mgumm...@mesosphere.io>
> wrote:
> >> Are you suggesting to have shuffle service persist and fetch data with
> >> hdfs, or skip shuffle service altogether and just write to hdfs?
> >
> > Skip shuffle service altogether. Write to HDFS.
> >
> > Mesos environments tend to be multi-tenant, and running the shuffle
> > service on all nodes could be extremely wasteful. If you're running a
> > 10K node cluster, and you'd like to run a Spark job that consumes 100
> > nodes, you would have to run the shuffle service on all 10K nodes out
> > of band of Spark (e.g. Marathon). I'd like a solution for dynamic
> > allocation that doesn't require this overhead.
> >
> > I'll look at SPARK-1529.
> >
> > On Wed, Apr 27, 2016 at 10:24 AM, Steve Loughran <ste...@hortonworks.com>
> > wrote:
> >>
> >> > On 27 Apr 2016, at 04:59, Takeshi Yamamuro <linguin....@gmail.com>
> >> > wrote:
> >> >
> >> > Hi, all
> >> >
> >> > See SPARK-1529 for related discussion.
> >> >
> >> > // maropu
> >>
> >> I'd not seen that discussion.
> >>
> >> >> I'm actually curious about why the 15% diff in performance between
> >> >> Java NIO and Hadoop FS APIs, and, if it is the case (Hadoop still
> >> >> uses the pre-NIO libraries), *has anyone thought of just fixing the
> >> >> Hadoop Local FS codepath?*
> >>
> >> It's not like anyone hasn't filed JIRAs on that ... it's just that
> >> nothing has ever got to a state where it was considered ready to
> >> adopt, where "ready" means: passes all unit and load tests against
> >> Linux, Unix, and Windows filesystems. There have been some attempts,
> >> but they never quite got much engagement or support, especially as
> >> NIO wasn't there properly until Java 7, and Hadoop was stuck on Java 6
> >> support until 2015. That's no longer a constraint: someone could do
> >> the work, using the existing JIRAs as starting points.
> >>
> >> If someone did do this in RawLocalFS, it'd be nice if the patch also
> >> allowed you to turn off CRC creation and checking.
> >>
> >> That's not only part of the overhead, it means that flush() doesn't,
> >> not until you reach the end of a CRC32 block ... so breaking what few
> >> durability guarantees POSIX offers.
> >
> > --
> > Michael Gummelt
> > Software Engineer
> > Mesosphere
>

--
Michael Gummelt
Software Engineer
Mesosphere