@Daniel, there are at least three things that EMR cannot solve yet:

- HA support
- AWS provides an auto scaling feature, but scaling EMR up or down still requires manual operations
- security concerns in a public VPC

EMR is basically designed for short-running use cases with pre-defined bootstrap actions and steps, so it mainly suits scheduled query processing. It is not a good fit as a permanently running cluster for ad hoc queries and analytical work.

Therefore, in our organization (an e-commerce company in Europe that most of you may never have heard of :p but we now have more than 1000 techies and 10k employees...), we built a solution for this: https://github.com/zalando/spark-appliance

It enables HA with ZooKeeper, its nodes sit in an auto scaling group and run in private subnets, it provides a REST API secured with OAuth, and it is even integrated with Jupyter Notebook :)

On Saturday, 20 February 2016, Sabarish Sasidharan wrote:
> EMR does cost more than vanilla EC2. Using spark-ec2 can result in savings with large clusters, though that is not everybody's cup of tea.
>
> Regards
> Sab
>
> On 19-Feb-2016 7:55 pm, "Daniel Siegmann" <daniel.siegm...@teamaol.com> wrote:
>>
>> With EMR supporting Spark, I don't see much reason to use the spark-ec2 script unless it is important for you to be able to launch clusters using the bleeding edge version of Spark. EMR does seem to do a pretty decent job of keeping up to date - the latest version (4.3.0) supports the latest Spark version (1.6.0).
>>
>> So I'd flip the question around and ask: is there any reason to continue using the spark-ec2 script rather than EMR?
>>
>> On Thu, Feb 18, 2016 at 11:39 AM, James Hammerton <ja...@gluru.co> wrote:
>>>
>>> I have now... So far I think the issues I've had are not related to this, but I wanted to be sure in case it should be something that needs to be patched. I've had some jobs run successfully but this warning appears in the logs.
>>> Regards,
>>> James
>>>
>>> On 18 February 2016 at 12:23, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>> Have you seen this ?
>>>> HADOOP-10988
>>>>
>>>> Cheers
>>>> On Thu, Feb 18, 2016 at 3:39 AM, James Hammerton <ja...@gluru.co> wrote:
>>>>>
>>>>> HI,
>>>>> I am seeing warnings like this in the logs when I run Spark jobs:
>>>>>
>>>>> OpenJDK 64-Bit Server VM warning: You have loaded library /root/ephemeral-hdfs/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
>>>>> It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
>>>>>
>>>>> I used spark-ec2 to launch the cluster with the default AMI, Spark 1.5.2, hadoop major version 2.4. I altered the jdk to be openjdk 8 as I'd written some jobs in Java 8. The 6 workers nodes are m4.2xlarge and master is m4.large.
>>>>> Could this contribute to any problems running the jobs?
>>>>> Regards,
>>>>> James
>>>
>>
>
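
For anyone who wants to check whether a native library such as the libhadoop.so in the warning above is flagged as needing an executable stack, here is a minimal Python sketch. It is an illustration, not the JVM's own check and not a replacement for `execstack`. It reads the `PT_GNU_STACK` program header of a 64-bit little-endian ELF image and tests the executable bit. The function name `requests_exec_stack` is made up for this example. It also reports a missing `PT_GNU_STACK` header as "executable", which is an assumption about how a cautious loader would treat such a file.

```python
import struct

PT_GNU_STACK = 0x6474E551  # program header type that carries the stack permissions
PF_X = 0x1                 # "executable" bit in p_flags


def requests_exec_stack(data: bytes) -> bool:
    """Return True if a 64-bit little-endian ELF image asks for an executable stack.

    Assumption of this sketch: a file with no PT_GNU_STACK header is also
    reported as True, because its stack permissions are left unspecified.
    """
    if len(data) < 64 or data[:4] != b"\x7fELF":
        raise ValueError("not a complete ELF header")
    if data[4] != 2 or data[5] != 1:
        raise ValueError("this sketch only handles 64-bit little-endian ELF")

    # ELF64 header fields: e_phoff at byte 32, e_phentsize at 54, e_phnum at 56.
    (e_phoff,) = struct.unpack_from("<Q", data, 32)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 54)

    for i in range(e_phnum):
        # In ELF64 each program header starts with p_type, then p_flags.
        p_type, p_flags = struct.unpack_from("<II", data, e_phoff + i * e_phentsize)
        if p_type == PT_GNU_STACK:
            return bool(p_flags & PF_X)
    return True  # no PT_GNU_STACK header found (see the assumption above)
```

On a cluster node this could be called as `requests_exec_stack(open("/root/ephemeral-hdfs/lib/native/libhadoop.so.1.0.0", "rb").read())`, using the path from the warning. A `True` result would suggest the library is a candidate for the `execstack -c <libfile>` fix that the warning itself recommends.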