Meng's proposal for GSoC 2013

Project: http://git-wip-us.apache.org/repos/asf/cloudstack/repo
Commit: http://git-wip-us.apache.org/repos/asf/cloudstack/commit/6e245422
Tree: http://git-wip-us.apache.org/repos/asf/cloudstack/tree/6e245422
Diff: http://git-wip-us.apache.org/repos/asf/cloudstack/diff/6e245422

Branch: refs/heads/disk_io_throttling
Commit: 6e2454228313c3373bfa96ef623a4b6ceb88d0ee
Parents: c8d607e
Author: kyrameng <meng...@ufl.edu>
Authored: Sun Jun 9 19:09:20 2013 -0400
Committer: Sebastien Goasguen <run...@gmail.com>
Committed: Mon Jun 10 03:00:30 2013 -0400

----------------------------------------------------------------------
 docs/en-US/CloudStack_GSoC_Guide.xml |   2 +-
 docs/en-US/gsoc-meng.xml             | 235 ++++++++++++++++++++++++++++++
 2 files changed, 236 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/cloudstack/blob/6e245422/docs/en-US/CloudStack_GSoC_Guide.xml
----------------------------------------------------------------------
diff --git a/docs/en-US/CloudStack_GSoC_Guide.xml 
b/docs/en-US/CloudStack_GSoC_Guide.xml
index 243a0ca..1f43593 100644
--- a/docs/en-US/CloudStack_GSoC_Guide.xml
+++ b/docs/en-US/CloudStack_GSoC_Guide.xml
@@ -49,6 +49,6 @@
     <xi:include href="gsoc-tuna.xml" 
xmlns:xi="http://www.w3.org/2001/XInclude"; />
     <xi:include href="gsoc-imduffy15.xml" 
xmlns:xi="http://www.w3.org/2001/XInclude"; />
     <xi:include href="gsoc-dharmesh.xml" 
xmlns:xi="http://www.w3.org/2001/XInclude"; />
-
+    <xi:include href="gsoc-meng.xml" 
xmlns:xi="http://www.w3.org/2001/XInclude"; />
 </book>
 

http://git-wip-us.apache.org/repos/asf/cloudstack/blob/6e245422/docs/en-US/gsoc-meng.xml
----------------------------------------------------------------------
diff --git a/docs/en-US/gsoc-meng.xml b/docs/en-US/gsoc-meng.xml
new file mode 100644
index 0000000..1de259d
--- /dev/null
+++ b/docs/en-US/gsoc-meng.xml
@@ -0,0 +1,235 @@
+<?xml version='1.0' encoding='utf-8' ?>
+<!DOCTYPE section PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" 
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"; [
+<!ENTITY % BOOK_ENTITIES SYSTEM "CloudStack_GSoC_Guide.ent">
+%BOOK_ENTITIES;
+]>
+
+<!-- Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+ 
+   http://www.apache.org/licenses/LICENSE-2.0
+ 
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+
+<chapter id="gsoc-meng">
+        <title>Meng's 2013 GSoC Proposal</title>
+        <para>This chapter describes Meng's 2013 Google Summer of Code project within the &PRODUCT; ASF project. It is a verbatim copy of the submitted proposal.</para>
+       <section id="Project-Description">
+               <title>Project Description</title>
+               <para>
+                       Getting a hadoop cluster going can be challenging and painful due to the tedious configuration phase and the diverse idiosyncrasies of each cloud provider. Apache Whirr<ulink url="http://whirr.apache.org/"><citetitle>[1]</citetitle></ulink> and Provisionr are libraries for running cloud services in an automatic or semi-automatic fashion. They take advantage of a cloud-neutral library called jclouds<ulink url="http://www.jclouds.org/documentation/gettingstarted/what-is-jclouds/"><citetitle>[2]</citetitle></ulink> to create one-click, auto-configuring hadoop clusters on multiple clouds. Since jclouds supports the CloudStack API, most of the services provided by Whirr and Provisionr should work out of the box on CloudStack. My first task is to test that assumption, make sure everything is well documented, and correct any issues with the latest versions of CloudStack (4.0 and 4.1).
+               </para>
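+<para>
+For illustration, a minimal Whirr recipe for such a cluster might look like the sketch below. This is an assumption-laden example rather than a tested recipe: the property names follow Whirr’s published recipes, the cloudstack provider is assumed to be available through jclouds, and the endpoint and credentials are placeholders.
+</para>
+<programlisting><![CDATA[
+# Hypothetical Whirr recipe: a 4-node hadoop cluster on CloudStack.
+whirr.cluster-name=hadoopcluster
+# One master (namenode + jobtracker) and three workers (datanode + tasktracker).
+whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 hadoop-datanode+hadoop-tasktracker
+whirr.provider=cloudstack
+whirr.endpoint=http://management-server:8080/client/api
+whirr.identity=<your-api-key>
+whirr.credential=<your-secret-key>
+]]></programlisting>
+<para>
+With such a recipe, a single <command>whirr launch-cluster --config hadoop.properties</command> run would drive the whole provisioning process.
+</para>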
+               
+<para>
+The biggest challenge in hadoop provisioning is automatically configuring each instance at launch time based on what it is supposed to do, a process known as contextualization<ulink url="http://dl.acm.org/citation.cfm?id=1488934"><citetitle>[3]</citetitle></ulink><ulink url="http://www.nimbusproject.org/docs/current/clouds/clusters2.html"><citetitle>[4]</citetitle></ulink>. It makes last-minute changes inside an instance so that it can adapt to its cluster environment. Many automated cloud services are enabled by contextualization. For example, in one-click hadoop clusters, contextualization basically amounts to generating and distributing ssh key pairs among instances, telling an instance where the master node is and what other slave nodes it should be aware of, and so on. On EC2, contextualization is done by passing information through the EC2_USER_DATA entry<ulink url="http://aws.amazon.com/amazon-linux-ami/"><citetitle>[5]</citetitle></ulink><ulink url="https://svn.apache.org/repos/asf/whirr/branches/contrib-python/src/py/hadoop/cloud/data/hadoop-ec2-init-remote.sh"><citetitle>[6]</citetitle></ulink>. Whirr and Provisionr embrace this feature to provision hadoop instances on EC2. My second task is to extend Whirr and Provisionr’s one-click solution on EC2 to CloudStack, and to improve CloudStack’s support for Whirr and Provisionr so that hadoop can be provisioned on CloudStack-based clouds.
+</para>
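+<para>
+As a concrete sketch of this mechanism on CloudStack, the snippet below passes a boot script to a new instance through the deployVirtualMachine API call (userdata must be base64-encoded). The endpoint, credentials, and UUIDs are placeholders; the signing scheme is CloudStack’s documented HMAC-SHA1 over the sorted, lower-cased query string.
+</para>
+<programlisting><![CDATA[
+import base64, hashlib, hmac, urllib.parse, requests
+
+API = "http://management-server:8080/client/api"      # placeholder endpoint
+API_KEY, SECRET_KEY = "my-api-key", "my-secret-key"   # placeholder credentials
+
+def sign(params):
+    # CloudStack signs the sorted, URL-encoded, lower-cased query string
+    # with HMAC-SHA1 and base64-encodes the digest.
+    qs = "&".join("%s=%s" % (k, urllib.parse.quote_plus(str(v)))
+                  for k, v in sorted(params.items()))
+    digest = hmac.new(SECRET_KEY.encode(), qs.lower().encode(), hashlib.sha1).digest()
+    return base64.b64encode(digest).decode()
+
+# The script the instance runs at boot; this is the EC2_USER_DATA equivalent.
+user_data = base64.b64encode(b"#!/bin/sh\necho master > /etc/hadoop-role\n").decode()
+
+params = {
+    "command": "deployVirtualMachine",
+    "serviceofferingid": "SERVICE-OFFERING-UUID",   # placeholder IDs
+    "templateid": "TEMPLATE-UUID",
+    "zoneid": "ZONE-UUID",
+    "userdata": user_data,
+    "response": "json",
+    "apikey": API_KEY,
+}
+params["signature"] = sign(params)
+print(requests.get(API, params=params).json())
+]]></programlisting>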
+<para>
+My third task is to add a Query API to CloudStack that is compatible with Amazon Elastic MapReduce (EMR). Through this API, all hadoop provisioning functionality will be exposed, and users can reuse cloud clients written for EMR to create and manage hadoop clusters on CloudStack-based clouds.
+</para>
+       </section>
+
+       <section id="Project-Details">
+               <title>Project Details</title>
+               <para>
+                       Whirr defines four roles for the hadoop provisioning service: Namenode, JobTracker, Datanode and TaskTracker. With the help of CloudInit<ulink url="https://help.ubuntu.com/community/CloudInit"><citetitle>[7]</citetitle></ulink> (a popular package for cloud instance initialization), each VM instance is configured based on its role and a compressed file that is passed in the EC2_USER_DATA entry. Since CloudStack also supports EC2_USER_DATA, I think the most feasible way to get hadoop provisioning on CloudStack is to extend Whirr’s EC2 solution to the CloudStack platform and to make the necessary adjustments for CloudStack’s specifics.
+               </para>
+               
+               <para>
+               Whirr and Provisionr deal with two critical issues in their 
role configuration scripts (configure-hadoop-role_list): SSH key authentication 
and hostname configuration.
+               </para>
+               <orderedlist>
+                       <listitem><para>
+                       SSH key authentication. SSH key based authentication is required so that the master node can log in to the slave nodes to start/stop hadoop daemons, and so that each node can log in to itself to start its own hadoop daemons. Traditionally this is done by generating a key pair on the master node and distributing the public key to all slave nodes, which requires human intervention. Whirr works around this problem on EC2 by having a common key pair for all nodes in a hadoop cluster, so that every node is able to log in to every other node (see the sketch after this list). The key pair is provided by the user and obtained by CloudInit inside an instance from the metadata service. As far as I know, CloudStack does not support user-provided ssh key authentication. Although CloudStack has the createSSHKeyPair API<ulink url="http://cloudstack.apache.org/docs/en-US/Apache_CloudStack/4.0.2/html/Installation_Guide/using-sshkeys.html"><citetitle>[8]</citetitle></ulink> to generate SSH keys, and users can create an instance template that supports SSH keys, there is no easy way to have a unified SSH key on all cluster instances. Besides, Whirr prefers minimal image management, so a customized template does not seem to fit well here.
+                       </para></listitem>
+                       <listitem><para>
+                       Hostname configuration. The hostname of each instance has to be properly set and injected into the set of hadoop config files (core-site.xml, hdfs-site.xml, mapred-site.xml). For an EC2 instance, the hostname is derived from its public IP and an EC2-specific prefix and suffix (e.g. an instance with IP 54.224.206.71 has the hostname ec2-54-224-206-71.compute-1.amazonaws.com). This hostname amounts to the Fully Qualified Domain Name that uniquely identifies the node on the network. In the case of CloudStack, if the user does not specify a name, the hostname that identifies a VM on a network is a unique UUID generated by CloudStack<ulink url="https://cwiki.apache.org/CLOUDSTACK/allow-user-provided-hostname-internal-vm-name-on-hypervisor-instead-of-cloud-platform-auto-generated-name-for-guest-vms.html"><citetitle>[9]</citetitle></ulink>.
+                       </para></listitem>
+                       </orderedlist>
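+                       <para>
+                       To make these two items concrete, here is a minimal sketch (not Whirr’s actual implementation) of how a launcher could generate one cluster-wide key pair and render a per-node user-data script that installs it and pins the hostname. The FQDN and file paths are illustrative:
+                       </para>
+                       <programlisting><![CDATA[
+import pathlib, string, subprocess, textwrap
+
+# One key pair for the whole cluster (Whirr's EC2 workaround), so that
+# every node can log in to every other node.
+subprocess.run(["ssh-keygen", "-q", "-t", "rsa", "-N", "", "-f", "cluster_key"],
+               check=True)
+pub = pathlib.Path("cluster_key.pub").read_text().strip()
+priv = pathlib.Path("cluster_key").read_text().strip()
+
+# Per-node boot script: set the hostname, then install the shared key pair.
+template = string.Template(textwrap.dedent("""\
+    #!/bin/sh
+    hostname $fqdn
+    echo $fqdn > /etc/hostname
+    mkdir -p /root/.ssh
+    echo '$pub' >> /root/.ssh/authorized_keys
+    cat > /root/.ssh/id_rsa <<'EOF'
+    $priv
+    EOF
+    chmod 600 /root/.ssh/id_rsa
+    """))
+
+user_data = template.substitute(fqdn="hadoop-master.cluster.internal",
+                                pub=pub, priv=priv)
+]]></programlisting>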
+                       <para>
+                       These two are the main issues that need improved support on the CloudStack side. Other tasks, such as preparing disks, installing hadoop tarballs and starting hadoop daemons, are easy to handle because they are relatively role- and instance-independent and static. runurl can be used to simplify user-data scripts.
+                       </para>
+                       <para>
+                       After we achieve hadoop provisioning on CloudStack using Whirr, we can go further and add a Query API to CloudStack to expose this functionality. I will write an API that is compatible with Amazon Elastic MapReduce (EMR)<ulink url="http://docs.aws.amazon.com/ElasticMapReduce/latest/API/Welcome.html"><citetitle>[10]</citetitle></ulink> so that users can reuse clients written for EMR to submit jobs to existing hadoop clusters, poll job status, terminate a hadoop instance, and do other things on CloudStack-based clouds. There are eight actions<ulink url="http://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_Operations.html"><citetitle>[11]</citetitle></ulink> currently supported in the EMR API; I will try to implement as many as I can during the period of GSoC. The following statements give some examples of the API that I will write.
+                       </para>
+                       <programlisting><![CDATA[
+    https://elasticmapreduce.cloudstack.com?Action=RunJobFlow
+        &Name=MyJobFlowName
+        &Instances.MasterInstanceType=m1.small
+        &Instances.SlaveInstanceType=m1.small
+        &Instances.InstanceCount=4
+]]></programlisting>
+<para>
+This will launch a new hadoop cluster of four instances of the specified instance types and add a job flow to it.
+</para>
+<programlisting><![CDATA[
+    https://elasticmapreduce.cloudstack.com?Action=AddJobFlowSteps
+        &JobFlowId=j-3UN6WX5RRO2AG
+        &Steps.member.1.Name=MyStep2
+        &Steps.member.1.HadoopJarStep.Jar=MyJar
+]]></programlisting>
+<para>
+This will add a step to the existing job flow with ID j-3UN6WX5RRO2AG. This 
step will run the specified jar file.
+</para>
+<programlisting><![CDATA[
+    https://elasticmapreduce.cloudstack.com?Action=DescribeJobFlows
+        &JobFlowIds.member.1=j-3UN6WX5RRO2AG
+]]></programlisting>
+<para>
+This will return the status of the given job flow.
+</para>
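+<para>
+Because the API is query-based, any HTTP client can drive it. A hypothetical client-side call matching the RunJobFlow example above, written here in Python (the host name is a placeholder that will depend on where the API service is deployed):
+</para>
+<programlisting><![CDATA[
+import requests
+
+# Hypothetical endpoint and parameters mirroring the RunJobFlow example.
+resp = requests.get("https://elasticmapreduce.cloudstack.com", params={
+    "Action": "RunJobFlow",
+    "Name": "MyJobFlowName",
+    "Instances.MasterInstanceType": "m1.small",
+    "Instances.SlaveInstanceType": "m1.small",
+    "Instances.InstanceCount": "4",
+})
+print(resp.status_code)
+print(resp.text)  # job flow ID on success, to be reused with AddJobFlowSteps
+]]></programlisting>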
+       </section>
+
+       <section id="Roadmap">
+               <title>Roadmap</title>
+               
+               <para><emphasis role="bold">Jun. 17 ∼ Jun. 30</emphasis> 
</para>
+               <orderedlist>
+               <listitem><para>
+               Learn CloudStack and Apache Whirr/Provisionr APIs; deploy a CloudStack cluster.
+               </para></listitem>
+               
+               <listitem><para>
+               Identify how EC2_USER_DATA is passed and executed on each 
CloudStack instance.
+               </para></listitem>
+               <listitem><para>
+               Figure out how the files passed in EC2_USER_DATA are acted upon 
by CloudInit.
+               </para></listitem>
+               <listitem><para>
+               Identify files in /etc/init/ that are used or modified by Whirr 
and Provisionr for hadoop related configuration.
+               </para></listitem>
+               <listitem><para>
+               Deploy a hadoop cluster on CloudStack via Whirr/Provisionr. This is to test what is missing in CloudStack or Whirr/Provisionr in terms of their support for each other.
+               </para></listitem>
+               </orderedlist>
+               <para><emphasis role="bold">Jul. 1 ∼ Aug. 1</emphasis></para>
+               <orderedlist>
+               <listitem><para>
+               Write scripts to configure the VM hostname on CloudStack with the help of CloudInit.
+               </para></listitem>
+               <listitem><para>
+               Write scripts to distribute SSH keys among CloudStack instances. Add the capability of using a user-provided ssh key for authentication to CloudStack.
+               </para></listitem>
+               <listitem><para>
+               Take care of the other things left for hadoop provisioning, 
such as mounting disks, installing hadoop tarballs, etc.
+               </para></listitem>
+               <listitem><para>
+               Compose the files that need to be passed in EC2_USER_DATA to each CloudStack instance. Test these files and write patches to make sure that Whirr/Provisionr can successfully deploy one-click hadoop clusters on CloudStack.
+               </para></listitem>
+               </orderedlist>
+               <para><emphasis role="bold">Aug. 3 ∼ Sep. 8</emphasis> </para>
+               <orderedlist>
+               <listitem><para>
+               Design and build an Elastic MapReduce API for CloudStack that takes control of hadoop cluster creation and management.
+               </para></listitem>
+               <listitem><para>
+               Implement the eight actions defined in the EMR API. This task might take a while.
+               </para></listitem>
+               
+               </orderedlist>
+               <para><emphasis role="bold">Sep. 10 ∼ Sep. 23</emphasis> 
</para>
+               <orderedlist>
+               <listitem><para>
+               Code cleanup and documentation wrap-up.
+               </para></listitem>
+               
+               </orderedlist>
+               
+               
+       </section>
+
+       <section id="Deliverables-meng">
+               <title>Deliverables</title>
+               <orderedlist>
+               <listitem><para>
+               Whirr has limited support for CloudStack. Check what’s missing and make sure all steps are properly documented on the Whirr and CloudStack websites.
+               </para></listitem>
+               <listitem><para>
+               Contribute code to CloudStack and send patches to Whirr/Provisionr if necessary to enable hadoop provisioning on CloudStack via Whirr/Provisionr.
+               </para></listitem>
+               <listitem><para>
+               Build an EMR-compatible API for CloudStack.
+               </para></listitem>
+               </orderedlist>
+               </section>
+                       <section id="Nice-to-have">
+               <title>Nice to have</title>
+               <para>In addition to the required deliverables, it’s nice to 
have the following:</para>
+               <orderedlist>
+               <listitem><para>
+               The capability to add and remove hadoop nodes dynamically, to enable elastic hadoop clusters on CloudStack.
+               </para></listitem>
+               <listitem><para>
+               A review of the existing tools that offer one-click provisioning, making sure that they support CloudStack-based clouds.
+               </para></listitem>
+               </orderedlist>
+       </section>
+
+                       <section id="References">
+               <title>References</title>
+               
+               <orderedlist>
+               <listitem><para>http://whirr.apache.org/</para></listitem>
+               <listitem><para>http://www.jclouds.org/documentation/gettingstarted/what-is-jclouds/</para></listitem>
+               <listitem><para>Katarzyna Keahey, Tim Freeman, Contextualization: Providing One-Click Virtual Clusters</para></listitem>
+               <listitem><para>http://www.nimbusproject.org/docs/current/clouds/clusters2.html</para></listitem>
+               <listitem><para>http://aws.amazon.com/amazon-linux-ami/</para></listitem>
+               <listitem><para>https://svn.apache.org/repos/asf/whirr/branches/contrib-python/src/py/hadoop/cloud/data/hadoop-ec2-init-remote.sh</para></listitem>
+               <listitem><para>https://help.ubuntu.com/community/CloudInit</para></listitem>
+               <listitem><para>http://cloudstack.apache.org/docs/en-US/Apache_CloudStack/4.0.2/html/Installation_Guide/using-sshkeys.html</para></listitem>
+               <listitem><para>https://cwiki.apache.org/CLOUDSTACK/allow-user-provided-hostname-internal-vm-name-on-hypervisor-instead-of-cloud-platform-auto-generated-name-for-guest-vms.html</para></listitem>
+               <listitem><para>http://docs.aws.amazon.com/ElasticMapReduce/latest/API/Welcome.html</para></listitem>
+               <listitem><para>http://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_Operations.html</para></listitem>
+               <listitem><para>http://buildacloud.org/blog/235-puppet-and-cloudstack.html</para></listitem>
+               <listitem><para>http://chriskleban-internet.blogspot.com/2012/03/build-cloud-cloudstack-instance.html</para></listitem>
+               <listitem><para>http://gehrcke.de/2009/06/aws-about-api/</para></listitem>
+               <listitem><para>Apache_CloudStack-4.0.0-incubating-API_Developers_Guide-en-US.pdf</para></listitem>
+               </orderedlist>
+       </section>
+       
+</chapter>
