Hi!

a few email exchanges on this ML, coupled with John's remark that he'd
be open to giving architectural advice, made me realize that a discussion
focused on a particular use case I'm trying to address might be much more
fruitful than random questions here and there. It is a long email, but I
hope it will be useful to the majority of folks subscribed to puppet-users@.

This use case originates from the Apache Bigtop project. Bigtop is to Hadoop
what Debian is to Linux -- we're a project aiming at building a 100%
community-driven BigData management distribution based on Apache Hadoop and
its ecosystem projects. We are concerned with integration, packaging,
deployment, and system testing of the resulting distro, and we also happen
to be the basis for a few commercial distributions -- most notably
Cloudera's CDH. Now, it must be mentioned that when I say "a distribution"
I really mean it. Here's the list of components that we have to manage
(it is definitely not just Hadoop):
   
https://issues.apache.org/jira/browse/BIGTOP-816?focusedCommentId=13560059&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13560059

Our current Puppet code is a pretty old code base (it originated in
pre-2.x Puppet) that currently serves as the main engine for dynamically
deploying Bigtop clusters on EC2 for testing purposes. However, given that
the Bigtop distro is the foundation for the commercial distros, we would
like our Puppet code to be the go-to place for all puppet-driven Hadoop
deployment needs.

Thus, at the highest level, our Puppet code needs to be:
   #0 useful for as many versions of Puppet as possible.
        Obviously, we shouldn't obsess too much over something
        like Puppet 0.24, but we should keep the goal in mind.
   #1 useful in a classical puppet-master driven setup where
        one has access to modules, hiera/extlookup, etc., all nicely
        set up and maintained under /etc/puppet.
   #2 useful in a masterless mode, so that things like Puppet/Whirr
        integration can be utilized:
https://issues.apache.org/jira/browse/WHIRR-385
        This is the case where the Puppet classes are guaranteed to be
        delivered to each node out of band and --modulepath will be given
        to puppet apply. Everything else (hiera/extlookup files, etc.) is
        likely to require additional out-of-band communication that we
        would like to minimize.
   #3 useful in orchestration scenarios (like Apache Ambari), although
        this could be viewed as a subset of the previous one.

Now, a typical Hadoop cluster is a collection of nodes, each of which
runs a certain collection of services that belong to a particular
subsystem (such as HDFS, YARN, HBase, etc.). My first instinct at
modeling this was to introduce a series of classes that would capture
the configuration of these subsystems, plus a top-level class that would
correspond to settings common to the entire cluster. IOW, I would
like to be able to express things like "in this cluster, every subsystem
that cares about authentication should be set to 'kerberos',
all of the JVMs should be given at minimum 1G of RAM, and I want
node X to host HDFS's namenode, etc.". All of this brings us to question #1.

Q1: what would be the most natural way to instantiate such classes
on every node that would satisfy #1-#3 styles of use?
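To make this concrete, here is a minimal sketch of the layering I have
in mind (class and parameter names are illustrative, not our actual code):

     # Settings common to the entire cluster.
     class bigtop::cluster (
       $authentication = "kerberos",  # picked up by every subsystem that cares
       $jvm_heap_min   = "1g"         # minimum heap for every JVM we manage
     ) { }

     # One class per service of a subsystem; a node gets the classes for
     # whatever services it is supposed to host.
     class bigtop::hdfs::namenode {
       include bigtop::cluster
       # ... packages, config files and the namenode service go here,
       # reading $bigtop::cluster::authentication etc. as needed ...
     }

With something like this, "node X hosts HDFS's namenode" boils down to
getting bigtop::hdfs::namenode into node X's catalog one way or another.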

Initially, I was thinking of an ENC style, where a complete manifest
of classes, their arguments, and top-level parameters can be
expected for every node. This style has the nice property of making
classes completely independent of any particular way of instantiating
them. IOW, I do not care whether a user of Bigtop's puppet code
explicitly puts something like:
     class { "bigtop::hdfs::datanode":
         namenode_uri => "hdfs://nn.cluster.mycompany.com"
         ....
     }
or whether somehow an ENC will generate:
    classes:
       bigtop::hdfs::datanode:
          namenode_uri: hdfs://nn.cluster.mycompany.com

The classes do NOT care how they are being instantiated.

Well, almost. They don't, as long as I'm willing to make their use
super-verbose, essentially requiring that every single
setting is given explicitly. IOW, even though something
like namenode_uri will be exactly the same for all the
services comprising the HDFS subsystem, I will still require it
to be set explicitly for every single class that gets instantiated
(even on the same node). E.g.:
     class { "bigtop::hdfs::datanode":
         namenode_uri => "hdfs://nn.cluster.mycompany.com"
     }
     class { "bigtop::hdfs::secondarynamenode":
         namenode_uri => "hdfs://nn.cluster.mycompany.com"
     }
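One way to avoid this repetition that I have been toying with is to
hoist the shared datum into a subsystem-level class that every service
class reads from (again, purely illustrative, not our actual code):

     # Declared (or defaulted) once; holds everything shared by HDFS services.
     class bigtop::hdfs (
       $namenode_uri = "hdfs://nn.cluster.mycompany.com"
     ) { }

     class bigtop::hdfs::datanode {
       include bigtop::hdfs
       # Every HDFS service class picks the URI up from the one place.
       $namenode_uri = $bigtop::hdfs::namenode_uri
       # ... config templates reference $namenode_uri ...
     }

     class bigtop::hdfs::secondarynamenode {
       include bigtop::hdfs
       $namenode_uri = $bigtop::hdfs::namenode_uri
       # ...
     }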

Now, this brings us to the second question.

Q2: In a situation like this, what would be an ideal way of making
class instantiations less verbose? Also, as long as we are
making them less verbose via some external means like
hiera or extlookup, is there any point in keeping the
instantiations along the lines of:
   class { "bigtop::hdfs::secondarynamenode":
   }
instead of:
   include bigtop::hdfs::secondarynamenode
?

After all, if we end up requiring *some* class parameters to be
loaded from hiera/extlookup, we might as well expect *all* of
them to be loaded from there. This, by the way, gives us the
extra benefit of being able to do things like:
     include bigtop::hdfs::datanode
from multiple sites without fear of an "already declared" error.
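A sketch of what I mean, assuming the hiera() function is available
(on Puppet 3, automatic data binding can do a similar lookup, keyed
per class, without the explicit call; key names are illustrative):

     class bigtop::hdfs::datanode (
       # Default comes from external data rather than the call site.
       $namenode_uri = hiera("bigtop::hdfs::namenode_uri")
     ) {
       # ... config templates reference $namenode_uri ...
     }

     # Now this is safe to repeat from as many sites as needed:
     include bigtop::hdfs::datanode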

Finally, the question that has become really obvious to me
after pondering this design is: which ways of capturing
a datum (e.g. facter facts, top-scope variables, node-scope variables,
class-scope variables, parent-class-scope variables)
are the most appropriate for the different types of information
that we need to express about our clusters?

Q3: IOW, what are the best practices for managing the following
classes of class parameters (categorization stolen from Rich)?

1)  variables that are defaults that can be rationally set based
      on properties of the node itself, such as using the OS family
      to pick the package manager or package name to use (see the
      sketch after this list),
2)  variables that are set as part of a group and the role it is
     playing, such as the set of common variables that all "slave"
     nodes in a cluster should have,
3)  variables that are set as a function of other components that
     relate or connect to them, such as a client needing the port
     and host address of a server, that is, variables that depend on
     topology,
4)  variables that are set based on external context but can be
     categorized on a node-by-node basis, such as the ntp server
     address based on the location of the data center, or which users
     should have logon access to which machines
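For category 1, I picture something along these lines (class and
package names are illustrative):

     # A default that can be rationally derived from the node itself:
     # the JDK package name differs per OS family.
     class bigtop::java {
       case $::osfamily {
         "RedHat": { $jdk_package = "java-1.6.0-openjdk" }
         "Debian": { $jdk_package = "openjdk-6-jdk" }
         default:  { fail("Unsupported osfamily: ${::osfamily}") }
       }
       package { "jdk":
         ensure => installed,
         name   => $jdk_package,
       }
     }

It is categories 2-4 that I'm much less sure how to capture cleanly.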

Thanks,
Roman.
