Hi! A few email exchanges on this ML, coupled with John's remark that he'd be open to giving architectural advice, made me realize that a discussion focused on a particular use case I'm trying to address might be much more fruitful than random questions here and there. It is a long email, but I hope it will be useful to the majority of folks subscribed to puppet-users@.
This use case originates from the Apache Bigtop project. Bigtop is to Hadoop what Debian is to Linux: we are a project aiming to build a 100% community-driven BigData management distribution based on Apache Hadoop and its ecosystem projects. We are concerned with integration, packaging, deployment, and system testing of the resulting distro, and we also happen to be the basis for a few commercial distributions -- most notably Cloudera's CDH. Now, it must be mentioned that when I say 'a distribution' I really mean it. Here's the list of components that we have to manage (it is definitely not just Hadoop): https://issues.apache.org/jira/browse/BIGTOP-816?focusedCommentId=13560059&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13560059

Our current Puppet code is a pretty old code base (it originated in pre-2.X Puppet) that currently serves as the main engine for dynamically deploying Bigtop clusters on EC2 for testing purposes. However, given that the Bigtop distro is the foundation for the commercial distros, we would like our Puppet code to be the go-to place for all puppet-driven Hadoop deployment needs. Thus, at the highest level, our Puppet code needs to be:

#0 useful for as many versions of Puppet as possible. Obviously, we shouldn't obsess too much over something like Puppet 0.24, but we should keep the goal in mind.

#1 useful in a classical puppet-master driven setup where one has access to modules, hiera/extlookup, etc., all nicely set up and maintained under /etc/puppet.

#2 useful in a masterless mode, so that things like Puppet/Whirr integration can be utilized: https://issues.apache.org/jira/browse/WHIRR-385 This is the case where the Puppet classes are guaranteed to be delivered to each node out of band and --modulepath will be given to puppet apply. Everything else (hiera/extlookup files, etc.) is likely to require additional out-of-band communication that we would like to minimize.

#3 useful in orchestration scenarios (like Apache Ambari), although this could be viewed as a subset of the previous one.

Now, a typical Hadoop cluster is a collection of nodes, each of which runs a certain collection of services that belong to a particular subsystem (such as HDFS, YARN, HBase, etc.). My first instinct at modeling was to introduce a series of classes that would capture the configuration of these subsystems, plus a top-level class that would correspond to settings common to the entire cluster. IOW, I would like to be able to express things like "in this cluster, for every subsystem that cares about authentication the setting should be 'kerberos', all of the JVMs should be given at minimum 1G of RAM, and I want node X to host HDFS's namenode, etc.".

All of this brings us to question #1.

Q1: what would be the most natural way to instantiate such classes on every node that would satisfy usage styles #1-#3?

Initially, I was thinking of an ENC-style approach where a complete manifest of classes, their arguments, and top-level parameters can be expected on/for every node. This style has the nice property of making classes completely independent of any particular way of instantiating them. IOW, I do not care whether a user of Bigtop's puppet code will explicitly put something like:

    class { "bigtop::hdfs::datanode":
        namenode_uri => "hdfs://nn.cluster.mycompany.com",
        ....
    }

or whether somehow an ENC will generate:

    classes:
      bigtop::hdfs::datanode:
        namenode_uri: hdfs://nn.cluster.mycompany.com
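To make this concrete, here is a minimal sketch of what the receiving class itself could look like. The package, file, and service names below are a strawman for illustration, not our actual module code:

    class bigtop::hdfs::datanode (
      $namenode_uri
    ) {
      # Install the datanode bits.
      package { 'hadoop-hdfs-datanode':
        ensure => installed,
      }

      # Render the config; the template would interpolate $namenode_uri.
      file { '/etc/hadoop/conf/core-site.xml':
        content => template('bigtop/core-site.xml.erb'),
        require => Package['hadoop-hdfs-datanode'],
      }

      # Keep the service running and restart it on config changes.
      service { 'hadoop-hdfs-datanode':
        ensure    => running,
        enable    => true,
        subscribe => File['/etc/hadoop/conf/core-site.xml'],
      }
    }

Both the explicit declaration and the ENC output above would bind namenode_uri to this one class parameter in exactly the same way.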
Either way, the classes do NOT care how they are being instantiated. Well, almost: they don't if I'm willing to make their use super-verbose, essentially requiring that every single setting be given explicitly. IOW, even though something like namenode_uri will be exactly the same for all the services comprising the HDFS subsystem, I will still require it to be set explicitly for every single class that gets instantiated (even on the same node). E.g.:

    class { "bigtop::hdfs::datanode":
        namenode_uri => "hdfs://nn.cluster.mycompany.com"
    }
    class { "bigtop::hdfs::secondarynamenode":
        namenode_uri => "hdfs://nn.cluster.mycompany.com"
    }

Now, this brings us to the second question.

Q2: In a situation like this, what would be an ideal way of making class instantiations less verbose? Also, as long as we are making them less verbose via some external means like hiera or extlookup, is there any point in keeping the instantiations along the lines of:

    class { "bigtop::hdfs::secondarynamenode": }

instead of:

    include bigtop::hdfs::secondarynamenode

? After all, if we end up requiring *some* class parameters to be loaded from hiera/extlookup, we might as well expect *all* of them to be loaded from there. This, by the way, will give us the extra benefit of being able to do things like:

    include bigtop::hdfs::datanode

from multiple sites without fear of "already declared" resource errors.

Finally, the question that has become really obvious to me after pondering this design is which ways of capturing a datum (e.g. facter, top-scope variables, node-scope variables, class-scope variables, parent-class-scope variables) are the most appropriate for the different types of information that we need to express about our clusters.

Q3: IOW, what are the best practices for managing the following classes of class parameters (categorization stolen from Rich)? A strawman sketch of how hiera might carve these up is in the P.S. below.

1) variables that are defaults that can be rationally set based on properties of the node itself, such as using the node's OS to choose which package manager or package name to use,

2) variables that are set as part of a group and the role it is playing, such as the set of common variables that all "slave" nodes in a cluster should have,

3) variables that are set as a function of other components that relate or connect to them, such as a client needing the port and host address of a server -- that is, variables that depend on topology,

4) variables that are set based on external context but can be categorized on a node-by-node basis, such as the NTP server address based on the location of the data center, or which users should have logon access to which machines.

Thanks,
Roman.
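P.S. To make the hiera angle concrete, here is the kind of strawman I have in mind. The file layout and keys are purely illustrative, and the $::role and $::datacenter facts would have to be custom facts we ship. First, a hierarchy that roughly maps to categories 1-4 above:

    # /etc/puppet/hiera.yaml
    ---
    :backends:
      - yaml
    :hierarchy:
      - "node/%{::fqdn}"        # per-node overrides (category 4)
      - "role/%{::role}"        # group/role settings (category 2); assumes a custom 'role' fact
      - "site/%{::datacenter}"  # data-center-wide settings (category 4); assumes a 'datacenter' fact
      - common                  # cluster-wide defaults; category 1 mostly stays in the modules as fact-driven defaults
    :yaml:
      :datadir: '/etc/puppet/hieradata'

and then a class whose parameters default to hiera lookups (this assumes a Puppet version where the hiera() function is available, e.g. 2.7 with hiera-puppet, or 3.x), so that a bare include works while an explicit declaration can still override:

    class bigtop::hdfs::secondarynamenode (
      # Topology data (category 3) comes from hieradata unless overridden.
      $namenode_uri = hiera('bigtop::hdfs::namenode_uri')
    ) {
      # ...
    }

With something like this, 'include bigtop::hdfs::secondarynamenode' can safely appear in multiple places, and the shared namenode_uri lives in exactly one spot in the hieradata.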