Hello all,

In our first pass at EDP, the model for job settings was very consistent across all of our job types. The execution-time settings fit into this (superset) structure:

    job_configs = {'configs': {},  # config settings for oozie and hadoop
                   'params': {},   # substitution values for Pig/Hive
                   'args': []}     # script args (Pig and Java actions)

But we have some things that don't fit (and probably more in the future):

1) Java jobs have 'main_class' and 'java_opts' settings

   Currently these are handled as additional fields added to the structure above. These were the first to diverge.

2) Streaming MapReduce (anticipated) requires mapper and reducer settings (different than the mapred.xxxx.class settings for non-streaming MapReduce)

Problems caused by adding fields
--------------------------------

The job_configs structure above is stored in the database. Each time we add a field to the structure above at the level of configs, params, and args, we force a change to the database tables, a migration script and a change to the JSON validation for the REST api. We also cause a change for python-savannaclient and potentially other clients.

This kind of change seems bad.

Proposal: Borrow a page from Oozie and add "savanna." configs
-------------------------------------------------------------

I would like to fit divergent job settings into the structure we already have. One way to do this is to leverage the 'configs' dictionary. This dictionary primarily contains settings for hadoop, but there are a number of "oozie.xxx" settings that are passed to oozie as configs or set by oozie for the benefit of running apps.

What if we allow "savanna." settings to be added to configs? If we do that, any and all special configuration settings for specific job types or subtypes can be handled with no database changes and no api changes.

Downside
--------

Currently, all 'configs' are rendered in the generated oozie workflow. The "savanna." settings would be stripped out and processed by Savanna, thereby changing that behavior a bit (maybe not a big deal).

We would also be mixing "savanna." configs with config_hints for jobs, so users would potentially see "savanna.xxxx" settings mixed with oozie and hadoop settings. Again, maybe not a big deal, but it might blur the lines a little bit. Personally, I'm okay with this.

Slightly different
------------------

We could also add a "'savanna-configs': {}" element to job_configs to keep the configuration spaces separate.

But, now we would have 'savanna-configs' (or another name), 'configs', 'params', and 'args'. Really? Just how many different types of values can we come up with? :)

I lean away from this approach.

Related: breaking up the superset
---------------------------------

It is also the case that not every job type has every value type.

                Configs   Params   Args
    Hive           Y        Y       N
    Pig            Y        Y       Y
    MapReduce      Y        N       N
    Java           Y        N       Y

So do we make that explicit in the docs and enforce it in the api with errors?

Thoughts? I'm sure there are some :)

Best,

Trevor
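
For illustration, the "savanna." proposal and the per-job-type check could be sketched roughly as follows. This is not Savanna code: split_savanna_configs, check_job_configs and ALLOWED_KEYS are hypothetical names, and "savanna.java.main_class" is only an assumed example of what such a key might look like. The sketch separates "savanna." entries from 'configs' so that only the remainder would be rendered into the oozie workflow, and rejects value types that a job type does not take according to the table above.

```python
# Hypothetical sketch; none of these names exist in Savanna.
SAVANNA_PREFIX = "savanna."

# Value types each job type accepts, copied from the table in the post.
ALLOWED_KEYS = {
    "Hive": {"configs", "params"},
    "Pig": {"configs", "params", "args"},
    "MapReduce": {"configs"},
    "Java": {"configs", "args"},
}


def split_savanna_configs(configs):
    """Return (savanna_settings, remaining) from a 'configs' dict.

    Keys starting with "savanna." are separated out (prefix kept), so only
    the remaining entries would be rendered into the oozie workflow.
    """
    savanna_settings = {}
    remaining = {}
    for key, value in configs.items():
        if key.startswith(SAVANNA_PREFIX):
            savanna_settings[key] = value
        else:
            remaining[key] = value
    return savanna_settings, remaining


def check_job_configs(job_type, job_configs):
    """Raise ValueError if a non-empty value type is not accepted by job_type."""
    allowed = ALLOWED_KEYS[job_type]
    for key, value in job_configs.items():
        if value and key not in allowed:
            raise ValueError("%s jobs do not accept '%s'" % (job_type, key))


job_configs = {
    "configs": {
        "mapred.reduce.tasks": "2",
        "savanna.java.main_class": "org.example.Main",  # assumed key name
    },
    "params": {},
    "args": ["input", "output"],
}

check_job_configs("Java", job_configs)
savanna_settings, oozie_configs = split_savanna_configs(job_configs["configs"])
```

Keeping the prefix on the extracted keys is a choice made here for simplicity; stripping it before processing would work just as well. Empty 'params' or 'args' are tolerated so that clients sending the full superset structure would not break.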