[ANN] Pallet-Hadoop (Hadoop clusters as data structures)

Sam Ritchie Wed, 01 Jun 2011 19:27:12 -0700

Hey all,

I'd like to announce Pallet-Hadoop
<https://github.com/pallet/pallet-hadoop/tree/master>, a layer built on top
of Pallet <https://github.com/pallet/pallet> that allows users to describe a
Hadoop cluster configuration as a nested Clojure map. Here's a cluster with
one master node and two slave nodes with some custom properties, all 64-bit
machines with at least 4 GB of RAM, running Ubuntu 10.10:


(def example-cluster
  (cluster-spec :private
                {:jobtracker (node-group [:jobtracker :namenode])
                 :slaves     (slave-group 2)}
                :base-machine-spec {:os-family :ubuntu
                                    :os-version-matches "10.10"
                                    :os-64-bit true
                                    :min-ram (* 4 1024)}
                :base-props {:hdfs-site {:dfs.data.dir "/mnt/dfs/data"
                                         :dfs.name.dir "/mnt/dfs/name"}
                             :mapred-site {:mapred.task.timeout 300000
                                           :mapred.reduce.tasks 3}}))
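
Since the properties are ordinary nested maps, they can be manipulated with
core Clojure functions before being handed to a cluster spec. The following is
a minimal sketch in plain Clojure, with no Pallet-Hadoop calls; the names
`slave-overrides` and `slave-props` are hypothetical, and the one-level merge
is an assumption for illustration, not a description of how the library
combines properties internally.

```clojure
;; Hypothetical illustration only: plain maps and clojure.core functions.
(def base-props
  {:hdfs-site   {:dfs.data.dir "/mnt/dfs/data"
                 :dfs.name.dir "/mnt/dfs/name"}
   :mapred-site {:mapred.task.timeout 300000
                 :mapred.reduce.tasks 3}})

;; Hypothetical overrides for one group of nodes.
(def slave-overrides
  {:mapred-site {:mapred.reduce.tasks 5}})

;; Merge one level deep, so an override replaces a single property
;; without discarding the other properties for the same config file.
(def slave-props
  (merge-with merge base-props slave-overrides))

(get-in slave-props [:mapred-site :mapred.reduce.tasks]) ;; => 5
(get-in slave-props [:mapred-site :mapred.task.timeout]) ;; => 300000
```

A plain `merge` would replace the whole `:mapred-site` map and lose
`:mapred.task.timeout`, which is why the sketch uses `merge-with merge`.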

Thanks to Pallet's flexibility and use of jclouds
<https://github.com/jclouds/jclouds>, the cluster description can be written
without reference to any specific cloud provider, and can be used to boot
machines on any of the major cloud providers
<https://github.com/jclouds/jclouds#readme> (or on local virtual machines!)
with a simple change of credentials.

This example project <https://github.com/pallet/pallet-hadoop-example> contains
everything you need to get started; it walks through all steps necessary to
boot a cluster and run the canonical word count example on Amazon's EC2
platform. The project wiki
<https://github.com/pallet/pallet-hadoop/wiki> contains
a lot more detail on the design and flexibility of the data structures
involved.

Future plans include intelligent default settings that adjust based on the
specs of the cluster, and the ability to run Cascalog
<https://github.com/nathanmarz/cascalog> queries on these distributed
clusters from Cake and Leiningen.

I'd love to hear what you all think about this! Huge thanks to Hugo Duncan
<http://hugoduncan.org/> for getting this started, and to Toni Batchelli
<http://tbatchelli.org/> for his excellent work on this project and its
foundation, Pallet's new Hadoop crate
<https://github.com/pallet/pallet-apache-crates>.

Cheers,
Sam

