I am involved in some Hadoop deployments and there is a very interesting possibility for Pharo in that ecosystem.
Namely, there is YARN, a scheduler for distributing computing over a cluster of nodes. All kinds of technologies can be deployed on the nodes (e.g. Python, R, Java), and Pharo images plus a headless VM could be deployed as well. A deployed node can communicate back to what is called the ApplicationMaster via REST callbacks (easy game in Pharo).

There is also a Hadoop component named ZooKeeper, which acts as a distributed configuration repository. One can talk to it over REST ( https://github.com/apache/zookeeper/tree/trunk/src/contrib/rest ) and there is a C API as well, so that would be FFI or a plugin ( http://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html ). Given that we can also make some Java calls (using the JNI module with 32-bit Java), I'd say we can integrate well enough with YARN.

There is another very nice project, Slider (on YARN), which is about deploying stuff in an elastic way (see http://slider.incubator.apache.org/ ). The next logical step would be to have Docker containers (containing a Pharo stack) deployed dynamically on the cluster using Slider, like this: http://www.slideshare.net/hortonworks/docker-on-slider-45493303

A first step here would be a basic YARN-Pharo application and a PoC for talking to ZooKeeper. This would open interesting gates for Pharo given its strengths, even more so once we get a 64-bit VM. What is cool with Pharo is that an image can be very small and self-contained, versus a Java application, which drags tons of jar files along with it.

Access to data on HDFS can happen through NFSv3, so we can go that route. There is also a REST API for it, WebHDFS ( https://hadoop.apache.org/docs/r1.0.4/webhdfs.html ).

Tell me what you think!

Phil
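PS: to make the PoC part a bit more concrete, here is a rough, untested sketch of what the two REST probes could look like from Pharo using the Zinc HTTP client. The hostnames, ports and paths are invented for illustration; you would substitute whatever your cluster actually exposes.

```smalltalk
"Probe the ZooKeeper REST contrib: read a znode as JSON.
zk-host:9998 and the znode path are made-up examples."
ZnClient new
	url: 'http://zk-host:9998/znodes/v1/config/pharo-app';
	accept: ZnMimeType applicationJson;
	get.

"Read a file over the WebHDFS REST API (op=OPEN).
ZnClient follows the redirect to the datanode by itself.
namenode:50070 and the file path are made-up examples."
ZnClient new
	url: 'http://namenode:50070/webhdfs/v1/user/phil/data.csv';
	queryAt: 'op' put: 'OPEN';
	get.
```

Nothing exotic needed on the Pharo side, which is kind of the point.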