What is mrjob?
-----------------
mrjob is a Python package that helps you write and run Hadoop Streaming jobs.

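To make "Hadoop Streaming job" concrete, here is a rough, in-memory sketch of the map, sort, and reduce phases such a job goes through, using word counting as the example. This is plain Python and not mrjob's API: the names mapper, reducer, and run_step are made up for illustration, and the in-memory sort only stands in for what Hadoop does between the map and reduce phases.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Sum all the counts collected for a single word."""
    yield word, sum(counts)


def run_step(records, mapper, reducer):
    """Simulate one map-reduce step in memory (hypothetical helper)."""
    # Map phase: every input record may produce any number of pairs.
    mapped = [pair for record in records for pair in mapper(record)]
    # Sorting by key stands in for Hadoop's shuffle/sort phase.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: each key is handed to the reducer with all its values.
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, (value for _, value in group)))
    return output


lines = ["the quick brown fox", "the lazy dog", "The end"]
word_counts = run_step(lines, mapper, reducer)
```

Because run_step returns a plain list of pairs, its output could be passed to another run_step call with a different mapper and reducer, which is the idea behind a multi-step job.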
mrjob fully supports Amazon's Elastic MapReduce (EMR) service, which allows you to buy time on a Hadoop cluster on an hourly basis. It also works with your own Hadoop cluster.

Some important features:

 * Run jobs on EMR, your own Hadoop cluster, or locally (for testing)
 * Write multi-step jobs (one map-reduce step feeds into the next)
 * Duplicate your production environment inside Hadoop
   * Upload your source tree and put it in your job's $PYTHONPATH
   * Run make and other setup scripts
   * Set environment variables (e.g. $TZ)
   * Easily install Python packages from tarballs (EMR only)
   * Setup handled transparently by mrjob.conf config file
 * Automatically interpret error logs from EMR
 * SSH tunnel to Hadoop job tracker on EMR
 * Minimal setup
   * To run on EMR, set $AWS_ACCESS_KEY_ID and $AWS_SECRET_ACCESS_KEY
   * To run on your Hadoop cluster, install simplejson and make sure $HADOOP_HOME is set

More info:

 * Install mrjob: python setup.py install
 * Documentation: http://packages.python.org/mrjob/
 * PyPI: http://pypi.python.org/pypi/mrjob
 * Development is hosted at GitHub: http://github.com/Yelp/mrjob

What's new?
-------------
v0.2.6, 2011-05-24 -- fix bootstrapping mrjob
 * Set Hadoop to run on EMR with --hadoop-version (Issue #71)
   * Default is still 0.18, but will change to 0.20 in mrjob v0.3.0
 * New inline runner, for testing locally with a debugger
 * New --strict-protocols option, to catch unencodable data (Issue #76)
 * Added steps_python_bin option (for use with virtualenv)
 * mrjob no longer chokes when asked to run on an EMR job flow running Hadoop 0.20 (Issue #110)
 * mrjob no longer chokes on job flows with no LogUri (Issue #112)