I am trying to figure out the best way to split output into different directories. My goal is to have a directory structure allowing me to add the content from each batch into the right bucket, like this:
...
/content/200904/batch_20090429
/content/200904/batch_20090430
/content/200904/batch_20090501
/content/200904/batch_20090502
/content/200905/batch_20090430
/content/200905/batch_20090501
/content/200905/batch_20090502
...

I would then run my nightly jobs to build the index on /content/200904/* for the April index and /content/200905/* for the May index.

I'm not sure whether I would be better off using MultipleOutputs or MultipleTextOutputFormat, and I'm having trouble understanding how to set the output path for either class.

MultipleTextOutputFormat seems to be about partitioning data by key into different files within the same directory, rather than into different directories as I need. Could I get the behavior I want by using date/batch as my file name, setting the output path to some temporary work directory, and then moving /work/* to /content?

MultipleOutputs seems to be more about writing all the data out in different formats, but it's supposed to be simpler to use. From what I've read, it is better documented and the API makes more sense (you choose the output explicitly in the map or reduce, rather than hiding that decision in the output format), but I don't see any way to set a file name. If I am using TextOutputFormat, I see no way to put the outputs into different directories.
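
To make the date/batch file name idea concrete, here is a rough sketch of the path logic I have in mind. The class and method names (BatchPaths, relativePath) are my own and purely hypothetical, and it uses only the JDK. My assumption, which I have not verified, is that returning a string like this from an override of MultipleTextOutputFormat's generateFileNameForKeyValue(key, value, name) would create the subdirectories under the job's output path.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: builds "<yyyyMM>/batch_<yyyyMMdd>/<leaf>" for one record.
final class BatchPaths {
    // yyyyMMdd, e.g. "20090430"
    private static final DateTimeFormatter DAY = DateTimeFormatter.BASIC_ISO_DATE;
    // yyyyMM, e.g. "200904"; this is the monthly bucket
    private static final DateTimeFormatter MONTH = DateTimeFormatter.ofPattern("yyyyMM");

    // contentDate: date of the record itself, which decides the monthly bucket.
    // batchDate:   date of the batch that delivered the record.
    // leafName:    the default file name, e.g. "part-00000".
    // Both dates are parsed so that malformed input fails fast.
    static String relativePath(String contentDate, String batchDate, String leafName) {
        LocalDate content = LocalDate.parse(contentDate, DAY);
        LocalDate batch = LocalDate.parse(batchDate, DAY);
        return MONTH.format(content) + "/batch_" + DAY.format(batch) + "/" + leafName;
    }

    public static void main(String[] args) {
        // A record dated in May that arrived in the April 30 batch.
        System.out.println(relativePath("20090503", "20090430", "part-00000"));
    }
}
```

With the output path set to a work directory, the record in main would end up under 200905/batch_20090430/ inside that directory, which I could then move to /content.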