Hi, I am using Hadoop 0.20.1 and I have a problem which is similar to the one below:

Setup
-----
I have a web server log that looks like:

> serial number, domain, datetime, httpstatus

The web server outputs to a single log file for 1,000 domains. I would like an output in the following format:

> previous serial number, domain, previous datetime, previous httpstatus, current serial number, current datetime, current httpstatus

For example, an input of:

> 1728 ...
> 1729, hadoop.apache.org, 2009/10/18 08:23:15, 200
> 1730, ...
> ...
> 1735, ...
> 1736, hadoop.apache.org, 2009/10/18 08:23:19, 404
> 1737, ...
> ...
> 1741, ...
> 1742, hadoop.apache.org, 2009/10/18 08:23:24, 500
> 1743, ...
> ...
> 1750, ...
> 1751, hadoop.apache.org, 2009/10/18 08:23:30, 200
> 1752 ...

would output:

> 1729, hadoop.apache.org, 2009/10/18 08:23:15, 200, 1736, 2009/10/18 08:23:19, 404
> 1736, hadoop.apache.org, 2009/10/18 08:23:19, 404, 1742, 2009/10/18 08:23:24, 500
> 1742, hadoop.apache.org, 2009/10/18 08:23:24, 500, 1751, 2009/10/18 08:23:30, 200

My thoughts
-----------
I have separated the problem into 2 jobs:

1) Do a Secondary Sort on the log file to output a file sorted primarily by <domain>, then by <serial number>.
2a) Implement a custom TwoLineRecordReader<LongWritable, Text> that takes the previous output as its input. The custom RecordReader:
    i) During initialize(InputSplit, TaskAttemptContext), reads the first line.
    ii) During nextKeyValue(), reads the second line and sets <value> to firstLine + "|" + secondLine. It then sets firstLine to secondLine.
2b) The mapper and reducer generate the output.

I have been successful at job 1.

Problems
--------
It seems as though the job is not using TwoLineRecordReader, even though I've specified it through a custom InputFormat. Instead, when I do a println on <value> in the Mapper, it prints the lines of the input file unchanged.

> TwoLineInputFormat.addInputPath(job, new Path("output/sorted/part-r-00000"));
> TextOutputFormat.setOutputPath(job, new Path("output/transitions"));

Call to action
--------------
1) Perhaps I'm not thinking of the problem the right way. Would you suggest another way to solve it?
2) Am I implementing the custom RecordReader in the right way?

Thank you!

Regards,
Shahfik Amasha
Undergraduate
School of Information Systems
Singapore Management University
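
P.S. To make the intended pairing in step 2a concrete, here is a minimal sketch in plain Java with no Hadoop classes, so it can be checked outside a job. The names `TransitionPairer` and `pair` are hypothetical, and the sketch assumes job 1 emits lines in the same "serial, domain, datetime, httpstatus" layout, sorted by domain and then by serial number. Unlike the reader described above, it only pairs two lines when they share a domain, which I suspect the real reader or mapper would also need to check. It uses `String.join`, so it assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a Hadoop class): pairs each sorted log line with
// the next line of the same domain, mimicking the two-line sliding window.
class TransitionPairer {

    // Assumes every line is "serial, domain, datetime, httpstatus" and that
    // the list is already sorted by domain, then by serial number.
    static List<String> pair(List<String> sortedLines) {
        List<String> transitions = new ArrayList<>();
        String[] previous = null;
        for (String line : sortedLines) {
            String[] current = line.trim().split("\\s*,\\s*");
            if (current.length != 4) {
                continue; // skip malformed lines instead of failing
            }
            // Emit a transition only when both lines belong to the same domain.
            if (previous != null && previous[1].equals(current[1])) {
                transitions.add(String.join(", ",
                        previous[0], previous[1], previous[2], previous[3],
                        current[0], current[2], current[3]));
            }
            previous = current; // slide the window forward by one line
        }
        return transitions;
    }

    public static void main(String[] args) {
        List<String> sorted = Arrays.asList(
                "1729, hadoop.apache.org, 2009/10/18 08:23:15, 200",
                "1736, hadoop.apache.org, 2009/10/18 08:23:19, 404",
                "1742, hadoop.apache.org, 2009/10/18 08:23:24, 500",
                "1751, hadoop.apache.org, 2009/10/18 08:23:30, 200");
        for (String transition : pair(sorted)) {
            System.out.println(transition);
        }
    }
}
```

Running `main` on the four hadoop.apache.org lines from the example should print one transition line per consecutive pair. The sketch does not address what happens when an input file is divided into several splits, where the first line of one split would need to be paired with the last line of the previous one.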