Steve Loughran created HADOOP-17943:
---------------------------------------
             Summary: Add s3a tool to convert S3 server logs to avro/csv files
                 Key: HADOOP-17943
                 URL: https://issues.apache.org/jira/browse/HADOOP-17943
             Project: Hadoop Common
          Issue Type: Sub-task
          Components: fs/s3
    Affects Versions: 3.3.2
            Reporter: Steve Loughran


Add s3a tool to convert S3 server logs to avro/csv files

With S3A Auditing, we have code in hadoop-aws to parse S3 log entries, including splitting up the referrer into its fields. But we don't have an easy way of using it. I've done some early work in Spark, but as well as that code not working ([https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/com/cloudera/spark/cloud/s3/S3LogRecordParser.scala]), it doesn't do the audit splitting. And, given that the S3 audit logs can be small on a lightly loaded store, a Spark job is not always justified.

Proposed: we add
# a utility parser class to take a row and split it into a record
# which can be saved to Avro through a schema we define
# or exported to CSV with/without headers (with: easy to understand; without: can cat files)
# a mapper so this can be used in MR jobs (could even make it a committer test ...)
# and a "hadoop s3guard/hadoop s3" entry point so you can do it on the CLI

{code:java}
hadoop s3 parselogs -format avro -out s3a://dest/path -recursive s3a://stevel-london/logs/bucket1/*
{code}

This would take all files under the path, then load, parse and emit the output.

Design issues:
* Would you combine all files, or emit a new .avro or .csv file for each one?
* What's a good Avro schema to cope with new context attributes?
* CSV nuances: tabs vs spaces; use opencsv or implement the (escaping?) writer ourselves. Me: TSV, with a minimal escaping and quoting emitter. Can use opencsv in the test suite.
* Would you want an initial filter during processing, especially for exit codes? Me: no, though I could see the benefit for 503s. Best to let you load it into a notebook or spreadsheet and go from there.
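
To make the "minimal escaping and quoting emitter" idea concrete, here is a rough Java sketch of what a TSV writer with optional headers might look like. Everything in it is an assumption for illustration: the class name {{TsvEmitter}}, its methods, and the quoting rule (quote a field only when it contains a tab, double quote, CR or LF, and double any embedded quotes) are hypothetical and are not existing hadoop-aws code.

```java
import java.util.List;
import java.util.StringJoiner;

/**
 * Hypothetical sketch of a minimal TSV emitter; not part of hadoop-aws.
 * Assumed rule: a field is quoted only if it contains a tab, a double
 * quote, CR or LF; embedded double quotes are doubled.
 */
final class TsvEmitter {

  private TsvEmitter() {
  }

  /** Escape one field; null is emitted as an empty string. */
  static String escape(String field) {
    if (field == null) {
      return "";
    }
    boolean needsQuoting = field.indexOf('\t') >= 0
        || field.indexOf('"') >= 0
        || field.indexOf('\n') >= 0
        || field.indexOf('\r') >= 0;
    if (!needsQuoting) {
      return field;
    }
    return "\"" + field.replace("\"", "\"\"") + "\"";
  }

  /** Join the escaped fields of one record with tabs (no trailing newline). */
  static String row(List<String> fields) {
    StringJoiner joiner = new StringJoiner("\t");
    for (String field : fields) {
      joiner.add(escape(field));
    }
    return joiner.toString();
  }

  /**
   * Emit all records, one per line, optionally preceded by a header row.
   * Without the header, outputs from several files can simply be
   * concatenated.
   */
  static String emit(List<String> header, List<List<String>> rows,
      boolean withHeader) {
    StringBuilder sb = new StringBuilder();
    if (withHeader) {
      sb.append(row(header)).append('\n');
    }
    for (List<String> fields : rows) {
      sb.append(row(fields)).append('\n');
    }
    return sb.toString();
  }
}
```

For example, {{TsvEmitter.emit(List.of("op", "status"), List.of(List.of("GET", "200")), true)}} would produce a header line followed by one tab-separated record; the column names here are placeholders, not a proposed schema. Quoting only when needed keeps the common case readable in a terminal while still letting opencsv (configured for tab separators) read the output back in a test.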