Marius Posta created AVRO-1976:
----------------------------------
Summary: Add Input/OutputFormat to read/write encoded objects
Key: AVRO-1976
URL: https://issues.apache.org/jira/browse/AVRO-1976
Project: Avro
Issue Type: Improvement
Components: java
Environment: hadoop
Reporter: Marius Posta
Priority: Minor
In certain cases, the performance of Avro map-reduce jobs can be considerably
improved by decoupling Avro encoding from the actual Avro container file IO.
In my case, a complex schema (100+ record fields) and large HDFS blocks
resulted in Spark jobs where many workers were idling while a couple of
them were busy decoding their input splits. Furthermore, the objects then needed
to be re-encoded in order to be shuffled, which crippled performance
further.
I propose the addition of an InputFormat which reads a container file input
split as key-value pairs, in which the key is the file header and the value is
the decompressed file data block. I also propose an OutputFormat which follows
the same logic for writing.
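
To make the proposed key-value shape concrete, here is a minimal sketch using
only the JDK. It is not Avro or Hadoop code: the class EncodedBlockSketch, its
methods, and the toy "container" (a header byte array plus a list of
zlib-compressed blocks) are all hypothetical stand-ins, and the real container
layout and codecs would differ. The point it illustrates is that the reader
only decompresses each block and pairs it with the header; the records inside
stay encoded, so decoding can happen later, after the data has been
redistributed.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical illustration only; none of these names exist in Avro or Hadoop. */
public class EncodedBlockSketch {

    /** Write side: compresses one block of already-encoded records. */
    public static byte[] compress(byte[] encodedRecords) {
        Deflater deflater = new Deflater();
        deflater.setInput(encodedRecords);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Read side: decompresses a block but does NOT decode the records in it. */
    public static byte[] decompress(byte[] compressedBlock) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressedBlock);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or invalid block");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Emits one (header, decompressed block) pair per data block. */
    public static List<Map.Entry<byte[], byte[]>> readPairs(
            byte[] header, List<byte[]> compressedBlocks) {
        List<Map.Entry<byte[], byte[]>> pairs = new ArrayList<>();
        for (byte[] block : compressedBlocks) {
            pairs.add(new SimpleImmutableEntry<>(header, decompress(block)));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Placeholder bytes standing in for a real header and encoded records.
        byte[] header = "schema+codec".getBytes(StandardCharsets.UTF_8);
        List<byte[]> blocks = new ArrayList<>();
        blocks.add(compress("encoded-records-0".getBytes(StandardCharsets.UTF_8)));
        blocks.add(compress("encoded-records-1".getBytes(StandardCharsets.UTF_8)));

        for (Map.Entry<byte[], byte[]> pair : readPairs(header, blocks)) {
            System.out.println("block of " + pair.getValue().length + " encoded bytes");
        }
    }
}
```

In this shape the values can be repartitioned as opaque byte arrays, and each
worker decodes only the blocks it ends up with. The writing side would mirror
this by accepting (header, encoded block) pairs and compressing each block on
the way out.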