Problems reading Parquet input from HDFS

Lukas Kircher Mon, 24 Apr 2017 09:19:40 -0700

Hello,

I am trying to read Parquet files from HDFS and am running into problems. I use
Avro for the schema. Here is a basic example:


public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    Job job = Job.getInstance();
    HadoopInputFormat<Void, Customer> hif = new HadoopInputFormat<>(
        new AvroParquetInputFormat(), Void.class, Customer.class, job);
    FileInputFormat.addInputPath((JobConf) job.getConfiguration(),
        new org.apache.hadoop.fs.Path("/tmp/tpchinput/01/customer_parquet"));
    Schema projection = Schema.createRecord(
        Customer.class.getSimpleName(), null, null, false);
    List<Schema.Field> fields = Arrays.asList(
        new Schema.Field("c_custkey", Schema.create(Schema.Type.INT), null,
            (Object) null)
    );
    projection.setFields(fields);
    AvroParquetInputFormat.setRequestedProjection(job, projection);

    DataSet<Tuple2<Void, Customer>> dataset = env.createInput(hif);
    dataset.print();
}

If I submit this to the job manager, I get the following stack trace:

java.lang.NoSuchMethodError: org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Ljava/lang/Object;)V
        at misc.Misc.main(Misc.java:29)

The problem is that I use parquet-avro 1.9.0 (which provides
AvroParquetInputFormat), and it depends on avro 1.8.0, whereas flink-core
itself depends on avro 1.7.7. For reference, the dependency tree looks like
this:

[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ flink-experiments ---
[INFO] ...:1.0-SNAPSHOT
[INFO] +- org.apache.flink:flink-java:jar:1.2.0:compile
[INFO] |  +- org.apache.flink:flink-core:jar:1.2.0:compile
[INFO] |  |  \- (org.apache.avro:avro:jar:1.7.7:compile - omitted for conflict with 1.8.0)
[INFO] |  \- org.apache.flink:flink-shaded-hadoop2:jar:1.2.0:compile
[INFO] |     \- (org.apache.avro:avro:jar:1.7.7:compile - omitted for duplicate)
[INFO] \- org.apache.parquet:parquet-avro:jar:1.9.0:compile
[INFO]    \- org.apache.avro:avro:jar:1.8.0:compile
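
To check which Avro jar actually wins at runtime, a small probe along these
lines might help. This is only a sketch: the class name ClasspathProbe and its
two helper methods are made up for illustration and are not part of Flink,
Avro, or Parquet. It uses plain JDK reflection to report where a class was
loaded from and whether the constructor named in the stack trace exists.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper; the names here are illustrative only.
public class ClasspathProbe {

    // Returns the jar or directory a class was loaded from, or a marker string.
    public static String locate(String className) {
        try {
            Class<?> c = Class.forName(className, false,
                ClasspathProbe.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            return src == null ? "(bootstrap or unknown)" : src.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    // Reports whether a public constructor with the given parameter types exists.
    public static boolean hasConstructor(String className, String... paramTypeNames) {
        try {
            ClassLoader loader = ClasspathProbe.class.getClassLoader();
            Class<?>[] params = new Class<?>[paramTypeNames.length];
            for (int i = 0; i < paramTypeNames.length; i++) {
                params[i] = Class.forName(paramTypeNames[i], false, loader);
            }
            Class.forName(className, false, loader).getConstructor(params);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String field = "org.apache.avro.Schema$Field";
        System.out.println("Schema.Field loaded from: " + locate(field));
        // Signature taken from the NoSuchMethodError above.
        System.out.println("Has (String, Schema, String, Object) constructor: "
            + hasConstructor(field, "java.lang.String", "org.apache.avro.Schema",
                "java.lang.String", "java.lang.Object"));
    }
}
```

The probe only tells the truth about the classpath it runs on, so it would
need to be called from the same entry point as the failing job.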

Fixing the above NoSuchMethodError just leads to further problems. Downgrading
parquet-avro to an older version creates other conflicts, as there is no
version that uses avro 1.7.7 like Flink does.

Is there a way around this or can you point me to another approach to read 
Parquet data from HDFS? How do you normally go about this?

Thanks for your help,
Lukas
