Hello,
I am trying to read Parquet files from HDFS and am running into problems. I use
Avro for the schema. Here is a basic example:
public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    Job job = Job.getInstance();
    HadoopInputFormat<Void, Customer> hif = new HadoopInputFormat<>(
            new AvroParquetInputFormat(), Void.class, Customer.class, job);
    FileInputFormat.addInputPath((JobConf) job.getConfiguration(),
            new org.apache.hadoop.fs.Path("/tmp/tpchinput/01/customer_parquet"));

    Schema projection = Schema.createRecord(Customer.class.getSimpleName(), null, null, false);
    List<Schema.Field> fields = Arrays.asList(
            new Schema.Field("c_custkey", Schema.create(Schema.Type.INT), null, (Object) null)
    );
    projection.setFields(fields);
    AvroParquetInputFormat.setRequestedProjection(job, projection);

    DataSet<Tuple2<Void, Customer>> dataset = env.createInput(hif);
    dataset.print();
}
If I submit this to the job manager I get the following stack trace:
java.lang.NoSuchMethodError: org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Ljava/lang/Object;)V
    at misc.Misc.main(Misc.java:29)
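The problem is that I use parquet-avro (which provides AvroParquetInputFormat)
in version 1.9.0, which depends on Avro 1.8.0, while flink-core itself depends
on Avro 1.7.7. Maven resolves the conflict in favor of 1.8.0, so my code
compiles against the Schema.Field(String, Schema, String, Object) constructor,
but at runtime the older Avro from Flink is apparently loaded, and it does not
have that constructor. FYI, the dependency tree looks like this: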
[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ flink-experiments ---
[INFO] ...:1.0-SNAPSHOT
[INFO] +- org.apache.flink:flink-java:jar:1.2.0:compile
[INFO] |  +- org.apache.flink:flink-core:jar:1.2.0:compile
[INFO] |  |  \- (org.apache.avro:avro:jar:1.7.7:compile - omitted for conflict with 1.8.0)
[INFO] |  \- org.apache.flink:flink-shaded-hadoop2:jar:1.2.0:compile
[INFO] |     \- (org.apache.avro:avro:jar:1.7.7:compile - omitted for duplicate)
[INFO] \- org.apache.parquet:parquet-avro:jar:1.9.0:compile
[INFO]    \- org.apache.avro:avro:jar:1.8.0:compile
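
To double-check which Avro jar is really picked up at runtime, a small
reflection helper along these lines might be useful. This is only a sketch: the
class ClasspathCheck and its two methods are hypothetical names made up for
illustration and are not part of Flink, Avro, or Parquet. It uses only standard
JDK reflection calls.

```java
import java.lang.reflect.Constructor;
import java.security.CodeSource;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper (not part of Flink, Avro, or Parquet):
// reports where a class was loaded from and which constructors it has.
final class ClasspathCheck {

    // Returns the jar or directory the class was loaded from, or a short
    // note if the class is missing or has no recorded code source.
    static String locationOf(String className) {
        try {
            Class<?> cls = Class.forName(className);
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                return "(bootstrap or unknown source)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    // Lists the constructors the class declares in the loaded version;
    // returns an empty list if the class cannot be loaded.
    static List<String> constructorsOf(String className) {
        List<String> result = new ArrayList<>();
        try {
            for (Constructor<?> c : Class.forName(className).getDeclaredConstructors()) {
                result.add(c.toString());
            }
        } catch (ClassNotFoundException | LinkageError e) {
            // Class is missing or broken: leave the list empty.
        }
        return result;
    }

    public static void main(String[] args) {
        // Defaults to the class from the stack trace; '$' marks the nested class.
        String name = args.length > 0 ? args[0] : "org.apache.avro.Schema$Field";
        System.out.println(name + " loaded from: " + locationOf(name));
        for (String ctor : constructorsOf(name)) {
            System.out.println("  " + ctor);
        }
    }
}
```

Calling ClasspathCheck.locationOf("org.apache.avro.Schema$Field") at the start
of the job's main method, once locally and once when submitted to the cluster,
should show whether the two environments load different Avro jars.
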
Working around the NoSuchMethodError above just leads to further problems.
Downgrading parquet-avro to an older version creates other conflicts, as there
is no version that uses Avro 1.7.7 like Flink does.
Is there a way around this or can you point me to another approach to read
Parquet data from HDFS? How do you normally go about this?
Thanks for your help,
Lukas