Thanks for your suggestions.

@Flavio This is very similar to the code I use and yields basically the same problems. The examples there are based on flink-1.0-SNAPSHOT and avro-1.7.6, which is more than three years old. Do you have a working setup with newer versions of Avro and Flink?
@Jörn I tried to do that, but I can't see how to get around AvroParquetInputFormat (see below). I can pass a schema for the projection as a string, but then I get a NullPointerException because no ReadSupport class is configured for the ParquetInputFormat. There is a constructor that instantiates ParquetInputFormat with a class extending ReadSupport, but I haven't found a suitable one to pass to it. Do you know of a way around this?

public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    Job job = Job.getInstance();
    HadoopInputFormat<Void, Customer> hif = new HadoopInputFormat<>(
        new ParquetInputFormat(), Void.class, Customer.class, job);
    FileInputFormat.addInputPath((JobConf) job.getConfiguration(),
        new org.apache.hadoop.fs.Path("/tmp/tpchinput/01/customer_parquet"));
    job.getConfiguration().set("parquet.avro.projection",
        "{\"type\":\"record\",\"name\":\"Customer\",\"fields\":[{\"name\":\"c_custkey\",\"type\":\"int\"}]}");

    env.createInput(hif).print();
}

I am pretty sure I am missing something very basic. Let me know if you need any additional information.

Thanks ...
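P.S. One idea I have not been able to test yet: parquet-hadoop ships a GroupReadSupport in org.apache.parquet.hadoop.example, which extends ReadSupport and needs no Avro classes at all, so the avro version conflict would disappear from the classpath. The snippet below is pieced together from the parquet-mr sources rather than a working setup; in particular the Parquet message schema string for the projection and the wrapper class name are my guesses:

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetInputFormat;
import org.apache.parquet.hadoop.api.ReadSupport;
import org.apache.parquet.hadoop.example.GroupReadSupport;

public class PlainParquetRead {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        Job job = Job.getInstance();
        // GroupReadSupport materializes generic Group records, so neither
        // parquet-avro nor a second avro version is needed. It is passed to
        // the ParquetInputFormat constructor mentioned above.
        HadoopInputFormat<Void, Group> hif = new HadoopInputFormat<>(
            new ParquetInputFormat<Group>(GroupReadSupport.class),
            Void.class, Group.class, job);
        FileInputFormat.addInputPath((JobConf) job.getConfiguration(),
            new org.apache.hadoop.fs.Path("/tmp/tpchinput/01/customer_parquet"));
        // Projection expressed as a Parquet message type instead of an Avro
        // schema; GroupReadSupport picks this key up in its init() method.
        // "required int32 c_custkey" is my guess at the file's layout.
        job.getConfiguration().set(ReadSupport.PARQUET_READ_SCHEMA,
            "message Customer { required int32 c_custkey; }");

        env.createInput(hif).print();
    }
}

No idea yet how well Flink handles Group records downstream, but it would at least avoid mixing two avro versions.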
> On 24 Apr 2017, at 20:51, Flavio Pompermaier <pomperma...@okkam.it> wrote:
> 
> I started from this guide:
> 
> https://github.com/FelixNeutatz/parquet-flinktacular
> 
> Best,
> Flavio
> 
> On 24 Apr 2017 6:36 pm, "Jörn Franke" <jornfra...@gmail.com> wrote:
> Why not use a Parquet-only format? Not sure why you need an AvroParquetInputFormat.
> 
> On 24. Apr 2017, at 18:19, Lukas Kircher <lukas.kirc...@uni-konstanz.de> wrote:
> 
>> Hello,
>> 
>> I am trying to read Parquet files from HDFS and am having problems. I use Avro for the schema. Here is a basic example:
>> 
>> public static void main(String[] args) throws Exception {
>>     ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>> 
>>     Job job = Job.getInstance();
>>     HadoopInputFormat<Void, Customer> hif = new HadoopInputFormat<>(
>>         new AvroParquetInputFormat(), Void.class, Customer.class, job);
>>     FileInputFormat.addInputPath((JobConf) job.getConfiguration(),
>>         new org.apache.hadoop.fs.Path("/tmp/tpchinput/01/customer_parquet"));
>>     Schema projection = Schema.createRecord(Customer.class.getSimpleName(), null, null, false);
>>     List<Schema.Field> fields = Arrays.asList(
>>         new Schema.Field("c_custkey", Schema.create(Schema.Type.INT), null, (Object) null)
>>     );
>>     projection.setFields(fields);
>>     AvroParquetInputFormat.setRequestedProjection(job, projection);
>> 
>>     DataSet<Tuple2<Void, Customer>> dataset = env.createInput(hif);
>>     dataset.print();
>> }
>> 
>> If I submit this to the job manager I get the following stack trace:
>> 
>> java.lang.NoSuchMethodError: org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Ljava/lang/Object;)V
>>     at misc.Misc.main(Misc.java:29)
>> 
>> The problem is that I use the parquet-avro dependency (which provides AvroParquetInputFormat) in version 1.9.0, which relies on avro 1.8.0. flink-core itself relies on avro 1.7.7. Jfyi, the dependency tree looks like this:
>> 
>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ flink-experiments ---
>> [INFO] ...:1.0-SNAPSHOT
>> [INFO] +- org.apache.flink:flink-java:jar:1.2.0:compile
>> [INFO] |  +- org.apache.flink:flink-core:jar:1.2.0:compile
>> [INFO] |  |  \- (org.apache.avro:avro:jar:1.7.7:compile - omitted for conflict with 1.8.0)
>> [INFO] |  \- org.apache.flink:flink-shaded-hadoop2:jar:1.2.0:compile
>> [INFO] |     \- (org.apache.avro:avro:jar:1.7.7:compile - omitted for duplicate)
>> [INFO] \- org.apache.parquet:parquet-avro:jar:1.9.0:compile
>> [INFO]    \- org.apache.avro:avro:jar:1.8.0:compile
>> 
>> Fixing the above NoSuchMethodError just leads to further problems. Downgrading parquet-avro to an older version creates other conflicts, as there is no release that uses avro 1.7.7 like Flink does.
>> 
>> Is there a way around this, or can you point me to another approach to read Parquet data from HDFS? How do you normally go about this?
>> 
>> Thanks for your help,
>> Lukas
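P.P.S. For the version conflict itself, one more thing on my list to try (untested): pinning Avro to a single version for the whole tree via Maven's dependencyManagement. This can only work if Flink 1.2 actually runs against Avro 1.8.0, which I don't know yet:

<!-- Untested assumption: Flink 1.2 tolerates Avro 1.8.0 at runtime. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro</artifactId>
      <version>1.8.0</version>
    </dependency>
  </dependencies>
</dependencyManagement>

Relocating Avro inside my own fat jar with the maven-shade-plugin might be the more robust route, but I have not tried that either.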