Re: Filtering lines in parquet

2021-03-12 Thread Avi Levi
Cool, thanks! On Fri, Mar 12, 2021, 13:15 Arvid Heise wrote: …

Re: Filtering lines in parquet

2021-03-12 Thread Arvid Heise
Hi Avi, thanks for clarifying. It seems like it's not possible to parse Parquet in Flink without knowing the schema. What I'd do is to parse the metadata while setting up the job and then pass it to the input format: ParquetMetadata parquetMetadata = MetadataReader.readFooter(inputStream, path, …
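For reference, a minimal sketch of that setup step. The snippet above is truncated, so this is written against parquet-hadoop's ParquetFileReader rather than the MetadataReader call Arvid names; the class name FooterProbe and the file location are illustrative assumptions, not from the thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public class FooterProbe {
    public static MessageType readSchema(String location) throws Exception {
        Configuration conf = new Configuration();
        // Open the file and read only its footer; no row data is scanned.
        try (ParquetFileReader reader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(new Path(location), conf))) {
            ParquetMetadata footer = reader.getFooter();
            // The footer carries the full file schema, which can then be
            // handed to the input format while the job is being assembled.
            return footer.getFileMetaData().getSchema();
        }
    }
}

Reading only the footer is cheap: no row groups are scanned, so this can run once on the client during job setup.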

Re: Filtering lines in parquet

2021-03-11 Thread Avi Levi
Hi Arvid, assuming that I have A0, B0, C0 Parquet files with different schemas and a common field *ID*, I want to write them to A1, B2, C3 files respectively. My problem is that I do not want my code to know the full schema; I just want to filter using the ID field and write the unfiltered lines to the …
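A minimal sketch of that requirement, using parquet-hadoop's example Group API so the only field the code ever names is the ID column. The file locations, the lower-case field name "id", and its string encoding are all assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.example.GroupReadSupport;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public class FilterCopy {
    // Copies every record whose "id" does not equal "123" from in to out,
    // reusing the input file's own schema so it never has to be spelled out.
    public static void filterFile(String in, String out) throws Exception {
        Configuration conf = new Configuration();
        MessageType schema;
        try (ParquetFileReader footerReader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(new Path(in), conf))) {
            schema = footerReader.getFooter().getFileMetaData().getSchema();
        }
        try (ParquetReader<Group> reader =
                 ParquetReader.builder(new GroupReadSupport(), new Path(in)).build();
             ParquetWriter<Group> writer = ExampleParquetWriter.builder(new Path(out))
                 .withType(schema)
                 .build()) {
            Group g;
            while ((g = reader.read()) != null) {
                // Assumes "id" is stored as a string-typed column.
                if (!"123".equals(g.getString("id", 0))) {
                    writer.write(g);
                }
            }
        }
    }
}

Calling filterFile once per A0/B0/C0 input, each with its own output path, keeps every file's schema intact without ever declaring it. The same footer-first pattern carries over to the Flink input formats discussed below.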

Re: Filtering lines in parquet

2021-03-11 Thread Arvid Heise
Hi Avi, I'm not entirely sure I understand the question. Let's say you have sources A, B, and C, all with different schemas but all with an id. You could use the ParquetMapInputFormat, which provides a map of the records, and just use a map lookup. However, I'm not sure how you want to write these records …
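A sketch of the map-lookup idea, assuming the legacy ParquetMapInputFormat shipped in flink-parquet around the Flink releases current at the time of this thread; the DataSet wiring, the file location, and the reuse of the FooterProbe helper from the sketch further up are assumptions:

import java.util.Map;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.ParquetMapInputFormat;
import org.apache.parquet.schema.MessageType;

public class MapLookupFilter {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // Schema discovered from the file footer at setup time (see FooterProbe above).
        MessageType schema = FooterProbe.readSchema("hdfs:///data/A0.parquet");

        // Each record arrives as a Map keyed by field name, so only the
        // one field used for filtering needs to be known in the code.
        DataSet<Map> records = env.createInput(
            new ParquetMapInputFormat(new Path("hdfs:///data/A0.parquet"), schema),
            TypeInformation.of(Map.class));

        DataSet<Map> kept = records.filter(r -> !"123".equals(String.valueOf(r.get("id"))));
        kept.print();
    }
}

Because each record surfaces as a Map, the job never hard-codes the full schema; the MessageType read from the footer travels with the input format instead.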

Filtering lines in parquet

2021-03-10 Thread Avi Levi
Hi all, I am trying to filter lines from Parquet files. The problem is that they have different schemas; however, the field that I am using to filter exists in all schemas. In Spark this is quite straightforward: *val filtered = rawsDF.filter(col("id") =!= "123")* I tried to do it in Flink by ext…
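For comparison, the same Spark baseline expressed with Spark's Java API (a sketch: the session setup and input location are assumptions, and reading files with differing schemas in one pass may additionally need the mergeSchema read option):

import static org.apache.spark.sql.functions.col;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkBaseline {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("filter-by-id").getOrCreate();
        Dataset<Row> rawsDF = spark.read().parquet("hdfs:///data/");
        // Keep every row whose id column is not "123".
        Dataset<Row> filtered = rawsDF.filter(col("id").notEqual("123"));
        filtered.show();
    }
}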