Re: Quick question regarding ParquetIO

Alexey Romanenko Thu, 07 Jan 2021 09:56:36 -0800

If you want just a PCollection<GenericRecord> as output, you would still 
need to set an AvroCoder, but which schema would you use in that case?
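
One possible way around the schema question, as a rough sketch only: have the parse function return a type whose encoding does not depend on an Avro schema, for example a String. The snippet below uses only the JDK. A Map<String, Object> stands in for GenericRecord, and ParseFnSketch and TO_LINE are hypothetical names for illustration, not part of Beam or Avro.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Stand-in sketch: a Map<String, Object> plays the role of an Avro GenericRecord
// (an assumption for illustration; this is not the Beam or Avro API).
final class ParseFnSketch {

  // Hypothetical parse function: flattens a record into a "name=value,..." line.
  // The output type is String, so encoding it needs no schema.
  static final Function<Map<String, Object>, String> TO_LINE =
      record ->
          record.entrySet().stream()
              .map(e -> e.getKey() + "=" + e.getValue())
              .collect(Collectors.joining(","));

  public static void main(String[] args) {
    // LinkedHashMap keeps insertion order, so fields print in the order added.
    Map<String, Object> record = new LinkedHashMap<>();
    record.put("id", 42);
    record.put("city", "Seattle");
    System.out.println(TO_LINE.apply(record));
  }
}
```

In a real pipeline the same shape of function would be what gets passed to parseGenericRecords. Whether that is acceptable depends on whether downstream steps really need GenericRecord or can work with a schema-independent representation.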


> On 6 Jan 2021, at 19:53, Tao Li <t...@zillow.com> wrote:
> 
> Hi Alexey,
>  
> Thank you so much for this info. I will definitely give it a try once 2.28 is 
> released.
>  
> Regarding this feature, it basically mimics the corresponding feature in 
> AvroIO: https://beam.apache.org/releases/javadoc/2.26.0/org/apache/beam/sdk/io/AvroIO.html
>  
> I have one more quick question regarding the “reading records of an unknown 
> schema” scenario. In the sample code a PCollection<Foo> is returned and 
> parseGenericRecords requires parsing logic. What if I just want to get 
> a PCollection<GenericRecord> instead of a specific class (e.g. Foo in the 
> example)? I guess I can just skip the ParquetIO.parseGenericRecords 
> transform? Or do I still have to specify dummy parsing logic like below? 
> Thanks!
>  
> p.apply(AvroIO.parseGenericRecords(new SerializableFunction<GenericRecord, 
> GenericRecord>() {
>        // Identity parse function: returns each record unchanged.
>        public GenericRecord apply(GenericRecord record) {
>          return record;
>        }
>      }));
>  
> From: Alexey Romanenko <aromanenko....@gmail.com>
> Reply-To: "user@beam.apache.org" <user@beam.apache.org>
> Date: Wednesday, January 6, 2021 at 10:13 AM
> To: "user@beam.apache.org" <user@beam.apache.org>
> Subject: Re: Quick question regarding ParquetIO
>  
> Hi Tao,
>  
> This Jira issue [1] looks like exactly what you are asking for, but it was 
> merged only recently (thanks to Anant Damle for working on this!), so it 
> should be available only starting with Beam 2.28.0.
>  
> [1] https://issues.apache.org/jira/browse/BEAM-11460 
>  
> Regards,
> Alexey
> 
> 
>> On 6 Jan 2021, at 18:57, Tao Li <t...@zillow.com> wrote:
>>  
>> Hi Beam community,
>>  
>> Quick question about ParquetIO 
>> <https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/parquet/ParquetIO.html>.
>> Is there a way to avoid specifying the Avro schema when reading Parquet 
>> files? The reason is that we may not know the Parquet schema until we read 
>> the files. In comparison, the Spark Parquet reader 
>> <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html> 
>> does not require such a schema specification.
>>  
>> Please advise. Thanks a lot!
