Alexey,

Why do I need to set AvroCoder? I assume with BEAM-11460 we don’t need to 
specify a schema when reading parquet files to get a 
PCollection<GenericRecord>. Is my understanding correct? Am I missing anything 
here?

Thanks!

From: Alexey Romanenko <aromanenko....@gmail.com>
Reply-To: "user@beam.apache.org" <user@beam.apache.org>
Date: Thursday, January 7, 2021 at 9:56 AM
To: "user@beam.apache.org" <user@beam.apache.org>
Subject: Re: Quick question regarding ParquetIO

If you want to get just a PCollection<GenericRecord> as output, then you would 
still need to set an AvroCoder, but which schema would you use in that case?


On 6 Jan 2021, at 19:53, Tao Li <t...@zillow.com> wrote:

Hi Alexey,

Thank you so much for this info. I will definitely give it a try once 2.28 is 
released.

Regarding this feature, it’s basically mimicking the feature from 
AvroIO: https://beam.apache.org/releases/javadoc/2.26.0/org/apache/beam/sdk/io/AvroIO.html

I have one more quick question regarding the “reading records of an unknown 
schema” scenario. In the sample code a PCollection<Foo> is returned, and 
parseGenericRecords requires parsing logic. What if I just want a 
PCollection<GenericRecord> instead of a specific class (e.g. Foo in the 
example)? Can I skip the ParquetIO.parseGenericRecords transform, or do I 
still have to specify dummy parsing logic like the one below? Thanks!

p.apply(AvroIO.parseGenericRecords(new SerializableFunction<GenericRecord, GenericRecord>() {
       @Override
       public GenericRecord apply(GenericRecord record) {
         // Identity "parser": return the record unchanged.
         return record;
       }
     }));

From: Alexey Romanenko <aromanenko....@gmail.com>
Reply-To: "user@beam.apache.org" <user@beam.apache.org>
Date: Wednesday, January 6, 2021 at 10:13 AM
To: "user@beam.apache.org" <user@beam.apache.org>
Subject: Re: Quick question regarding ParquetIO

Hi Tao,

This Jira issue [1] looks like exactly what you are asking for, but it was 
merged only recently (thanks to Anant Damle for working on this!), so it should 
be available starting with Beam 2.28.0.

[1] https://issues.apache.org/jira/browse/BEAM-11460

Regards,
Alexey



On 6 Jan 2021, at 18:57, Tao Li <t...@zillow.com> wrote:

Hi Beam community,

Quick question about ParquetIO 
(https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/parquet/ParquetIO.html). 
Is there a way to avoid specifying the Avro schema when reading Parquet files? 
The reason is that we may not know the Parquet schema until we read the files. 
In comparison, the Spark Parquet reader 
(https://spark.apache.org/docs/latest/sql-data-sources-parquet.html) 
does not require such a schema specification.

Please advise. Thanks a lot!

