Hi Brian,

Please see my answers inline.
From: Brian Hulette <bhule...@google.com>
Reply-To: "user@beam.apache.org" <user@beam.apache.org>
Date: Wednesday, January 6, 2021 at 10:43 AM
To: user <user@beam.apache.org>
Subject: Re: Quick question regarding ParquetIO

Hey Tao,

It does look like BEAM-11460 could work for you. Note that it relies on a dynamic object, which won't work with schema-aware transforms and SqlTransform. That's likely not a problem for you; I just wanted to point it out.

[tao] I just need a PCollection<GenericRecord> from the IO. Then I can apply the code below to enable schema transforms (I have verified that this code works):

    setSchema(
        AvroUtils.toBeamSchema(schema),
        new TypeDescriptor<GenericRecord>() {},
        AvroUtils.getGenericRecordToRowFunction(AvroUtils.toBeamSchema(schema)),
        AvroUtils.getRowToGenericRecordFunction(schema))

Out of curiosity, for your use case would it be acceptable if Beam peeked at the files at pipeline construction time to determine the schema for you? This is what we're doing for the new IOs in the Python SDK's DataFrame API. They're based on the pandas read_* methods, and use those methods at construction time to determine the schema.

[tao] If I understand correctly, the behavior of the new DataFrame APIs you are mentioning is very similar to the Spark Parquet reader's behavior. If that's the case, then it's probably what I am looking for 😊

Brian

On Wed, Jan 6, 2021 at 10:13 AM Alexey Romanenko <aromanenko....@gmail.com> wrote:

Hi Tao,

This Jira [1] looks like exactly what you are asking for, but it was merged only recently (thanks to Anant Damle for working on this!) and should be available starting in Beam 2.28.0.
[1] https://issues.apache.org/jira/browse/BEAM-11460

Regards,
Alexey

On 6 Jan 2021, at 18:57, Tao Li <t...@zillow.com> wrote:

Hi Beam community,

Quick question about ParquetIO (https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/parquet/ParquetIO.html). Is there a way to avoid specifying the Avro schema when reading Parquet files? The reason is that we may not know the Parquet schema until we read the files. In comparison, the Spark Parquet reader (https://spark.apache.org/docs/latest/sql-data-sources-parquet.html) does not require such a schema specification.

Please advise. Thanks a lot!
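For readers landing on this thread later, here is a minimal sketch of the schema-less read path that BEAM-11460 added (ParquetIO.parseGenericRecords, available from Beam 2.28.0). This is an unverified sketch: the file pattern is a placeholder, the parse function simply stringifies each record, and the explicit coder is set because the lambda's output type cannot always be inferred.

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.values.PCollection;

public class ParseParquetWithoutSchema {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // BEAM-11460 (Beam 2.28.0+): no Avro schema is supplied to the read.
    // ParquetIO discovers it from the files and hands each GenericRecord
    // to the parse function.
    PCollection<String> parsed =
        p.apply(
             ParquetIO.parseGenericRecords(
                     (GenericRecord record) -> record.toString())
                 .from("gs://my-bucket/input-*.parquet")) // placeholder path
         .setCoder(StringUtf8Coder.of()); // lambda output type is erased, so fix the coder

    p.run().waitUntilFinish();
  }
}
```

Note that the parse function yields a PCollection of an ordinary Java type, not GenericRecord, so the setSchema/AvroUtils wiring shown earlier in the thread is still needed if you want schema-aware transforms or SqlTransform downstream.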