Alexey,

Why do I need to set AvroCoder? I assume with BEAM-11460 we don’t need to specify a schema when reading parquet files to get a PCollection<GenericRecord>. Is my understanding correct? Am I missing anything here?
Thanks!

From: Alexey Romanenko <aromanenko....@gmail.com>
Reply-To: "user@beam.apache.org" <user@beam.apache.org>
Date: Thursday, January 7, 2021 at 9:56 AM
To: "user@beam.apache.org" <user@beam.apache.org>
Subject: Re: Quick question regarding ParquetIO

If you want to get just a PCollection<GenericRecord> as output then you would still need to set AvroCoder, but which schema to use in this case?

On 6 Jan 2021, at 19:53, Tao Li <t...@zillow.com> wrote:

Hi Alexey,

Thank you so much for this info. I will definitely give it a try once 2.28 is released.

Regarding this feature, it’s basically mimicking the feature from AvroIO: https://beam.apache.org/releases/javadoc/2.26.0/org/apache/beam/sdk/io/AvroIO.html

I have one more quick question regarding the “reading records of an unknown schema” scenario. In the sample code a PCollection<Foo> is being returned and the parseGenericRecords requires a parsing logic. What if I just want to get a PCollection<GenericRecord> instead of a specific class (e.g. Foo in the example)? I guess I can just skip the ParquetIO.parseGenericRecords transform? So do I still have to specify the dummy parsing logic like below? Thanks!

p.apply(AvroIO.parseGenericRecords(new SerializableFunction<GenericRecord, GenericRecord>() {
  public GenericRecord apply(GenericRecord record) {
    // Identity "parsing": return the record unchanged.
    return record;
  }
}));

From: Alexey Romanenko <aromanenko....@gmail.com>
Reply-To: "user@beam.apache.org" <user@beam.apache.org>
Date: Wednesday, January 6, 2021 at 10:13 AM
To: "user@beam.apache.org" <user@beam.apache.org>
Subject: Re: Quick question regarding ParquetIO

Hi Tao,

This jira [1] looks like exactly what you are asking for, but it was merged recently (thanks to Anant Damle for working on this!) and it should be available only in Beam 2.28.0.

[1] https://issues.apache.org/jira/browse/BEAM-11460

Regards,
Alexey

On 6 Jan 2021, at 18:57, Tao Li <t...@zillow.com> wrote:

Hi beam community,

Quick question about ParquetIO <https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/parquet/ParquetIO.html>. Is there a way to avoid specifying the avro schema when reading parquet files? The reason is that we may not know the parquet schema until we read the files.
In comparison, spark parquet reader <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html> does not require such a schema specification. Please advise. Thanks a lot!