On Wed, Jan 6, 2021 at 11:07 AM Tao Li <t...@zillow.com> wrote:

> Hi Brian,
>
> Please see my answers inline.
>
> *From:* Brian Hulette <bhule...@google.com>
> *Reply-To:* "user@beam.apache.org" <user@beam.apache.org>
> *Date:* Wednesday, January 6, 2021 at 10:43 AM
> *To:* user <user@beam.apache.org>
> *Subject:* Re: Quick question regarding ParquetIO
>
> Hey Tao,
>
> It does look like BEAM-11460 could work for you. Note that it relies on a
> dynamic object, which won't work with schema-aware transforms and
> SqlTransform. It's likely this isn't a problem for you; I just wanted to
> point it out.
>
> [taol] I just need a PCollection<GenericRecord> from the IO. Then I can
> apply the code below to enable schema transforms (I have verified that
> this code works):
>
>     setSchema(
>         AvroUtils.toBeamSchema(schema),
>         new TypeDescriptor[GenericRecord]() {},
>         AvroUtils.getGenericRecordToRowFunction(AvroUtils.toBeamSchema(schema)),
>         AvroUtils.getRowToGenericRecordFunction(schema))
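For context, the snippet above can be written as compilable Java along these lines (a sketch, not the exact code from the thread: the wrapper method name is illustrative, and the Beam and Avro SDKs must be on the classpath):

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.schemas.utils.AvroUtils;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptor;

public class SchemaAttach {
  // Attach a Beam schema to a PCollection<GenericRecord> so that
  // schema-aware transforms (Select, SqlTransform, ...) can consume it.
  // The Avro schema must already be known at this point.
  static PCollection<GenericRecord> attachBeamSchema(
      PCollection<GenericRecord> records, org.apache.avro.Schema avroSchema) {
    return records.setSchema(
        AvroUtils.toBeamSchema(avroSchema),
        new TypeDescriptor<GenericRecord>() {},
        AvroUtils.getGenericRecordToRowFunction(AvroUtils.toBeamSchema(avroSchema)),
        AvroUtils.getRowToGenericRecordFunction(avroSchema));
  }
}
```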
> This requires specifying the Avro schema, doesn't it?
>
> Out of curiosity, for your use case would it be acceptable if Beam peeked
> at the files at pipeline construction time to determine the schema for
> you? This is what we're doing for the new IOs in the Python SDK's
> DataFrame API. They're based on the pandas read_* methods, and use those
> methods at construction time to determine the schema.
>
> [taol] If I understand correctly, the behavior of the new DataFrame APIs
> you are mentioning is very similar to the Spark parquet reader's
> behavior. If that's the case, then it's probably what I am looking for 😊
>
> Brian
>
> On Wed, Jan 6, 2021 at 10:13 AM Alexey Romanenko <aromanenko....@gmail.com>
> wrote:
>
> Hi Tao,
>
> This Jira [1] looks like exactly what you are asking for, but it was
> merged only recently (thanks to Anant Damle for working on this!), so it
> should be available only in Beam 2.28.0.
>
> [1] https://issues.apache.org/jira/browse/BEAM-11460
>
> Regards,
> Alexey
>
> On 6 Jan 2021, at 18:57, Tao Li <t...@zillow.com> wrote:
>
> Hi Beam community,
>
> Quick question about ParquetIO
> <https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/parquet/ParquetIO.html>:
> is there a way to avoid specifying the Avro schema when reading Parquet
> files? The reason is that we may not know the Parquet schema until we
> read the files. In comparison, the Spark parquet reader
> <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html>
> does not require such a schema specification.
>
> Please advise. Thanks a lot!
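For readers finding this thread later: a sketch of the schema-free reading path that BEAM-11460 added (available from Beam 2.28.0). The file pattern is illustrative, and this assumes the `ParquetIO.parseGenericRecords` API introduced by that issue; it takes a parse function applied to each record, so no Avro schema is declared up front:

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;

public class ParseWithoutSchema {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // The schema is discovered from the files at read time; each
    // GenericRecord is handed to the parse function. Here we just
    // stringify the records as a placeholder transformation.
    PCollection<String> lines =
        p.apply(
            ParquetIO.parseGenericRecords(
                    (SerializableFunction<GenericRecord, String>) GenericRecord::toString)
                .from("gs://my-bucket/input/*.parquet")); // illustrative path

    p.run().waitUntilFinish();
  }
}
```

Note that, as Brian points out above, the result of a dynamic read like this is not automatically usable with schema-aware transforms or SqlTransform unless a Beam schema is attached afterwards.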