Hi Brian,

Please see my answers inline.

From: Brian Hulette <bhule...@google.com>
Reply-To: "user@beam.apache.org" <user@beam.apache.org>
Date: Wednesday, January 6, 2021 at 10:43 AM
To: user <user@beam.apache.org>
Subject: Re: Quick question regarding ParquetIO

Hey Tao,

It does look like BEAM-11460 could work for you. Note that it relies on a dynamic 
object, which won't work with schema-aware transforms and SqlTransform. This is 
likely not a problem for you, I just wanted to point it out.
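
In case it helps, here is a rough sketch of what reading without an up-front schema looks like with parseGenericRecords in 2.28.0. The input glob and the GenericRecord-to-String mapping are just placeholders for illustration; you would map to whatever type you need:

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;

public class ParquetParseExample {

  // Example mapping: Avro renders a GenericRecord as JSON via toString().
  static String toJson(GenericRecord record) {
    return record.toString();
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // parseGenericRecords reads the schema from the Parquet files themselves,
    // so no Avro schema has to be supplied at pipeline construction time.
    PCollection<String> json =
        p.apply(
            ParquetIO.parseGenericRecords(
                    new SerializableFunction<GenericRecord, String>() {
                      @Override
                      public String apply(GenericRecord record) {
                        return toJson(record);
                      }
                    })
                .from("/path/to/input-*.parquet")); // hypothetical glob

    p.run().waitUntilFinish();
  }
}
```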

[tao] I just need a PCollection<GenericRecord> from the IO. Then I can apply the 
code below to enable schema transforms (I have verified this code works).

setSchema(
      AvroUtils.toBeamSchema(schema),
      new TypeDescriptor<GenericRecord>() {},
      AvroUtils.getGenericRecordToRowFunction(AvroUtils.toBeamSchema(schema)),
      AvroUtils.getRowToGenericRecordFunction(schema))
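
For anyone following along, those AvroUtils conversion functions can also be exercised on their own outside a pipeline; a minimal sketch (the single-field "Listing" schema is made up for illustration):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.schemas.utils.AvroUtils;
import org.apache.beam.sdk.values.Row;

public class AvroRowRoundTrip {

  // Hypothetical Avro record schema, just for illustration.
  static final Schema AVRO_SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Listing\","
          + "\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");

  // GenericRecord -> Row, using the Beam schema derived from the Avro schema.
  public static Row toRow(GenericRecord record) {
    return AvroUtils.getGenericRecordToRowFunction(AvroUtils.toBeamSchema(AVRO_SCHEMA))
        .apply(record);
  }

  // Row -> GenericRecord, the inverse direction used by setSchema.
  public static GenericRecord toRecord(Row row) {
    return AvroUtils.getRowToGenericRecordFunction(AVRO_SCHEMA).apply(row);
  }
}
```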



Out of curiosity, for your use case would it be acceptable if Beam peeked at 
the files at pipeline construction time to determine the schema for you? This 
is what we're doing for the new IOs in the Python SDK's DataFrame API. They're 
based on the pandas read_* methods, and use those methods at construction time 
to determine the schema.

[tao] If I understand correctly, the behavior of the new DataFrame APIs you 
are mentioning is very similar to the Spark Parquet reader's behavior. If that's 
the case, then it's probably what I am looking for 😊



Brian

On Wed, Jan 6, 2021 at 10:13 AM Alexey Romanenko 
<aromanenko....@gmail.com> wrote:
Hi Tao,

This Jira [1] looks like exactly what you are asking for, but it was merged only 
recently (thanks to Anant Damle for working on this!), so it will be available 
only in Beam 2.28.0.

[1] https://issues.apache.org/jira/browse/BEAM-11460

Regards,
Alexey


On 6 Jan 2021, at 18:57, Tao Li <t...@zillow.com> wrote:

Hi beam community,

Quick question about ParquetIO 
(https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/parquet/ParquetIO.html). 
Is there a way to avoid specifying the Avro schema when reading Parquet files? 
The reason is that we may not know the Parquet schema until we read the files. 
In comparison, the Spark Parquet reader 
(https://spark.apache.org/docs/latest/sql-data-sources-parquet.html) does not 
require such a schema specification.

Please advise. Thanks a lot!
