Re: Quick question regarding ParquetIO

Tao Li Thu, 07 Jan 2021 10:40:54 -0800

Hi Brian,

You are right. The sample code still requires the Avro schema. Is it possible 
to retrieve the Avro schema from a PCollection<GenericRecord> (produced by a 
Parquet read without an Avro schema specification in Beam 2.28)? I have not had 
a chance to try it, but I guess we could take a GenericRecord instance and then 
get the schema attached to it?
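
A minimal sketch of that idea, using only the Avro API: every GenericRecord 
exposes the schema it was built with through getSchema. The record name 
"Listing" and its fields are hypothetical stand-ins for one element of the 
PCollection. In a pipeline the elements only exist at execution time, so a call 
like this would have to run inside a transform rather than at construction time.

```scala
import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.generic.{GenericData, GenericRecord}

object SchemaFromRecord {
  // Every Avro GenericRecord carries the schema it was built with.
  def schemaOf(record: GenericRecord): Schema = record.getSchema

  // Hypothetical stand-in for one element of the PCollection.
  def sampleRecord(): GenericRecord = {
    val schema = SchemaBuilder.record("Listing").fields()
      .requiredString("id")
      .requiredInt("price")
      .endRecord()
    val record = new GenericData.Record(schema)
    record.put("id", "a1")
    record.put("price", Int.box(100))
    record
  }

  def main(args: Array[String]): Unit =
    // Pretty-print the schema recovered from the record itself.
    println(schemaOf(sampleRecord()).toString(true))
}
```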

Thanks!

From: Brian Hulette <bhule...@google.com>
Date: Thursday, January 7, 2021 at 9:38 AM
To: Tao Li <t...@zillow.com>
Cc: "user@beam.apache.org" <user@beam.apache.org>
Subject: Re: Quick question regarding ParquetIO

On Wed, Jan 6, 2021 at 11:07 AM Tao Li <t...@zillow.com> wrote:
Hi Brian,

Please see my answers inline.

From: Brian Hulette <bhule...@google.com>
Reply-To: "user@beam.apache.org" <user@beam.apache.org>
Date: Wednesday, January 6, 2021 at 10:43 AM
To: user <user@beam.apache.org>
Subject: Re: Quick question regarding ParquetIO

Hey Tao,

It does look like BEAM-11460 could work for you. Note that it relies on a 
dynamic object, which won't work with schema-aware transforms and SqlTransform. 
It's likely this isn't a problem for you, I just wanted to point it out.
[tao] I just need a PCollection<GenericRecord> from IO. Then I can apply the 
code below to enable the schema transforms (I have verified this code works).

setSchema(
      AvroUtils.toBeamSchema(schema),
      new TypeDescriptor[GenericRecord]() {},
      AvroUtils.getGenericRecordToRowFunction(AvroUtils.toBeamSchema(schema)),
      AvroUtils.getRowToGenericRecordFunction(schema))

This requires specifying the Avro schema, doesn't it?

Out of curiosity, for your use case would it be acceptable if Beam peeked at 
the files at pipeline construction time to determine the schema for you? This 
is what we're doing for the new IOs in the Python SDK's DataFrame API. They're 
based on the pandas read_* methods, and use those methods at construction time 
to determine the schema.
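
For illustration only, "peeking" at a file at construction time could look 
roughly like the sketch below. It is not how Beam or the DataFrame API does it. 
It assumes parquet-hadoop and parquet-avro are on the classpath, and the object 
and method names are hypothetical. It reads only the Parquet footer and 
converts the stored schema to an Avro schema, which could then be passed to 
the setSchema call above.

```scala
import org.apache.avro.Schema
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroSchemaConverter
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

object PeekParquetSchema {
  // Reads the footer of one Parquet file and returns its schema as Avro.
  def avroSchemaOf(file: String): Schema = {
    val input = HadoopInputFile.fromPath(new Path(file), new Configuration())
    val reader = ParquetFileReader.open(input)
    try {
      // The footer metadata holds the Parquet MessageType for the file.
      val parquetSchema = reader.getFooter.getFileMetaData.getSchema
      new AvroSchemaConverter().convert(parquetSchema)
    } finally {
      reader.close()
    }
  }

  def main(args: Array[String]): Unit =
    // args(0) is the path to a sample Parquet file.
    println(avroSchemaOf(args(0)).toString(true))
}
```

This only inspects a single file, so it assumes all files matched by the read 
share that schema.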

[tao] If I understand correctly, the behavior of the new DataFrame APIs you 
mention is very similar to the Spark Parquet reader's behavior. If that's 
the case, then it's probably what I am looking for 😊

Brian

On Wed, Jan 6, 2021 at 10:13 AM Alexey Romanenko <aromanenko....@gmail.com> 
wrote:
Hi Tao,

This Jira issue [1] looks like exactly what you are asking for, but it was 
merged only recently (thanks to Anant Damle for working on this!), so it should 
be available starting with Beam 2.28.0.

[1] https://issues.apache.org/jira/browse/BEAM-11460

Regards,
Alexey

On 6 Jan 2021, at 18:57, Tao Li <t...@zillow.com> wrote:

Hi Beam community,

Quick question about ParquetIO 
<https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/parquet/ParquetIO.html>. 
Is there a way to avoid specifying the Avro schema when reading Parquet files? 
The reason is that we may not know the Parquet schema until we read the files. 
In comparison, the Spark Parquet reader 
<https://spark.apache.org/docs/latest/sql-data-sources-parquet.html> 
does not require such a schema specification.

Please advise. Thanks a lot!
