Re: Quick question regarding ParquetIO

Tao Li Wed, 13 Jan 2021 10:42:33 -0800

@Kobe Feng, thank you so much for the insights. I agree that it may be good
practice to read all sorts of file formats (e.g. Parquet, Avro, etc.) into a
PCollection<Row> and then perform the schema-aware transforms that you are
referring to.

The new DataFrame APIs for the Python SDK sound pretty cool, and I can imagine
they will save a lot of hassle during Beam app development. Hopefully they will
be added to the Java SDK as well.

From: Kobe Feng <flllbls...@gmail.com>
Reply-To: "user@beam.apache.org" <user@beam.apache.org>
Date: Friday, January 8, 2021 at 11:39 AM
To: "user@beam.apache.org" <user@beam.apache.org>
Subject: Re: Quick question regarding ParquetIO

Tao,
I'm not an expert, but your intuition is good: what you want is schema-aware
(schema-based) transformations in Beam, not only for IO but also for other
DoFns, etc., and possibly schema evolution in the future as well.

This is how I understand it and have explained it elsewhere. Spark and Flink
leverage internal/built-in types (e.g. the Catalyst struct type) for built-in
operators as much as possible, and infer the schema that the IOs can convert
to. Beam, by contrast, tries to handle any type during transforms, so that
people can migrate existing code to Beam (if you run a Spark mapPartitions
function with your own type, an Encoder can't be avoided either, right?).

Also, yes, we could use Beam's own type "Row" for all transformations, convert
all input/output types such as Parquet, Avro, ORC, etc. on the IO side, and
then infer the schema in built-in operators based on the Row type when we know
they operate on internal types. That is how to avoid the coder or an explicit
schema there. Going further, an IO could provide schema registry capability,
and transforms would then look up the schema when necessary for evolution.

I saw that Beam put schema-based transformations among its goals last year.
That will be convenient, since people would normally rather use built-in types
than provide coders for their own types for downstream operators until they
have to. I think that is why the DataFrame APIs for the Python SDK are here.

Kobe

On Fri, Jan 8, 2021 at 9:34 AM Tao Li <t...@zillow.com> wrote:
Thanks, Alexey, for your explanation. That’s also what I was thinking. Parquet
files already have the schema built in, so it might be feasible to infer a
coder automatically (like the Spark parquet reader does). It would be great if
some experts could chime in here. @Brian Hulette already mentioned that the
community is working on new DataFrame APIs in the Python SDK, which are based
on the pandas methods and use those methods at construction time to determine
the schema. I think this is very close to the schema inference we have been
discussing. I am not sure whether it will be available in the Java SDK, though.

Regarding BEAM-11460, it looks like it may not totally solve my problem. As
@Alexey Romanenko mentioned, we may still need to know the Avro or Beam schema
for the operations that follow the parquet read. A dumb question: with
BEAM-11460, after we get a PCollection<GenericRecord> from the parquet read
(without needing to specify an Avro schema), is it possible to get the attached
Avro schema from a GenericRecord element of this PCollection<GenericRecord>?
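
On that last question: as far as the Avro API goes, GenericRecord extends
GenericContainer, which declares getSchema(), so each element carries its own
schema at run time. A minimal sketch using plain Avro outside of any pipeline
(the "Listing" schema, its fields, and the RecordSchemaExample class are
hypothetical and only for illustration):

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;

// Hypothetical example class, not part of Beam or Avro.
class RecordSchemaExample {
  // Returns the Avro schema that the record itself carries.
  static Schema schemaOf(GenericRecord record) {
    return record.getSchema();
  }

  public static void main(String[] args) {
    // Made-up schema and values, standing in for a record read from a file.
    Schema schema = SchemaBuilder.record("Listing").fields()
        .requiredString("city")
        .requiredInt("bedrooms")
        .endRecord();
    GenericRecord record = new GenericRecordBuilder(schema)
        .set("city", "Seattle")
        .set("bedrooms", 3)
        .build();

    // The schema can be recovered from the element alone.
    System.out.println(schemaOf(record).getFullName());
  }
}
```

Note that this only helps per element at run time, e.g. inside a DoFn. As far
as I can tell it does not by itself give Beam a schema at pipeline construction
time, which is when the coder for the PCollection has to be chosen.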

Really appreciate it if you can help clarify my questions. Thanks!

From: Alexey Romanenko <aromanenko....@gmail.com>
Reply-To: "user@beam.apache.org" <user@beam.apache.org>
Date: Friday, January 8, 2021 at 4:48 AM
To: "user@beam.apache.org" <user@beam.apache.org>
Subject: Re: Quick question regarding ParquetIO

Well, this is how I see it, let me explain.

Since every PCollection is required to have a Coder to materialize the
intermediate data, we need to have a coder for "PCollection<GenericRecord>" as
well. If I’m not mistaken, for "GenericRecord" we used to set an AvroCoder that
is based on the Avro (or Beam too?) schema.

Actually, it currently throws an exception if you try to use
"parseGenericRecords()" with a PCollection<GenericRecord> as the output
PCollection, since it can’t infer a Coder from the provided "parseFn". I guess
it was done this way intentionally, and I doubt that we can have a proper coder
for PCollection<GenericRecord> without knowing a schema. Maybe some Avro
experts here can say more about whether we can somehow overcome this.
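
To illustrate why a coder needs the schema, here is a toy sketch in plain Java.
It is not Beam's or Avro's implementation; ToySchemaCoder and FieldType are
hypothetical names. It writes field values back to back with no field names or
type tags, much as Avro's binary encoding omits them, so the bytes can only be
split back into fields by a decoder that already holds the same schema:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical toy coder, not a Beam or Avro class.
class ToySchemaCoder {
  enum FieldType { INT, STRING }

  private final List<FieldType> schema;

  ToySchemaCoder(List<FieldType> schema) {
    this.schema = schema;
  }

  // Writes the field values in schema order, with no names or type tags.
  byte[] encode(List<Object> record) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      for (int i = 0; i < schema.size(); i++) {
        switch (schema.get(i)) {
          case INT:
            out.writeInt((Integer) record.get(i));
            break;
          case STRING:
            out.writeUTF((String) record.get(i));
            break;
        }
      }
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // The schema is the only thing telling the decoder how to split the bytes.
  List<Object> decode(byte[] encoded) {
    try {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(encoded));
      List<Object> record = new ArrayList<>();
      for (FieldType type : schema) {
        switch (type) {
          case INT:
            record.add(in.readInt());
            break;
          case STRING:
            record.add(in.readUTF());
            break;
        }
      }
      return record;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    ToySchemaCoder coder =
        new ToySchemaCoder(Arrays.asList(FieldType.STRING, FieldType.INT));
    byte[] encoded = coder.encode(Arrays.<Object>asList("Ann", 7));
    System.out.println(coder.decode(encoded));
  }
}
```

A coder built without a schema would have no field list to iterate over in
decode(), which is the situation described above for
PCollection<GenericRecord>.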

On 7 Jan 2021, at 19:44, Tao Li <t...@zillow.com> wrote:

Alexey,

Why do I need to set AvroCoder? I assume with BEAM-11460 we don’t need to
specify a schema when reading parquet files to get a PCollection<GenericRecord>.
Is my understanding correct? Am I missing anything here?

Thanks!

From: Alexey Romanenko <aromanenko....@gmail.com>
Reply-To: "user@beam.apache.org" <user@beam.apache.org>
Date: Thursday, January 7, 2021 at 9:56 AM
To: "user@beam.apache.org" <user@beam.apache.org>
Subject: Re: Quick question regarding ParquetIO

If you want to get just a PCollection<GenericRecord> as output then you would 
still need to set AvroCoder, but which schema to use in this case?

On 6 Jan 2021, at 19:53, Tao Li <t...@zillow.com> wrote:

Hi Alexey,

Thank you so much for this info. I will definitely give it a try once 2.28 is 
released.

Regarding this feature, it’s basically mimicking the feature from AvroIO:
https://beam.apache.org/releases/javadoc/2.26.0/org/apache/beam/sdk/io/AvroIO.html

I have one more quick question regarding the “reading records of an unknown
schema” scenario. In the sample code a PCollection<Foo> is returned, and
parseGenericRecords requires parsing logic. What if I just want to get a
PCollection<GenericRecord> instead of a specific class (e.g. Foo in the
example)? I guess I can just skip the ParquetIO.parseGenericRecords transform?
Or do I still have to specify dummy parsing logic like below? Thanks!

p.apply(AvroIO.parseGenericRecords(new SerializableFunction<GenericRecord, GenericRecord>() {
       // Dummy (identity) parse function: return each record unchanged.
       @Override
       public GenericRecord apply(GenericRecord record) {
         return record;
       }
     }));

From: Alexey Romanenko <aromanenko....@gmail.com>
Reply-To: "user@beam.apache.org" <user@beam.apache.org>
Date: Wednesday, January 6, 2021 at 10:13 AM
To: "user@beam.apache.org" <user@beam.apache.org>
Subject: Re: Quick question regarding ParquetIO

Hi Tao,

This jira [1] looks like exactly what you are asking for, but it was merged
only recently (thanks to Anant Damle for working on this!), so it should be
available only as of Beam 2.28.0.

[1] https://issues.apache.org/jira/browse/BEAM-11460

Regards,
Alexey

On 6 Jan 2021, at 18:57, Tao Li <t...@zillow.com> wrote:

Hi beam community,

Quick question about ParquetIO
(https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/parquet/ParquetIO.html).
Is there a way to avoid specifying the avro schema when reading parquet files?
The reason is that we may not know the parquet schema until we read the files.
In comparison, the Spark parquet reader
(https://spark.apache.org/docs/latest/sql-data-sources-parquet.html) does not
require such a schema specification.

Please advise. Thanks a lot!

--
Yours Sincerely
Kobe Feng
