Re: Quick question regarding ParquetIO

Alexey Romanenko Fri, 08 Jan 2021 04:48:53 -0800

Well, this is how I see it; let me explain.

Since every PCollection is required to have a Coder to materialize the 
intermediate data, we need to have a Coder for "PCollection<GenericRecord>" as 
well. If I’m not mistaken, for "GenericRecord" we usually set AvroCoder, which 
is based on an Avro (or Beam too?) schema.

Actually, it currently throws an exception if you try to use 
"parseGenericRecords()" with a PCollection<GenericRecord> as the output 
PCollection, since it can’t infer a Coder from the provided "parseFn". I guess 
it was done intentionally this way, and I doubt that we can have a proper Coder 
for PCollection<GenericRecord> without knowing a schema. Maybe some Avro experts 
here can add more on whether we can somehow overcome this.
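
To illustrate why a coder for GenericRecord seems to need a schema, here is a minimal sketch. It is a toy encoding, not Beam or Avro code: ToySchema, encode and decode are hypothetical names made up for this example. Like Avro's binary encoding, it writes only the field values in schema order and does not write field names, so the same bytes decode differently depending on which schema the reader assumes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only: none of these names are Beam or Avro APIs.
final class SchemaCoderSketch {

  /** Hypothetical stand-in for a schema: an ordered list of string-valued fields. */
  static final class ToySchema {
    final List<String> fields;

    ToySchema(String... fields) {
      this.fields = Arrays.asList(fields.clone());
    }
  }

  /** Writes the values in schema order; field names are NOT written to the bytes. */
  static byte[] encode(ToySchema schema, Map<String, String> record) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      for (String field : schema.fields) {
        out.writeUTF(record.get(field));
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Reads one value per schema field; the field names come only from the schema. */
  static Map<String, String> decode(ToySchema schema, byte[] data) {
    Map<String, String> record = new LinkedHashMap<>();
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      for (String field : schema.fields) {
        record.put(field, in.readUTF());
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return record;
  }

  public static void main(String[] args) {
    ToySchema writerSchema = new ToySchema("id", "name");
    Map<String, String> record = new LinkedHashMap<>();
    record.put("id", "42");
    record.put("name", "foo");

    byte[] bytes = encode(writerSchema, record);

    // Decoding with the schema used for writing recovers the record.
    System.out.println(decode(writerSchema, bytes)); // {id=42, name=foo}

    // Decoding the same bytes with a different schema silently mislabels the values.
    System.out.println(decode(new ToySchema("name", "id"), bytes)); // {name=42, id=foo}
  }
}
```

In this sketch the bytes carry no hint about which schema is the right one, which is the situation a coder would be in if it had to decode GenericRecord elements without being given a schema up front.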

> On 7 Jan 2021, at 19:44, Tao Li <t...@zillow.com> wrote:
> 
> Alexey,
>  
> Why do I need to set AvroCoder? I assume with BEAM-11460 we don’t need to 
> specify a schema when reading parquet files to get 
> a PCollection<GenericRecord>. Is my understanding correct? Am I missing 
> anything here?
>  
> Thanks!
>  
> From: Alexey Romanenko <aromanenko....@gmail.com>
> Reply-To: "user@beam.apache.org" <user@beam.apache.org>
> Date: Thursday, January 7, 2021 at 9:56 AM
> To: "user@beam.apache.org" <user@beam.apache.org>
> Subject: Re: Quick question regarding ParquetIO
>  
> If you want to get just a PCollection<GenericRecord> as output, then you would 
> still need to set AvroCoder, but which schema would you use in this case? 
> 
> 
>> On 6 Jan 2021, at 19:53, Tao Li <t...@zillow.com> wrote:
>>  
>> Hi Alexey,
>>  
>> Thank you so much for this info. I will definitely give it a try once 2.28 
>> is released.
>>  
>> Regarding this feature, it’s basically mimicking the feature from 
>> AvroIO: https://beam.apache.org/releases/javadoc/2.26.0/org/apache/beam/sdk/io/AvroIO.html
>>  
>> I have one more quick question regarding the “reading records of an unknown 
>> schema” scenario. In the sample code a PCollection<Foo> is returned, and 
>> parseGenericRecords requires parsing logic. What if I just want to 
>> get a PCollection<GenericRecord> instead of a specific class (e.g. Foo in 
>> the example)? I guess I can't just skip the ParquetIO.parseGenericRecords 
>> transform, so do I still have to specify dummy parsing logic like below? 
>> Thanks!
>>  
>> p.apply(AvroIO.parseGenericRecords(new SerializableFunction<GenericRecord, 
>> GenericRecord>() {
>>        // Identity parse function: returns each record unchanged.
>>        public GenericRecord apply(GenericRecord record) {
>>          return record;
>>        }
>>      }));
>>  
>> From: Alexey Romanenko <aromanenko....@gmail.com>
>> Reply-To: "user@beam.apache.org" <user@beam.apache.org>
>> Date: Wednesday, January 6, 2021 at 10:13 AM
>> To: "user@beam.apache.org" <user@beam.apache.org>
>> Subject: Re: Quick question regarding ParquetIO
>>  
>> Hi Tao,
>>  
>> This Jira [1] looks like exactly what you are asking for, but it was merged 
>> only recently (thanks to Anant Damle for working on this!), so it should be 
>> available only starting with Beam 2.28.0.
>>  
>> [1] https://issues.apache.org/jira/browse/BEAM-11460
>>  
>> Regards,
>> Alexey
>> 
>> 
>> 
>>> On 6 Jan 2021, at 18:57, Tao Li <t...@zillow.com> wrote:
>>>  
>>> Hi beam community,
>>>  
>>> Quick question about ParquetIO 
>>> <https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/parquet/ParquetIO.html>.
>>> Is there a way to avoid specifying the Avro schema when reading Parquet 
>>> files? The reason is that we may not know the Parquet schema until we read 
>>> the files. In comparison, the Spark Parquet reader 
>>> <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html>
>>> does not require such a schema specification.
>>>  
>>> Please advise. Thanks a lot!
