On Wed, Jan 6, 2021 at 11:07 AM Tao Li <t...@zillow.com> wrote:

> Hi Brian,
>
> Please see my answers inline.
>
> *From:* Brian Hulette <bhule...@google.com>
> *Reply-To:* "user@beam.apache.org" <user@beam.apache.org>
> *Date:* Wednesday, January 6, 2021 at 10:43 AM
> *To:* user <user@beam.apache.org>
> *Subject:* Re: Quick question regarding ParquetIO
>
> Hey Tao,
>
> It does look like BEAM-11460 could work for you. Note that it relies on a
> dynamic object, which won't work with schema-aware transforms and
> SqlTransform. It's likely this isn't a problem for you; I just wanted to
> point it out.
>
> [taol] I just need a PCollection<GenericRecord> from the IO. Then I can
> apply the code below to enable schema transforms (I have verified that
> this code works):
>
>     setSchema(
>         AvroUtils.toBeamSchema(schema),
>         new TypeDescriptor[GenericRecord]() {},
>         AvroUtils.getGenericRecordToRowFunction(AvroUtils.toBeamSchema(schema)),
>         AvroUtils.getRowToGenericRecordFunction(schema))
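For context, the snippet above can be written as compilable Java along these lines (a sketch, not the exact code from the thread: the wrapper method name is illustrative, and the Beam and Avro SDKs must be on the classpath):

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.schemas.utils.AvroUtils;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptor;

public class SchemaAttach {
  // Attach a Beam schema to a PCollection<GenericRecord> so that
  // schema-aware transforms (Select, SqlTransform, ...) can consume it.
  // The Avro schema must already be known at this point.
  static PCollection<GenericRecord> attachBeamSchema(
      PCollection<GenericRecord> records, org.apache.avro.Schema avroSchema) {
    return records.setSchema(
        AvroUtils.toBeamSchema(avroSchema),
        new TypeDescriptor<GenericRecord>() {},
        AvroUtils.getGenericRecordToRowFunction(AvroUtils.toBeamSchema(avroSchema)),
        AvroUtils.getRowToGenericRecordFunction(avroSchema));
  }
}
```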
> This requires specifying the Avro schema, doesn't it?
>
> Out of curiosity, for your use case would it be acceptable if Beam peeked
> at the files at pipeline construction time to determine the schema for
> you? This is what we're doing for the new IOs in the Python SDK's
> DataFrame API. They're based on the pandas read_* methods, and use those
> methods at construction time to determine the schema.
>
> [taol] If I understand correctly, the behavior of the new DataFrame APIs
> you are mentioning is very similar to the Spark parquet reader's
> behavior. If that's the case, then it's probably what I am looking for 😊
>
> Brian
>
> On Wed, Jan 6, 2021 at 10:13 AM Alexey Romanenko <aromanenko....@gmail.com>
> wrote:
>
> Hi Tao,
>
> This Jira [1] looks like exactly what you are asking for, but it was
> merged only recently (thanks to Anant Damle for working on this!), so it
> should be available only in Beam 2.28.0.
>
> [1] https://issues.apache.org/jira/browse/BEAM-11460
>
> Regards,
> Alexey
>
> On 6 Jan 2021, at 18:57, Tao Li <t...@zillow.com> wrote:
>
> Hi Beam community,
>
> Quick question about ParquetIO
> <https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/parquet/ParquetIO.html>:
> is there a way to avoid specifying the Avro schema when reading Parquet
> files? The reason is that we may not know the Parquet schema until we
> read the files. In comparison, the Spark parquet reader
> <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html>
> does not require such a schema specification.
>
> Please advise. Thanks a lot!
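For readers finding this thread later: a sketch of the schema-free reading path that BEAM-11460 added (available from Beam 2.28.0). The file pattern is illustrative, and this assumes the `ParquetIO.parseGenericRecords` API introduced by that issue; it takes a parse function applied to each record, so no Avro schema is declared up front:

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;

public class ParseWithoutSchema {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // The schema is discovered from the files at read time; each
    // GenericRecord is handed to the parse function. Here we just
    // stringify the records as a placeholder transformation.
    PCollection<String> lines =
        p.apply(
            ParquetIO.parseGenericRecords(
                    (SerializableFunction<GenericRecord, String>) GenericRecord::toString)
                .from("gs://my-bucket/input/*.parquet")); // illustrative path

    p.run().waitUntilFinish();
  }
}
```

Note that, as Brian points out above, the result of a dynamic read like this is not automatically usable with schema-aware transforms or SqlTransform unless a Beam schema is attached afterwards.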