Potential issue with the flink runner in streaming mode

2020-11-24 Thread Tao Li
Hi Beam community,

I am running into a problem with 
“org.apache.beam:beam-runners-flink-1.11:2.25.0” and 
“org.apache.beam:beam-runners-flink-1.10:2.25.0”. I am doing some local testing 
with the Flink runners in embedded mode. The problem is that I cannot save data 
into local files using those artifact versions. However, when I switched to 
“org.apache.beam:beam-runners-flink-1.10:2.24.0”, it worked fine and the output 
files were saved successfully.

I am basically generating unbounded data in memory using the GenerateSequence 
transform and saving it into local files. Here is the code that generates the 
unbounded data in memory:

Pipeline.apply(GenerateSequence.from(0).withRate(1, new Duration(10)))
  .apply(Window.into[java.lang.Long](FixedWindows.of(Duration.standardSeconds(1))))
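
For context, here is a minimal, self-contained Java sketch of this kind of pipeline (the snippets in this thread are Scala-flavored); the single-field record schema, the /tmp/output path, and the shard count are placeholders rather than the exact job:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class StreamingParquetWrite {

  // Hypothetical single-field schema, used only for this sketch.
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"Seq\",\"fields\":[{\"name\":\"value\",\"type\":\"long\"}]}";

  // Converts each generated Long into a GenericRecord of the schema above.
  static class ToGenericRecord extends DoFn<Long, GenericRecord> {
    private transient Schema schema;

    @Setup
    public void setup() {
      schema = new Schema.Parser().parse(SCHEMA_JSON);
    }

    @ProcessElement
    public void process(@Element Long value, OutputReceiver<GenericRecord> out) {
      out.output(new GenericRecordBuilder(schema).set("value", value).build());
    }
  }

  public static void main(String[] args) {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    p.apply(GenerateSequence.from(0).withRate(1, Duration.millis(10)))           // unbounded source
        .apply(Window.<Long>into(FixedWindows.of(Duration.standardSeconds(1))))  // 1-second windows
        .apply(ParDo.of(new ToGenericRecord()))
        .setCoder(AvroCoder.of(GenericRecord.class, schema))
        .apply(FileIO.<GenericRecord>write()
            .withNumShards(10)
            .via(ParquetIO.sink(schema))
            .to("/tmp/output")                                                   // placeholder path
            .withSuffix(".parquet"));

    p.run().waitUntilFinish();
  }
}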

I compared the logs and noticed that there is no write operation to be found in the 
logs with “beam-runners-flink-1.11:2.25.0” and “beam-runners-flink-1.10:2.25.0”. 
With the working version “beam-runners-flink-1.10:2.24.0”, I could find the logs 
below, which were clearly performing the write operation:

[FileIO.Write/WriteFiles/GatherTempFileResults/Reshuffle.ViaRandomKey/Values/Values/Map/ParMultiDo(Anonymous)
 -> 
FileIO.Write/WriteFiles/FinalizeTempFileBundles/Finalize/ParMultiDo(Finalize) 
-> FileIO.Write/WriteFiles/FinalizeTempFileBundles/Reshuffle.ViaRandomKey/Pair 
with random key/ParMultiDo(AssignShard) (9/12)] INFO 
org.apache.beam.sdk.io.WriteFiles - Finalizing 1 file results
[FileIO.Write/WriteFiles/GatherTempFileResults/Reshuffle.ViaRandomKey/Values/Values/Map/ParMultiDo(Anonymous)
 -> 
FileIO.Write/WriteFiles/FinalizeTempFileBundles/Finalize/ParMultiDo(Finalize) 
-> FileIO.Write/WriteFiles/FinalizeTempFileBundles/Reshuffle.ViaRandomKey/Pair 
with random key/ParMultiDo(AssignShard) (9/12)] INFO 
org.apache.beam.sdk.io.FileBasedSink - Will copy temporary file 
FileResult{tempFilename=/Users/taol/data/output/.temp-beam-819dbd7c-b9f7-4c8c-9d8b-20091d2eef94/010abb5e-92b0-4e95-a85d-30984e769fe2,
 shard=2, window=[2020-11-24T01:33:59.000Z..2020-11-24T01:34:00.000Z), 
paneInfo=PaneInfo{isFirst=true, isLast=true, timing=ON_TIME, index=0, 
onTimeIndex=0}} to final location 
/Users/taol/data/output/output-2020-11-24T01:33:59.000Z-2020-11-24T01:34:00.000Z-2-of-00010.parquet
[FileIO.Write/WriteFiles/GatherTempFileResults/Reshuffle.ViaRandomKey/Values/Values/Map/ParMultiDo(Anonymous)
 -> 
FileIO.Write/WriteFiles/FinalizeTempFileBundles/Finalize/ParMultiDo(Finalize) 
-> FileIO.Write/WriteFiles/FinalizeTempFileBundles/Reshuffle.ViaRandomKey/Pair 
with random key/ParMultiDo(AssignShard) (9/12)] INFO 
org.apache.beam.sdk.io.FileBasedSink - Will remove known temporary file 
/Users/taol/data/output/.temp-beam-819dbd7c-b9f7-4c8c-9d8b-20091d2eef94/010abb5e-92b0-4e95-a85d-30984e769fe2



Is this a known issue with “beam-runners-flink-1.11:2.25.0” and 
“beam-runners-flink-1.10:2.25.0”? Can someone please take a look at this issue? 
Thanks so much!




Re: Potential issue with the flink runner in streaming mode

2020-11-24 Thread Tao Li
Yep, it works with “--experiments=use_deprecated_read”. Is this a regression?

From: Kyle Weaver 
Reply-To: "user@beam.apache.org" 
Date: Tuesday, November 24, 2020 at 11:08 AM
To: "user@beam.apache.org" 
Subject: Re: Potential issue with the flink runner in streaming mode

I wonder if this issue is related to the migration to Splittable DoFn [1]. Can 
you try running your pipeline again with the option 
--experiments=use_deprecated_read?

[1] https://beam.apache.org/blog/beam-2.25.0/
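
For later readers, here is a minimal sketch of how that flag can be set programmatically, assuming the options are built from command-line args in the usual way; passing --experiments=use_deprecated_read on the command line is equivalent:

import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.ExperimentalOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DeprecatedReadOptions {
  // Builds Flink pipeline options from args and forces the use_deprecated_read experiment on.
  static Pipeline createPipeline(String[] args) {
    FlinkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(FlinkPipelineOptions.class);
    ExperimentalOptions.addExperiment(options.as(ExperimentalOptions.class), "use_deprecated_read");
    return Pipeline.create(options);
  }
}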





Re: Potential issue with the flink runner in streaming mode

2020-11-24 Thread Tao Li
Thanks @Kyle Weaver for filing the JIRA.

@Boyuan Zhang your understanding is correct. And as discussed with Kyle, adding 
“--experiments=use_deprecated_read” worked with 2.24.

Here is the write part, if you are interested. It basically saves Parquet files, 
and I am using the local filesystem for that.

data.apply(FileIO.write[GenericRecord]()
  .withNumShards(10)
  .via(ParquetIO.sink(schema))
  .to(path)
  .withSuffix(".parquet"))

From: Boyuan Zhang 
Reply-To: "user@beam.apache.org" 
Date: Tuesday, November 24, 2020 at 2:54 PM
To: "user@beam.apache.org" 
Subject: Re: Potential issue with the flink runner in streaming mode

Hi Tao,

I want to make sure that I understand your problem correctly: with version 2.25.0 
you are not able to write to the sink, but with 2.24.0 you can. Do I understand 
correctly? Also, could you please share the write portion of your pipeline as well?

On Tue, Nov 24, 2020 at 2:50 PM Kyle Weaver <kcwea...@google.com> wrote:
Yeah, it looks like a regression. I filed a JIRA issue to track it: 
https://issues.apache.org/jira/browse/BEAM-11341


Re: Potential issue with the flink runner in streaming mode

2020-11-24 Thread Tao Li
Correction: And as discussed with Kyle, adding 
“--experiments=use_deprecated_read” worked with 2.25.



Re: Potential issue with the flink runner in streaming mode

2020-11-25 Thread Tao Li
It’s a streaming pipeline. I am using the code below to generate an unbounded data 
source, and Beam is running in streaming mode:
Pipeline.apply(GenerateSequence.from(0).withRate(1, new Duration(10)))
  .apply(Window.into[java.lang.Long](FixedWindows.of(Duration.standardSeconds(1))))


From: Boyuan Zhang 
Date: Tuesday, November 24, 2020 at 3:27 PM
To: Tao Li 
Cc: "user@beam.apache.org" , Kyle Weaver 

Subject: Re: Potential issue with the flink runner in streaming mode

And is it a batch pipeline or a streaming pipeline?


Re: Potential issue with the flink runner in streaming mode

2020-11-30 Thread Tao Li
Thanks Boyuan!

From: Boyuan Zhang 
Date: Wednesday, November 25, 2020 at 10:52 AM
To: Tao Li 
Cc: "user@beam.apache.org" , Kyle Weaver 

Subject: Re: Potential issue with the flink runner in streaming mode

Thanks for reporting this issue, Tao. That's all I need for debugging, and 
hopefully we can fix it soon.


Quick question regarding production readiness of ParquetIO

2020-11-30 Thread Tao Li
Hi Beam community,

According to this link, ParquetIO is still considered experimental: 
https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/parquet/ParquetIO.html

Does it mean it’s not yet ready for prod usage? If that’s the case, when will 
it be ready?

Also, is there any known performance/scalability/reliability issue with 
ParquetIO?

Thanks a lot!


Quick question about KafkaIO.Write

2020-12-08 Thread Tao Li
Hi Beam community,

I got a quick question about the withValueSerializer() method of the KafkaIO.Write 
class: 
https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/kafka/KafkaIO.Write.html

The withValueSerializer method does not support passing in a serializer provider. 
The problem with lacking that functionality is that I cannot use the Kafka schema 
registry to fetch the schema for serialization.

However, at the same time, the KafkaIO.Read withKeyDeserializer method supports 
specifying a deserializer provider: 
https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/kafka/KafkaIO.Read.html#withKeyDeserializer-org.apache.beam.sdk.io.kafka.DeserializerProvider-

Is this a gap for KafkaIO.Write or is it by design? Is there a workaround 
to specify the schema registry info for KafkaIO.Write?

Thanks so much!


Re: Quick question about KafkaIO.Write

2020-12-10 Thread Tao Li
@Alexey Romanenko thanks so much for your suggestions.

Actually, I found that the code below seems to work.

KafkaIO
  .write()
  .withBootstrapServers(bootstrapServers)
  .withTopic(topicName)
  .withValueSerializer((Class) KafkaAvroSerializer.class)
  .withProducerConfigUpdates(ImmutableMap.of("schema.registry.url", schemaRegistryUrl))

Thanks, and I hope there will be more great improvements coming in the future, as 
you mentioned 😊

From: Alexey Romanenko 
Reply-To: "user@beam.apache.org" 
Date: Wednesday, December 9, 2020 at 9:08 AM
To: "user@beam.apache.org" 
Subject: Re: Quick question about KafkaIO.Write

AFAIR, DeserializerProvider was added to KafkaIO along with Confluent Schema 
Registry support in KafkaIO.Read, to provide a universal way to use different 
deserializers (Local and ConfluentSchemaRegistry for the moment).

Regarding the Write part, I believe we can do a similar refactoring. Feel free to 
provide a patch; we can help with review/testing/advice.

For now, just an idea for a workaround (I didn’t test it): fetch your schema from 
the Schema Registry in advance yourself with SchemaRegistryClient, use it to create 
the Avro records to write (e.g. GenericRecord), and then set KafkaAvroSerializer as 
the value serializer and specify “schema.registry.url” in the producer properties.
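
A minimal sketch of the schema-fetch step in that workaround, using the Confluent client; the registry URL and the default “<topic>-value” subject naming are assumptions:

import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import org.apache.avro.Schema;

public class RegistrySchemas {
  // Fetches the latest registered value schema for a topic, assuming the default
  // TopicNameStrategy subject naming ("<topic>-value").
  static Schema fetchValueSchema(String registryUrl, String topic) throws Exception {
    CachedSchemaRegistryClient client = new CachedSchemaRegistryClient(registryUrl, 100);
    String schemaJson = client.getLatestSchemaMetadata(topic + "-value").getSchema();
    return new Schema.Parser().parse(schemaJson);
  }
}

The GenericRecords to be written can then be built against the returned schema, with KafkaAvroSerializer and “schema.registry.url” configured as in the snippet earlier in this thread.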





Question regarding GroupByKey operator on unbounded data

2020-12-10 Thread Tao Li
Hi Beam community,

I got a quick question about the GroupByKey operator. According to this doc 
(https://beam.apache.org/documentation/programming-guide/#groupbykey), if we are 
using an unbounded PCollection, it is required to specify either non-global 
windowing or an aggregation trigger in order to perform a GroupByKey operation.

In comparison, the KeyBy operator from Flink 
(https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/) does 
not have such a hard requirement for streamed data.

In our use case, we do need to query all historical streamed data and group by 
keys. KeyBy from Flink satisfies our need, but Beam's GroupByKey does not. I 
thought about applying a sliding window with a very large size (say 1 year), so 
that we can query the past year's data, but I am not sure whether this is feasible 
or a good practice.

So what would the Beam solution be to implement this business logic? Does Beam 
support processing a relatively long history of an unbounded PCollection?
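
For reference, here is a minimal sketch of one way to satisfy that requirement while keeping everything in a single window: stay in the global window but attach a repeated trigger. The String/Long key and value types and the one-minute processing-time firing interval are placeholder assumptions:

import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class UnboundedGrouping {
  // Makes GroupByKey legal on an unbounded keyed PCollection by adding a repeated
  // processing-time trigger to the global window.
  static PCollection<KV<String, Iterable<Long>>> groupUnbounded(PCollection<KV<String, Long>> keyed) {
    return keyed
        .apply(Window.<KV<String, Long>>into(new GlobalWindows())
            .triggering(Repeatedly.forever(
                AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(1))))
            .discardingFiredPanes()
            .withAllowedLateness(Duration.ZERO))
        .apply(GroupByKey.<String, Long>create());
  }
}

Whether something like this or the stateful-processing approach discussed later in the thread fits better depends on how the grouped history will be consumed.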

Thanks so much!



Re: Question regarding GroupByKey operator on unbounded data

2020-12-11 Thread Tao Li
Hi @Reuven Lax, basically we have a Flink app that does stream processing. It uses 
a KeyBy operation to generate a keyed stream. Since we need to query all historical 
data of the input, we are not specifying a window function or a trigger in this 
Flink app, which is fine.

Now we would like to convert this Flink app to a Beam app. The problem is that for 
an unbounded PCollection, Beam requires either non-global windowing or an 
aggregation trigger to perform a GroupByKey operation.
I was thinking about applying a sliding window with a huge size (say 1 year) to 
accommodate this Beam requirement, but I am not sure whether this is feasible or a 
good practice.
So what's your recommendation to solve this problem? Thanks!


From: Reuven Lax 
Reply-To: "user@beam.apache.org" 
Date: Thursday, December 10, 2020 at 3:07 PM
To: user 
Cc: Mehmet Emre Sahin , Ying-Chang Cheng 

Subject: Re: Question regarding GroupByKey operator on unbounded data

Can you explain more about what exactly you are trying to do?




Re: Question regarding GroupByKey operator on unbounded data

2020-12-11 Thread Tao Li
Would Combine.PerKey work for my case? It seems that it does not require a window 
function.

At the same time, it seems that this operator is typically used to produce an 
aggregated output (e.g. a count) instead of the list of values, so I am not sure 
whether it is suitable for my use case.

Please advise. Thanks!




Re: Question regarding GroupByKey operator on unbounded data

2020-12-12 Thread Tao Li
Sorry, I think I had some misunderstanding about the KeyBy API from Flink. It is 
not exactly equivalent to GroupByKey from Beam, so please ignore my question and 
this email thread. Thanks for the help though 😊




Re: Question regarding GroupByKey operator on unbounded data

2020-12-16 Thread Tao Li
@Jan Lukavský yes, that's exactly what I figured out a few days back. I was using 
the WithKeys transform to create a PCollection<KV<K, V>>. Thanks for your help!

From: Jan Lukavský 
Reply-To: "user@beam.apache.org" 
Date: Monday, December 14, 2020 at 2:05 AM
To: "user@beam.apache.org" 
Subject: Re: Question regarding GroupByKey operator on unbounded data


Hi,

I think what you might be looking for is "stateful processing"; please have a look 
at [1]. Note that the input to a stateful DoFn must be of type KV<K, V>, which then 
ensures behavior similar to Flink's keyBy.

Best,

 Jan

[1] https://beam.apache.org/blog/stateful-processing/
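
A minimal sketch of that idea, assuming String keys and Long values; it emits the buffered history on every element purely for illustration (a real pipeline would usually flush on a timer instead):

import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Buffers every value seen per key; state is partitioned by key (and window), which is
// what gives the keyBy-like behavior described above.
class BufferPerKeyFn extends DoFn<KV<String, Long>, KV<String, Iterable<Long>>> {

  @StateId("buffer")
  private final StateSpec<BagState<Long>> bufferSpec = StateSpecs.bag();

  @ProcessElement
  public void process(
      @Element KV<String, Long> element,
      @StateId("buffer") BagState<Long> buffer,
      OutputReceiver<KV<String, Iterable<Long>>> out) {
    buffer.add(element.getValue());
    // Emit the full history seen so far for this key (illustrative only).
    out.output(KV.of(element.getKey(), buffer.read()));
  }
}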

Quick question regarding ParquetIO

2021-01-06 Thread Tao Li
Hi beam community,

Quick question about ParquetIO 
(https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/parquet/ParquetIO.html). 
Is there a way to avoid specifying the Avro schema when reading Parquet files? The 
reason is that we may not know the Parquet schema until we read the files. In 
comparison, the Spark Parquet reader 
(https://spark.apache.org/docs/latest/sql-data-sources-parquet.html) does not 
require such a schema specification.

Please advise. Thanks a lot!


Re: Quick question regarding ParquetIO

2021-01-06 Thread Tao Li
Hi Alexey,

Thank you so much for this info. I will definitely give it a try once 2.28 is 
released.

Regarding this feature, it’s basically mimicking the feature from AvroIO: 
https://beam.apache.org/releases/javadoc/2.26.0/org/apache/beam/sdk/io/AvroIO.html

I have one more quick question regarding the “reading records of an unknown 
schema” scenario. In the sample code a PCollection<Foo> is being returned, and 
parseGenericRecords requires parsing logic. What if I just want to get a 
PCollection<GenericRecord> instead of a specific class (e.g. Foo in the example)? 
I guess I can just skip the ParquetIO.parseGenericRecords transform? So do I still 
have to specify dummy parsing logic like the below? Thanks!

p.apply(AvroIO.parseGenericRecords(new SerializableFunction<GenericRecord, GenericRecord>() {
   public GenericRecord apply(GenericRecord record) {
     return record;
   }
}))

From: Alexey Romanenko 
Reply-To: "user@beam.apache.org" 
Date: Wednesday, January 6, 2021 at 10:13 AM
To: "user@beam.apache.org" 
Subject: Re: Quick question regarding ParquetIO

Hi Tao,

This JIRA [1] looks like exactly what you are asking for, but it was merged only 
recently (thanks to Anant Damle for working on this!) and it should be available 
only in Beam 2.28.0.

[1] https://issues.apache.org/jira/browse/BEAM-11460

Regards,
Alexey





Re: Quick question regarding ParquetIO

2021-01-06 Thread Tao Li
Hi Brian,

Please see my answers inline.

From: Brian Hulette 
Reply-To: "user@beam.apache.org" 
Date: Wednesday, January 6, 2021 at 10:43 AM
To: user 
Subject: Re: Quick question regarding ParquetIO

Hey Tao,

It does look like BEAM-11460 could work for you. Note that it relies on a dynamic 
object, which won't work with schema-aware transforms and SqlTransform. It's 
likely this isn't a problem for you; I just wanted to point it out.

[tao] I just need a PCollection<GenericRecord> from the IO. Then I can apply the 
code below to enable the schema transforms (I have verified this code works).

setSchema(
  AvroUtils.toBeamSchema(schema),
  new TypeDescriptor[GenericRecord]() {},
  AvroUtils.getGenericRecordToRowFunction(AvroUtils.toBeamSchema(schema)),
  AvroUtils.getRowToGenericRecordFunction(schema))



Out of curiosity, for your use case would it be acceptable if Beam peeked at 
the files at pipeline construction time to determine the schema for you? This 
is what we're doing for the new IOs in the Python SDK's DataFrame API. They're 
based on the pandas read_* methods, and use those methods at construction time 
to determine the schema.

[taol] If I understand correctly, the behavior of the new DataFrame APIs you are 
mentioning is very similar to the Spark Parquet reader's behavior. If that's the 
case, then it's probably what I am looking for 😊



Brian




Re: Quick question regarding ParquetIO

2021-01-07 Thread Tao Li
Hi Brian,

You are right. The sample code still requires the avro schema. Is it possible to 
retrieve the avro schema from the PCollection<GenericRecord> (which comes from a 
parquet read without an avro schema specification, with beam 2.28)? I did not have 
a chance to give it a try, but I guess we can retrieve a GenericRecord instance 
and then get the schema attached to it?

Thanks!
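
(As a side note: each GenericRecord does carry its writer schema, so a sketch along these lines, a hypothetical DoFn not taken from this thread, can pull the schema out of an element at processing time; it does not help at pipeline construction time, where a Coder is still needed.)

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.transforms.DoFn;

class ExtractAvroSchemaFn extends DoFn<GenericRecord, String> {
  @ProcessElement
  public void processElement(@Element GenericRecord record, OutputReceiver<String> out) {
    // getSchema() returns the org.apache.avro.Schema attached to this record.
    out.output(record.getSchema().toString());
  }
}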

From: Brian Hulette 
Date: Thursday, January 7, 2021 at 9:38 AM
To: Tao Li 
Cc: "user@beam.apache.org" 
Subject: Re: Quick question regarding ParquetIO



On Wed, Jan 6, 2021 at 11:07 AM Tao Li 
mailto:t...@zillow.com>> wrote:
Hi Brian,

Please see my answers inline.

From: Brian Hulette mailto:bhule...@google.com>>
Reply-To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Date: Wednesday, January 6, 2021 at 10:43 AM
To: user mailto:user@beam.apache.org>>
Subject: Re: Quick question regarding ParquetIO

Hey Tao,

It does look like BEAM-11460 could work for you. Note that it relies on a dynamic 
object, which won't work with schema-aware transforms and SqlTransform. It's 
likely this isn't a problem for you, I just wanted to point it out.
[tao] I just need a PCollection<GenericRecord> from the IO. Then I can apply the 
code below to enable the schema-aware transforms (I have verified this code works).

setSchema(
  AvroUtils.toBeamSchema(schema),
  new TypeDescriptor[GenericRecord]() {},
  AvroUtils.getGenericRecordToRowFunction(AvroUtils.toBeamSchema(schema)),
  AvroUtils.getRowToGenericRecordFunction(schema))

This requires specifying the Avro schema, doesn't it?




Out of curiosity, for your use-case would it be acceptable if Beam peeked at 
the files at pipeline construction time to determine the schema for you? This 
is what we're doing for the new IOs in the Python SDK's DataFrame API. They're 
based on the pandas read_* methods, and use those methods at construction time 
to determine the schema.

[taol] If I understand correctly, the behavior of the new DataFrame APIs you 
are mentioning is very similar to the spark parquet reader’s behavior. If that’s 
the case, then it’s probably what I am looking for 😊



Brian

On Wed, Jan 6, 2021 at 10:13 AM Alexey Romanenko 
mailto:aromanenko@gmail.com>> wrote:
Hi Tao,

This jira [1] looks exactly what you are asking but it was merged recently 
(thanks to Anant Damle for working on this!) and it should be available only in 
Beam 2.28.0.

[1] 
https://issues.apache.org/jira/browse/BEAM-11460

Regards,
Alexey

On 6 Jan 2021, at 18:57, Tao Li mailto:t...@zillow.com>> wrote:

Hi beam community,

Quick question about 
ParquetIO<https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/parquet/ParquetIO.html>.
 Is there a way to avoid specifying the avro schema when reading parquet files? 
The reason is that we may not know the parquet schema until we read the files. 
In comparison, spark parquet 
reader<https://spark.apache.org/docs/latest/sql-data-sources-parquet.html>
 does not require such a schema specification.

Please advise. Thanks a lot!



Re: Quick question regarding ParquetIO

2021-01-07 Thread Tao Li
Alexey,

Why do I need to set AvroCoder? I assume with BEAM-11460 we don’t need to 
specify a schema when reading parquet files to get a 
PCollection<GenericRecord>. Is my understanding correct? Am I missing anything 
here?

Thanks!

From: Alexey Romanenko 
Reply-To: "user@beam.apache.org" 
Date: Thursday, January 7, 2021 at 9:56 AM
To: "user@beam.apache.org" 
Subject: Re: Quick question regarding ParquetIO

If you want to get just a PCollection<GenericRecord> as output then you would 
still need to set AvroCoder, but which schema to use in this case?


On 6 Jan 2021, at 19:53, Tao Li mailto:t...@zillow.com>> wrote:

Hi Alexey,

Thank you so much for this info. I will definitely give it a try once 2.28 is 
released.

Regarding this feature, it’s basically mimicking the feature from 
AvroIO: https://beam.apache.org/releases/javadoc/2.26.0/org/apache/beam/sdk/io/AvroIO.html

I have one more quick question regarding the “reading records of an unknown 
schema” scenario. In the sample code a PCollection<Foo> is being returned and 
parseGenericRecords requires a parsing logic. What if I just want to get a 
PCollection<GenericRecord> instead of a specific class (e.g. Foo in the 
example)? I guess I can just skip the ParquetIO.parseGenericRecords transform? 
So do I still have to specify the dummy parsing logic like below? Thanks!

p.apply(AvroIO.parseGenericRecords(new SerializableFunction<GenericRecord, Foo>() {
   public Foo apply(GenericRecord record) {
     return record;
   }
 }))

From: Alexey Romanenko 
mailto:aromanenko@gmail.com>>
Reply-To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Date: Wednesday, January 6, 2021 at 10:13 AM
To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Subject: Re: Quick question regarding ParquetIO

Hi Tao,

This jira [1] looks exactly what you are asking but it was merged recently 
(thanks to Anant Damle for working on this!) and it should be available only in 
Beam 2.28.0.

[1] 
https://issues.apache.org/jira/browse/BEAM-11460

Regards,
Alexey



On 6 Jan 2021, at 18:57, Tao Li mailto:t...@zillow.com>> wrote:

Hi beam community,

Quick question about 
ParquetIO<https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/parquet/ParquetIO.html>.
 Is there a way to avoid specifying the avro schema when reading parquet files? 
The reason is that we may not know the parquet schema until we read the files. 
In comparison, spark parquet 
reader<https://spark.apache.org/docs/latest/sql-data-sources-parquet.html>
 does not require such a schema specification.

Please advise. Thanks a lot!




Re: Quick question regarding ParquetIO

2021-01-08 Thread Tao Li
Thanks Alexey for your explanation. That’s also what I was thinking. Parquet 
files already have the schema built in, so it might be feasible to infer a 
coder automatically (like spark parquet reader). It would be great if  we have 
some experts chime in here. @Brian Hulette<mailto:bhule...@google.com> already 
mentioned that the community is working on new DataFrame APIs in Python SDK, 
which are based on the pandas methods and use those methods at construction 
time to determine the schema. I think this is very close to the schema 
inference we have been discussing. Not sure it will be available to Java SDK 
though.


Regarding BEAM-11460, it looks like it may not totally solve my problem. As 
@Alexey Romanenko mentioned, we may still need 
to know the avro or beam schema for the operations that follow the parquet 
read. A dumb question is, with BEAM-11460, after we get a 
PCollection<GenericRecord> from the parquet read (without the need to specify an 
avro schema), is it possible to get the attached avro schema from a GenericRecord 
element of this PCollection?

Really appreciate it if you can help clarify my questions. Thanks!



From: Alexey Romanenko 
Reply-To: "user@beam.apache.org" 
Date: Friday, January 8, 2021 at 4:48 AM
To: "user@beam.apache.org" 
Subject: Re: Quick question regarding ParquetIO

Well, this is how I see it, let me explain.

Since every PCollection is required to have a Coder to materialize the 
intermediate data, we need to have a coder for "PCollection<GenericRecord>" as 
well. If I’m not mistaken, for “GenericRecord" we used to set AvroCoder, which is 
based on an Avro (or Beam too?) schema.

Actually, currently it will throw an exception if you try to use 
“parseGenericRecords()” with a PCollection<GenericRecord> as the output PCollection, 
since it can’t infer a Coder based on the provided “parseFn”. I guess it was done 
intentionally in this way, and I doubt that we can have a proper coder for 
PCollection<GenericRecord> without knowing a schema. Maybe some Avro experts 
here can add more on this if we can somehow overcome it.


On 7 Jan 2021, at 19:44, Tao Li mailto:t...@zillow.com>> wrote:

Alexey,

Why do I need to set AvroCoder? I assume with BEAM-11460 we don’t need to 
specify a schema when reading parquet files to get a PCollection<GenericRecord>. 
Is my understanding correct? Am I missing anything here?

Thanks!

From: Alexey Romanenko 
mailto:aromanenko@gmail.com>>
Reply-To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Date: Thursday, January 7, 2021 at 9:56 AM
To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Subject: Re: Quick question regarding ParquetIO

If you want to get just a PCollection<GenericRecord> as output then you would 
still need to set AvroCoder, but which schema to use in this case?



On 6 Jan 2021, at 19:53, Tao Li mailto:t...@zillow.com>> wrote:

Hi Alexey,

Thank you so much for this info. I will definitely give it a try once 2.28 is 
released.

Regarding this feature, it’s basically mimicking the feature from 
AvroIO: https://beam.apache.org/releases/javadoc/2.26.0/org/apache/beam/sdk/io/AvroIO.html

I have one more quick question regarding the “reading records of an unknown 
schema” scenario. In the sample code a PCollection<Foo> is being returned and 
parseGenericRecords requires a parsing logic. What if I just want to get a 
PCollection<GenericRecord> instead of a specific class (e.g. Foo in the 
example)? I guess I can just skip the ParquetIO.parseGenericRecords transform? 
So do I still have to specify the dummy parsing logic like below? Thanks!

p.apply(AvroIO.parseGenericRecords(new SerializableFunction<GenericRecord, Foo>() {
   public Foo apply(GenericRecord record) {
     return record;
   }
 }))

From: Alexey Romanenko 
mailto:aromanenko@gmail.com>>
Reply-To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Date: Wednesday, January 6, 2021 at 10:13 AM
To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Subject: Re: Quick question regarding ParquetIO

Hi Tao,

This jira [1] looks exactly what you are asking but it was merged recently 
(thanks to Anant Damle for working on this!) and it should be available only in 
Beam 2.28.0.

[1] 
https://issues.apache.org/jira/browse/BEAM-11460

Is there an array explode function/transform?

2021-01-12 Thread Tao Li
Hi community,

Is there a beam function to explode an array (similarly to spark sql’s 
explode())? I did some research but did not find anything.

BTW I think we can potentially use FlatMap to implement the explode 
functionality, but a Beam provided function would be very handy.

Thanks a lot!


Re: Is there an array explode function/transform?

2021-01-12 Thread Tao Li
@Reuven Lax<mailto:re...@google.com> yes I am aware of that transform, but 
that’s different from the explode operation I was referring to: 
https://spark.apache.org/docs/latest/api/sql/index.html#explode

From: Reuven Lax 
Reply-To: "user@beam.apache.org" 
Date: Tuesday, January 12, 2021 at 2:04 PM
To: user 
Subject: Re: Is there an array explode function/transform?

Have you tried Flatten.iterables

On Tue, Jan 12, 2021, 2:02 PM Tao Li mailto:t...@zillow.com>> 
wrote:
Hi community,

Is there a beam function to explode an array (similarly to spark sql’s 
explode())? I did some research but did not find anything.

BTW I think we can potentially use FlatMap to implement the explode 
functionality, but a Beam provided function would be very handy.

Thanks a lot!


Re: Quick question regarding ParquetIO

2021-01-13 Thread Tao Li
@Kobe Feng thank you so much for the insights. 
Agree that it may be a good practice to read all sorts of file formats (e.g. 
parquet, avro etc) into a PCollection<Row> and then perform the schema-aware 
transforms that you are referring to.

The new dataframe APIs for Python SDK sound pretty cool and I can imagine it 
will save a lot of hassles during a beam app development. Hopefully it will be 
added to Java SDK as well.

From: Kobe Feng 
Reply-To: "user@beam.apache.org" 
Date: Friday, January 8, 2021 at 11:39 AM
To: "user@beam.apache.org" 
Subject: Re: Quick question regarding ParquetIO

Tao,
I'm not an expert, but good intuition: what you want is schema-aware (let's say 
schema-based) transformation in Beam, not only for IO but also for other DoFns, 
etc., and possibly schema evolution in the future as well.

This is how I have tried to understand and explain it in other places before: unlike 
Spark and Flink, which leverage internal/built-in types (e.g. the Catalyst struct 
type) for built-in operators as much as possible to infer the schema when IOs can 
convert to them, Beam tries to be capable of handling any type during transforms so 
that people can migrate existing ones to Beam (if you do a Spark mapPartitions with 
your own type, an Encoder can't be avoided either, right?). Also yes, we could 
leverage Beam's own "Row" type to do all transformations, convert all in/out types 
like parquet, avro, orc, etc. at the IO side, and then do schema inference in 
built-in operators based on the Row type when we know they will operate on internal 
types; that's how to avoid the coder or explicit schema there. Going further, IOs 
could provide schema-registry capability and transforms would look it up when 
necessary for schema evolution. I saw Beam put schema-based transformation in its 
goals last year, which will be convenient for people (since normally people would 
rather use built-in types than provide their own types' coders for the following 
operators until they have to); that's why the DataFrame APIs for the Python SDK 
exist, I think.

Kobe


On Fri, Jan 8, 2021 at 9:34 AM Tao Li mailto:t...@zillow.com>> 
wrote:
Thanks Alexey for your explanation. That’s also what I was thinking. Parquet 
files already have the schema built in, so it might be feasible to infer a 
coder automatically (like spark parquet reader). It would be great if  we have 
some experts chime in here. @Brian Hulette<mailto:bhule...@google.com> already 
mentioned that the community is working on new DataFrame APIs in Python SDK, 
which are based on the pandas methods and use those methods at construction 
time to determine the schema. I think this is very close to the schema 
inference we have been discussing. Not sure it will be available to Java SDK 
though.


Regarding BEAM-11460, looks like it may not totally solve my problem. As 
@Alexey Romanenko<mailto:aromanenko@gmail.com> mentioned, we may still need 
to know the avro or beam schema for following operations after the parquet 
read. A dumb question is, with BEAM-11460, after we get a 
PCollection  from parquet read (without the need to specify avro 
schema), is it possible to get the attached avro schema from a GenericRecord 
element of this PCollection?

Really appreciate it if you can help clarify my questions. Thanks!



From: Alexey Romanenko 
mailto:aromanenko@gmail.com>>
Reply-To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Date: Friday, January 8, 2021 at 4:48 AM
To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Subject: Re: Quick question regarding ParquetIO

Well, this is how I see it, let me explain.

Since every PCollection is required to have a Coder to materialize the 
intermediate data, we need to have a coder for "PCollection<GenericRecord>" as 
well. If I’m not mistaken, for “GenericRecord" we used to set AvroCoder, which is 
based on an Avro (or Beam too?) schema.

Actually, currently it will throw an exception if you try to use 
“parseGenericRecords()” with a PCollection<GenericRecord> as the output PCollection, 
since it can’t infer a Coder based on the provided “parseFn”. I guess it was done 
intentionally in this way, and I doubt that we can have a proper coder for 
PCollection<GenericRecord> without knowing a schema. Maybe some Avro experts 
here can add more on this if we can somehow overcome it.

On 7 Jan 2021, at 19:44, Tao Li mailto:t...@zillow.com>> wrote:

Alexey,

Why do I need to set AvroCoder? I assume with BEAM-11460 we don’t need to 
specify a schema when reading parquet files to get a PCollection<GenericRecord>. 
Is my understanding correct? Am I missing anything here?

Thanks!

From: Alexey Romanenko 
mailto:aromanenko@gmail.com>>
Reply-To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Date: Thursday, January 7, 2021 at 9

Re: Is there an array explode function/transform?

2021-01-13 Thread Tao Li
@Kyle Weaver<mailto:kcwea...@google.com> sure thing! So the input/output 
definition for the 
Flatten.Iterables<https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/transforms/Flatten.Iterables.html>
 is:

Input: PCollection<Iterable<T>>
Output: PCollection<T>

The input/output for an explode transform would look like this:
Input: PCollection<Row>, where the row schema has a field which is an array of T
Output: PCollection<Row>, where the array-type field from the input schema is replaced 
with a new field of type T. The elements from the array-type field are 
flattened into multiple rows in the new table (the other fields of the input table 
are just duplicated).

Hope this clarification helps!
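
(There is no built-in explode in Beam today; a rough sketch of one way to write it with FlatMapElements over Rows, assuming for illustration that the array field holds INT32 values and is never null, could look like the following.)

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.Schema.Field;
import org.apache.beam.sdk.schemas.Schema.FieldType;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TypeDescriptor;

class Explode {
  // Emit one output Row per element of the given array field, copying the
  // other fields unchanged. Field name and element type are illustrative.
  static PCollection<Row> explodeInt32Array(PCollection<Row> input, String arrayField) {
    Schema inSchema = input.getSchema();
    // Output schema: same fields, but the array field becomes a scalar INT32.
    Schema.Builder builder = Schema.builder();
    for (Field f : inSchema.getFields()) {
      builder.addField(f.getName(),
          f.getName().equals(arrayField) ? FieldType.INT32 : f.getType());
    }
    Schema outSchema = builder.build();

    return input
        .apply(FlatMapElements.into(TypeDescriptor.of(Row.class))
            .via((Row row) -> {
              List<Row> out = new ArrayList<>();
              // Assumes the array field is non-null in this sketch.
              for (Integer v : row.<Integer>getArray(arrayField)) {
                Row.Builder rb = Row.withSchema(outSchema);
                for (Field f : outSchema.getFields()) {
                  rb.addValue(f.getName().equals(arrayField) ? v : row.getValue(f.getName()));
                }
                out.add(rb.build());
              }
              return out;
            }))
        .setRowSchema(outSchema);
  }
}

FlatMapElements cannot infer a schema for the output Rows, hence the explicit setRowSchema at the end.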

From: Kyle Weaver 
Reply-To: "user@beam.apache.org" 
Date: Tuesday, January 12, 2021 at 4:58 PM
To: "user@beam.apache.org" 
Cc: Reuven Lax 
Subject: Re: Is there an array explode function/transform?

@Reuven Lax<mailto:re...@google.com> yes I am aware of that transform, but 
that’s different from the explode operation I was referring to: 
https://spark.apache.org/docs/latest/api/sql/index.html#explode

How is it different? It'd help if you could provide the signature (input and 
output PCollection types) of the transform you have in mind.

On Tue, Jan 12, 2021 at 4:49 PM Tao Li 
mailto:t...@zillow.com>> wrote:
@Reuven Lax<mailto:re...@google.com> yes I am aware of that transform, but 
that’s different from the explode operation I was referring to: 
https://spark.apache.org/docs/latest/api/sql/index.html#explode

From: Reuven Lax mailto:re...@google.com>>
Reply-To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Date: Tuesday, January 12, 2021 at 2:04 PM
To: user mailto:user@beam.apache.org>>
Subject: Re: Is there an array explode function/transform?

Have you tried Flatten.iterables

On Tue, Jan 12, 2021, 2:02 PM Tao Li mailto:t...@zillow.com>> 
wrote:
Hi community,

Is there a beam function to explode an array (similarly to spark sql’s 
explode())? I did some research but did not find anything.

BTW I think we can potentially use FlatMap to implement the explode 
functionality, but a Beam provided function would be very handy.

Thanks a lot!


Regarding the field ordering after Select.Flattened transform

2021-01-19 Thread Tao Li
Hi community,

I have been experimenting with Select.Flattened transform and noticed that the 
field order in the flattened schema is not consistent with the order of the top 
level fields from the original schema. For example, in the original schema, we 
have field “foo” as the first field and it has a nested field “bar”. After 
applying the flattening transform, the new field “foo.bar” becomes the last 
field in the flattened schema. It seems like the order of the new fields is not 
deterministic in the flattened schema. Is this expected behavior? Don’t 
we guarantee any ordering of the flattened fields (e.g. being consistent with 
the original order)? Thanks!


Re: Quick question regarding ParquetIO

2021-01-19 Thread Tao Li
t specify it) into your own type of objects.

In Parquet/Avro the schema you use to write can differ from the schema you use 
to read; this is done to support schema evolution, so the most general use case 
is to allow users to read from specific, provided versions of the Schema into 
their objects. That's probably one of the reasons why this is not supported.

Since the Schema is part of the Parquet file metadata, I suppose we could somehow 
use it and produce the Schema for the output collection; note, however, that if 
the schema differs across the files this will break at runtime.

Filed https://issues.apache.org/jira/browse/BEAM-11650 to track this.

On Wed, Jan 13, 2021 at 7:42 PM Tao Li  wrote:
>
> @Kobe Feng thank you so much for the insights. Agree that it may be a 
good practice to read all sorts of file formats (e.g. parquet, avro etc) into a 
PCollection and then perform the schema aware transforms that you are 
referring to.
>
>
>
> The new dataframe APIs for Python SDK sound pretty cool and I can imagine 
it will save a lot of hassles during a beam app development. Hopefully it will 
be added to Java SDK as well.
>
>
>
> From: Kobe Feng 
> Reply-To: "user@beam.apache.org" 
> Date: Friday, January 8, 2021 at 11:39 AM
> To: "user@beam.apache.org" 
> Subject: Re: Quick question regarding ParquetIO
>
>
>
> Tao,
> I'm not an expert, and good intuition, all you want is schema awareness 
transformations or let's say schema based transformation in Beam not only for 
IO but also for other DoFn, etc, and possibly have schema revolution in future 
as well.
>
>
> This is how I try to understand and explain in other places before:  Not 
like spark, flink to leverage internal/built-in types (e.g, catalyst struct 
type)  for built-in operators as more as possible to infer the schema when IOs 
could convert to, beam is trying to have capable to handle any type during 
transforms for people to migrate existing ones to beam (Do spark map partition 
func with own type, Encoder can't be avoided as well, right). Also yes, we 
could leverage beam own type "Row" to do all transformations and converting all 
in/out types like parquet, avro, orc, etc at IO side, and then do schema 
inferring in built-in operators base on row type when we know they will operate 
on internal types, that's how to avoid the coder or explicit schema there, more 
further, provide IO for schema registry capability and then transform will 
lookup when necessary for the revolution. I saw beam put schema base 
transformation in goals last year which will be convenient for people (since 
normally people would rather use builtin types instead of providing their own 
types' coder for following operators until we have to), that's why dataframe 
APIs for python SDK here I think.
>
> Kobe
>
>
>
>
> On Fri, Jan 8, 2021 at 9:34 AM Tao Li  wrote:
>
> Thanks Alexey for your explanation. That’s also what I was thinking. 
Parquet files already have the schema built in, so it might be feasible to 
infer a coder automatically (like spark parquet reader). It would be great if  
we have some experts chime in here. @Brian Hulette already mentioned that the 
community is working on new DataFrame APIs in Python SDK, which are based on 
the pandas methods and use those methods at construction time to determine the 
schema. I think this is very close to the schema inference we have been 
discussing. Not sure it will be available to Java SDK though.
>
>
>
> Regarding BEAM-11460, looks like it may not totally solve my problem. As 
@Alexey Romanenko mentioned, we may still need to know the avro or beam schema 
for following operations after the parquet read. A dumb question is, with 
BEAM-11460, after we get a PCollection  from parquet read 
(without the need to specify avro schema), is it possible to get the attached 
avro schema from a GenericRecord element of this PCollection?
>
>
>
> Really appreciate it if you can help clarify my questions. Thanks!
>
>
>
>
>
> From: Alexey Romanenko 
> Reply-To: "user@beam.apache.org" 
> Date: Friday, January 8, 2021 at 4:48 AM
> To: "user@beam.apache.org" 
> Subject: Re: Quick question regarding Parquet

Re: Regarding the field ordering after Select.Flattened transform

2021-01-20 Thread Tao Li
Hi Brian,

Thanks for your quick response. I totally agree that we should not rely on 
any assumption about the field order and we can always specify the order of the 
flattened fields as we want. There is no blocker issue for me with the current 
behavior, but I am just wondering if it may be convenient in some use cases to 
just keep the order (roughly) consistent with the order of the parent 
fields from the original schema.

From: Brian Hulette 
Reply-To: "user@beam.apache.org" 
Date: Wednesday, January 20, 2021 at 9:42 AM
To: user 
Subject: Re: Regarding the field ordering after Select.Flattened transform

This does seem like an odd choice, I suspect this was just a matter of 
convenience of implementation since the javadoc makes no claims about field 
order.

In general schema transforms don't take care to maintain a particular field 
order and I'd recommend against relying on it. Instead fields should be 
addressed by name, either with Row.getValue(String), or by mapping to a user 
type. Is there a reason you want to rely on a particular field order? Maybe 
when writing to certain IOs field order could be important.
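
(A tiny sketch of that by-name access; the schema and field names here are made up to mirror the example in this thread.)

import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.values.Row;

class FieldByName {
  public static void main(String[] args) {
    // Whatever order Select.Flattened produces, fields can be looked up by name.
    Schema flattened = Schema.builder()
        .addStringField("id")
        .addInt32Field("foo.bar")
        .build();
    Row row = Row.withSchema(flattened).addValues("a", 42).build();
    // getValue and the typed getters resolve the field by name, not by position.
    Integer bar = row.getInt32("foo.bar");
    System.out.println(bar);
  }
}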

On Tue, Jan 19, 2021 at 1:36 PM Tao Li 
mailto:t...@zillow.com>> wrote:
Hi community,

I have been experimenting with Select.Flattened transform and noticed that the 
field order in the flattened schema is not consistent with the order of the top 
level fields from the original schema. For example, in the original schema, we 
have field “foo” as the first field and it has a nested field “bar”. After 
applying the flattening transform, the new field “foo.bar” becomes the last 
field in the flattened schema. It seems like the order of the new fields is not 
deterministic in the flattened schema. Is this expected behavior? Don’t 
we guarantee any ordering of the flattened fields (e.g. being consistent with 
the original order)? Thanks!


Overwrite support from ParquetIO

2021-01-25 Thread Tao Li
Hi Beam community,

Does ParquetIO support an overwrite behavior when saving files? More 
specifically, I would like to wipe out all existing parquet files before a 
write operation. Is there a ParquetIO API to support that? Thanks!


Re: Overwrite support from ParquetIO

2021-01-27 Thread Tao Li
@Alexey Romanenko<mailto:aromanenko@gmail.com> thanks for your response. 
Regarding your questions:


  1.  Yes I can purge this directory (e.g. using s3 client from aws sdk) before 
using ParquetIO to save files. The caveat is that this deletion operation is 
not part of the beam pipeline, so it will kick off before the pipeline starts. 
Ideally, this purge operation could be baked into the write operation with 
ParquetIO so the deletion happens right before the file writes.
  2.  Regarding the naming strategy, yes the old files will be overwritten by 
the new files if they have the same file names. However this does not always 
guarantee that all the old files in this directory are wiped out (which is 
actually my requirement). For example we may change the shard count (through 
withNumShards() method) in different pipeline runs and there could be old files 
from previous run that won’t get overwritten in the current run.

Please let me know if this makes sense to you. Thanks!


From: Alexey Romanenko 
Reply-To: "user@beam.apache.org" 
Date: Wednesday, January 27, 2021 at 9:10 AM
To: "user@beam.apache.org" 
Subject: Re: Overwrite support from ParquetIO

What do you mean by “wipe out all existing parquet files before a write 
operation”? Are these all files that already exist in the same output 
directory? Can you purge this directory before or just use a new output 
directory for every pipeline run?

To write Parquet files you need to use ParquetIO.sink() with FileIO.write(), and 
I don’t think it will clean up the output directory before the write. Though, if 
there are name collisions between existing and new output files (it depends 
on the naming strategy used) then I think the old files will be overwritten by new 
ones.
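
(For concreteness, a minimal sketch of that ParquetIO.sink() + FileIO.write() combination; the output directory, suffix and shard count are placeholders.)

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.values.PCollection;

class WriteParquet {
  // Write GenericRecords as Parquet files via FileIO + ParquetIO.sink.
  static void write(PCollection<GenericRecord> records, Schema avroSchema, String outputDir) {
    records.apply(
        FileIO.<GenericRecord>write()
            .via(ParquetIO.sink(avroSchema))
            .to(outputDir)
            .withSuffix(".parquet")
            .withNumShards(10));
  }
}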




On 25 Jan 2021, at 19:10, Tao Li mailto:t...@zillow.com>> 
wrote:

Hi Beam community,

Does ParquetIO support an overwrite behavior when saving files? More 
specifically, I would like to wipe out all existing parquet files before a 
write operation. Is there a ParquetIO API to support that? Thanks!



Re: Overwrite support from ParquetIO

2021-01-27 Thread Tao Li
Thanks @Chamikara Jayalath, I think it’s a good 
idea to define a DoFn for this deletion operation, or maybe a composite 
PTransform that does deletion first followed by ParquetIO.Write.

From: Chamikara Jayalath 
Reply-To: "user@beam.apache.org" 
Date: Wednesday, January 27, 2021 at 3:45 PM
To: user 
Cc: Alexey Romanenko 
Subject: Re: Overwrite support from ParquetIO



On Wed, Jan 27, 2021 at 12:06 PM Tao Li 
mailto:t...@zillow.com>> wrote:
@Alexey Romanenko<mailto:aromanenko@gmail.com> thanks for your response. 
Regarding your questions:


  1.  Yes I can purge this directory (e.g. using s3 client from aws sdk) before 
using ParquetIO to save files. The caveat is that this deletion operation is 
not part of the beam pipeline, so it will kick off before the pipeline starts. 
More ideally, this purge operation could be baked into the write operation with 
ParquetIO so we will have the deletion happen right before the files writes.
  2.  Regarding the naming strategy, yes the old files will be overwritten by 
the new files if they have the same file names. However this does not always 
guarantee that all the old files in this directory are wiped out (which is 
actually my requirement). For example we may change the shard count (through 
withNumShards() method) in different pipeline runs and there could be old files 
from previous run that won’t get overwritten in the current run.

In general, Beam file-based sinks are intended  for writing new files. So I 
don't think existing file-based sinks (including Parquet) will work out of the 
box for replacing existing files or for appending to such files.
But you should be able to delete existing files separately, for example.
(1) As a function that is performed before executing the pipeline.
(2) As a function that is performed from a ParDo step that is executed before 
the ParquetIO.Write step. Also you will have to make sure that the runner does 
not fuse the ParDo step and the Write step. Usually, this can be done by 
performing it in a side-input step (to a ParDo that precedes sink) or by adding 
a GBK/Reshuffle between the two steps.

Thanks,
Cham
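
(A minimal sketch of option (1), purging the output location from the driver program before the pipeline runs; the glob is a placeholder, and IGNORE_MISSING_FILES keeps the purge idempotent.)

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.io.fs.MoveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.options.PipelineOptions;

class OutputPurge {
  // Delete everything matching the output glob before running the pipeline.
  static void purge(PipelineOptions options, String glob) throws IOException {
    FileSystems.setDefaultPipelineOptions(options); // registers S3/GCS/... filesystems
    List<ResourceId> toDelete = new ArrayList<>();
    for (MatchResult.Metadata m : FileSystems.match(glob).metadata()) {
      toDelete.add(m.resourceId());
    }
    FileSystems.delete(toDelete, MoveOptions.StandardMoveOptions.IGNORE_MISSING_FILES);
  }
}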




Please let me know if this makes sense to you. Thanks!


From: Alexey Romanenko 
mailto:aromanenko@gmail.com>>
Reply-To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Date: Wednesday, January 27, 2021 at 9:10 AM
To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Subject: Re: Overwrite support from ParquetIO

What do you mean by “wipe out all existing parquet files before a write 
operation”? Are these all files that already exist in the same output 
directory? Can you purge this directory before or just use a new output 
directory for every pipeline run?

To write Parquet files you need to use ParquetIO.sink() with FileIO.write() and 
I don’t think it will clean up the output directory before write. Though, if 
there are the name collisions between existing and new output files (it depends 
on used naming strategy) then I think the old files will be overwritten by new 
ones.



On 25 Jan 2021, at 19:10, Tao Li mailto:t...@zillow.com>> 
wrote:

Hi Beam community,

Does ParquetIO support an overwrite behavior when saving files? More 
specifically, I would like to wipe out all existing parquet files before a 
write operation. Is there a ParquetIO API to support that? Thanks!



Potential bug with ParquetIO.read when reading arrays

2021-01-28 Thread Tao Li
Hi Beam community,

I am seeing an error when reading an array field using ParquetIO. I was using 
beam 2.25 and the direct runner for testing. Is this a bug or a known issue? Am 
I missing anything here? Please help me root cause this issue. Thanks so much!

Attached are the avro schema and the parquet file. Below is the schema tree as 
a quick visualization. The array field name is “list” and the element type is 
int. You can see this schema defined in the avsc file as well.

root
|-- list: array (nullable = true)
||-- element: integer (containsNull = true)

The beam code is very simple: 
pipeline.apply(ParquetIO.read(avroSchema).from(parquetPath));

Here is the error when running that code:

[direct-runner-worker] INFO 
shaded.org.apache.parquet.hadoop.InternalParquetRecordReader - block read in 
memory in 130 ms. row count = 1
Exception in thread "main" 
org.apache.beam.sdk.Pipeline$PipelineExecutionException: 
java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot 
be cast to java.lang.Number
at 
org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:353)
at 
org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:321)
at 
org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:216)
at 
org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:67)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:317)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:303)
Caused by: java.lang.ClassCastException: 
org.apache.avro.generic.GenericData$Record cannot be cast to java.lang.Number
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:156)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82)
at 
org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:234)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:136)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82)
at 
org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:206)
at 
org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:195)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:130)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:72)
at 
org.apache.beam.sdk.coders.AvroCoder.encode(AvroCoder.java:317)
at org.apache.beam.sdk.coders.Coder.encode(Coder.java:136)
at 
org.apache.beam.sdk.util.CoderUtils.encodeToSafeStream(CoderUtils.java:82)
at 
org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:66)
at 
org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:51)
at 
org.apache.beam.sdk.util.CoderUtils.clone(CoderUtils.java:141)
at 
org.apache.beam.sdk.util.MutationDetectors$CodedValueMutationDetector.(MutationDetectors.java:115)
at 
org.apache.beam.sdk.util.MutationDetectors.forValueWithCoder(MutationDetectors.java:46)
at 
org.apache.beam.runners.direct.ImmutabilityCheckingBundleFactory$ImmutabilityEnforcingBundle.add(ImmutabilityCheckingBundleFactory.java:112)
at 
org.apache.beam.runners.direct.ParDoEvaluator$BundleOutputManager.output(ParDoEvaluator.java:301)
at 
org.apache.beam.repackaged.direct_java.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:267)
at 
org.apache.beam.repackaged.direct_java.runners.core.SimpleDoFnRunner.access$900(SimpleDoFnRunner.java:79)
at 
org.apache.beam.repackaged.direct_java.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:413)
at 
org.apache.beam.repackaged.direct_java.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:401)
at 
org.apache.beam.sdk.io.parquet.ParquetIO$ReadFiles$ReadFn.processElement(ParquetIO.java:646)




array-schema.avsc
Description: array-schema.avsc


part-00039-90fcc772-afa7-4947-b735-1c87683b26fd-c000.snappy.parquet
Description:  part-00039-90fcc772-afa7-4947-b735-1c87683b26fd-c000.snappy.parquet


Re: Potential bug with ParquetIO.read when reading arrays

2021-01-28 Thread Tao Li
BTW I tried avro 1.8 and 1.9 and both have the same error. So we can probably 
rule out any avro issue.

From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Thursday, January 28, 2021 at 9:07 AM
To: "user@beam.apache.org" 
Subject: Potential bug with ParquetIO.read when reading arrays

Hi Beam community,

I am seeing an error when reading an array field using ParquetIO. I was using 
beam 2.25 and the direct runner for testing. Is this a bug or a known issue? Am 
I missing anything here? Please help me root cause this issue. Thanks so much!

Attached are the avro schema and the parquet file. Below is the schema tree as 
a quick visualization. The array field name is “list” and the element type is 
int. You can see this schema defined in the avsc file as well.

root
|-- list: array (nullable = true)
||-- element: integer (containsNull = true)

The beam code is very simple: 
pipeline.apply(ParquetIO.read(avroSchema).from(parquetPath));

Here is the error when running that code:

[direct-runner-worker] INFO 
shaded.org.apache.parquet.hadoop.InternalParquetRecordReader - block read in 
memory in 130 ms. row count = 1
Exception in thread "main" 
org.apache.beam.sdk.Pipeline$PipelineExecutionException: 
java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot 
be cast to java.lang.Number
at 
org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:353)
at 
org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:321)
at 
org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:216)
at 
org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:67)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:317)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:303)
Caused by: java.lang.ClassCastException: 
org.apache.avro.generic.GenericData$Record cannot be cast to java.lang.Number
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:156)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82)
at 
org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:234)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:136)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82)
at 
org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:206)
at 
org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:195)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:130)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:72)
at 
org.apache.beam.sdk.coders.AvroCoder.encode(AvroCoder.java:317)
at org.apache.beam.sdk.coders.Coder.encode(Coder.java:136)
at 
org.apache.beam.sdk.util.CoderUtils.encodeToSafeStream(CoderUtils.java:82)
at 
org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:66)
at 
org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:51)
at 
org.apache.beam.sdk.util.CoderUtils.clone(CoderUtils.java:141)
at 
org.apache.beam.sdk.util.MutationDetectors$CodedValueMutationDetector.(MutationDetectors.java:115)
at 
org.apache.beam.sdk.util.MutationDetectors.forValueWithCoder(MutationDetectors.java:46)
at 
org.apache.beam.runners.direct.ImmutabilityCheckingBundleFactory$ImmutabilityEnforcingBundle.add(ImmutabilityCheckingBundleFactory.java:112)
at 
org.apache.beam.runners.direct.ParDoEvaluator$BundleOutputManager.output(ParDoEvaluator.java:301)
at 
org.apache.beam.repackaged.direct_java.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:267)
at 
org.apache.beam.repackaged.direct_java.runners.core.SimpleDoFnRunner.access$900(SimpleDoFnRunner.java:79)
at 
org.apache.beam.repackaged.direct_java.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:413)
at 
org.apache.beam.repackaged.direct_java.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:401)
at 
org.apache.beam.sdk.io.parquet.ParquetIO$ReadFiles$ReadFn.processElement(ParquetIO.java:646)




Re: Overwrite support from ParquetIO

2021-01-28 Thread Tao Li
Thanks everyone for your inputs here! Really helpful information!

From: Chamikara Jayalath 
Reply-To: "user@beam.apache.org" 
Date: Thursday, January 28, 2021 at 10:54 AM
To: user 
Subject: Re: Overwrite support from ParquetIO



On Thu, Jan 28, 2021 at 9:14 AM Alexey Romanenko 
mailto:aromanenko@gmail.com>> wrote:
1. Personally, I’d recommend purging the output directory (if it’s needed, of 
course) before starting your pipeline, as a part of your driver program and not 
in a DoFn, in order (as Reuven mentioned before) to avoid potential side effects. 
Another option is to write the files into a new directory with a unique name and 
then, after your pipeline has finished, atomically rename it. Though, of 
course, the final solution depends on the internals of your application and 
environment.

Imho, FS manipulations (like this) should be a part of the driver program and not 
of a distributed data processing pipeline, where it can be quite tricky to do 
reliably.

2. Yes, for sure we can’t rely on the fact that the old files will be 
overwritten by new files. Even more, we need to make sure that they won’t be 
overwritten to guarantee that we won’t lose them unexpectedly.

+1. Also note that due to dynamic work rebalancing, file names might not 
exactly match; only the prefix will match. So two runs of the same pipeline, 
even on the same input, might produce a different number of shards (hence a 
different number of filenames with the same prefix).



On 27 Jan 2021, at 21:06, Tao Li mailto:t...@zillow.com>> 
wrote:

@Alexey Romanenko<mailto:aromanenko@gmail.com> thanks for your response. 
Regarding your questions:


  1.  Yes I can purge this directory (e.g. using s3 client from aws sdk) before 
using ParquetIO to save files. The caveat is that this deletion operation is 
not part of the beam pipeline, so it will kick off before the pipeline starts. 
More ideally, this purge operation could be baked into the write operation with 
ParquetIO so we will have the deletion happen right before the files writes.
  2.  Regarding the naming strategy, yes the old files will be overwritten by 
the new files if they have the same file names. However this does not always 
guarantee that all the old files in this directory are wiped out (which is 
actually my requirement). For example we may change the shard count (through 
withNumShards() method) in different pipeline runs and there could be old files 
from previous run that won’t get overwritten in the current run.

Please let me know if this makes sense to you. Thanks!


From: Alexey Romanenko 
mailto:aromanenko@gmail.com>>
Reply-To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Date: Wednesday, January 27, 2021 at 9:10 AM
To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Subject: Re: Overwrite support from ParquetIO

What do you mean by “wipe out all existing parquet files before a write 
operation”? Are these all files that already exist in the same output 
directory? Can you purge this directory before or just use a new output 
directory for every pipeline run?

To write Parquet files you need to use ParquetIO.sink() with FileIO.write() and 
I don’t think it will clean up the output directory before write. Though, if 
there are the name collisions between existing and new output files (it depends 
on used naming strategy) then I think the old files will be overwritten by new 
ones.



On 25 Jan 2021, at 19:10, Tao Li mailto:t...@zillow.com>> 
wrote:

Hi Beam community,

Does ParquetIO support an overwrite behavior when saving files? More 
specifically, I would like to wipe out all existing parquet files before a 
write operation. Is there a ParquetIO API to support that? Thanks!



Re: Potential bug with ParquetIO.read when reading arrays

2021-01-29 Thread Tao Li
Hi community,

Can someone take a look at this issue? It is kind of a blocker to me right now. 
Really appreciate your help!

From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Thursday, January 28, 2021 at 6:13 PM
To: "user@beam.apache.org" 
Subject: Re: Potential bug with ParquetIO.read when reading arrays

BTW I tried avro 1.8 and 1.9 and both have the same error. So we can probably 
rule out any avro issue.

From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Thursday, January 28, 2021 at 9:07 AM
To: "user@beam.apache.org" 
Subject: Potential bug with ParquetIO.read when reading arrays

Hi Beam community,

I am seeing an error when reading an array field using ParquetIO. I was using 
beam 2.25 and the direct runner for testing. Is this a bug or a known issue? Am 
I missing anything here? Please help me root cause this issue. Thanks so much!

Attached are the avro schema and the parquet file. Below is the schema tree as 
a quick visualization. The array field name is “list” and the element type is 
int. You can see this schema defined in the avsc file as well.

root
|-- list: array (nullable = true)
||-- element: integer (containsNull = true)

The beam code is very simple: 
pipeline.apply(ParquetIO.read(avroSchema).from(parquetPath));

Here is the error when running that code:

[direct-runner-worker] INFO 
shaded.org.apache.parquet.hadoop.InternalParquetRecordReader - block read in 
memory in 130 ms. row count = 1
Exception in thread "main" 
org.apache.beam.sdk.Pipeline$PipelineExecutionException: 
java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot 
be cast to java.lang.Number
at 
org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:353)
at 
org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:321)
at 
org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:216)
at 
org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:67)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:317)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:303)
Caused by: java.lang.ClassCastException: 
org.apache.avro.generic.GenericData$Record cannot be cast to java.lang.Number
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:156)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82)
at 
org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:234)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:136)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82)
at 
org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:206)
at 
org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:195)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:130)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:72)
at 
org.apache.beam.sdk.coders.AvroCoder.encode(AvroCoder.java:317)
at org.apache.beam.sdk.coders.Coder.encode(Coder.java:136)
at 
org.apache.beam.sdk.util.CoderUtils.encodeToSafeStream(CoderUtils.java:82)
at 
org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:66)
at 
org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:51)
at 
org.apache.beam.sdk.util.CoderUtils.clone(CoderUtils.java:141)
at 
org.apache.beam.sdk.util.MutationDetectors$CodedValueMutationDetector.(MutationDetectors.java:115)
at 
org.apache.beam.sdk.util.MutationDetectors.forValueWithCoder(MutationDetectors.java:46)
at 
org.apache.beam.runners.direct.ImmutabilityCheckingBundleFactory$ImmutabilityEnforcingBundle.add(ImmutabilityCheckingBundleFactory.java:112)
at 
org.apache.beam.runners.direct.ParDoEvaluator$BundleOutputManager.output(ParDoEvaluator.java:301)
at 
org.apache.beam.repackaged.direct_java.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:267)
at 
org.apache.beam.repackaged.direct_java.runners.core.SimpleDoFnRunner.access$900(SimpleDoFnRunner.java:79)
at 
org.apache.beam.repackaged.direct_java.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:413)
at 
org.apache.beam.repackaged.direct_java.runners.core.SimpleDoFnRunner$DoFnProcessContext.ou

Re: Potential bug with ParquetIO.read when reading arrays

2021-01-29 Thread Tao Li
OK, I think this issue is due to an incompatibility between the parquet files 
(created with spark 2.4) and the parquet version that ParquetIO 2.25 depends on. 
It seems to work after I switch to the spark runner (from the direct runner) and 
run the beam app in a spark cluster. I assume that by doing this I am basically 
using the parquet jars from the spark distribution directly, so now everything is 
compatible.

From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Friday, January 29, 2021 at 7:45 AM
To: "user@beam.apache.org" 
Subject: Re: Potential bug with ParquetIO.read when reading arrays

Hi community,

Can someone take a look at this issue? It is kind of a blocker to me right now. 
Really appreciate your help!

From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Thursday, January 28, 2021 at 6:13 PM
To: "user@beam.apache.org" 
Subject: Re: Potential bug with ParquetIO.read when reading arrays

BTW I tried avro 1.8 and 1.9 and both have the same error. So we can probably 
rule out any avro issue.

From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Thursday, January 28, 2021 at 9:07 AM
To: "user@beam.apache.org" 
Subject: Potential bug with ParquetIO.read when reading arrays

Hi Beam community,

I am seeing an error when reading an array field using ParquetIO. I was using 
beam 2.25 and the direct runner for testing. Is this a bug or a known issue? Am 
I missing anything here? Please help me root cause this issue. Thanks so much!

Attached are the avro schema and the parquet file. Below is the schema tree as 
a quick visualization. The array field name is “list” and the element type is 
int. You can see this schema defined in the avsc file as well.

root
|-- list: array (nullable = true)
||-- element: integer (containsNull = true)

The beam code is very simple: 
pipeline.apply(ParquetIO.read(avroSchema).from(parquetPath));

Here is the error when running that code:

[direct-runner-worker] INFO 
shaded.org.apache.parquet.hadoop.InternalParquetRecordReader - block read in 
memory in 130 ms. row count = 1
Exception in thread "main" 
org.apache.beam.sdk.Pipeline$PipelineExecutionException: 
java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot 
be cast to java.lang.Number
at 
org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:353)
at 
org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:321)
at 
org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:216)
at 
org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:67)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:317)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:303)
Caused by: java.lang.ClassCastException: 
org.apache.avro.generic.GenericData$Record cannot be cast to java.lang.Number
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:156)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82)
at 
org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:234)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:136)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82)
at 
org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:206)
at 
org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:195)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:130)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:72)
at 
org.apache.beam.sdk.coders.AvroCoder.encode(AvroCoder.java:317)
at org.apache.beam.sdk.coders.Coder.encode(Coder.java:136)
at 
org.apache.beam.sdk.util.CoderUtils.encodeToSafeStream(CoderUtils.java:82)
at 
org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:66)
at 
org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:51)
at 
org.apache.beam.sdk.util.CoderUtils.clone(CoderUtils.java:141)
at 
org.apache.beam.sdk.util.MutationDetectors$CodedValueMutationDetector.(MutationDetectors.java:115)
at 
org.apache.beam.sdk.util.MutationDetectors.forValueWithCoder(MutationDetectors.java:46)
at 
org.apache.beam.runners.direct.ImmutabilityCheckingBundleFactory$ImmutabilityEnforcingBundle.add(ImmutabilityCheckingBundleFactory.java:112)
at 
org.apache.beam.runners.di

Re: Potential bug with ParquetIO.read when reading arrays

2021-01-29 Thread Tao Li
@Chamikara Jayalath<mailto:chamik...@google.com> Sorry about the confusion. But 
I did more testing, and using the spark runner actually yields the same error:

java.lang.ClassCastException: shaded.org.apache.avro.generic.GenericData$Record 
cannot be cast to java.lang.Number
at 
shaded.org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:130)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:192)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:120)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
at 
org.apache.beam.sdk.coders.AvroCoder.encode(AvroCoder.java:317)
at org.apache.beam.sdk.coders.Coder.encode(Coder.java:136)
at org.apache.beam.sdk.coders.KvCoder.encode(KvCoder.java:73)
at org.apache.beam.sdk.coders.KvCoder.encode(KvCoder.java:37)
at 
org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:591)
at 
org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:582)
at 
org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:542)
at 
org.apache.beam.runners.spark.coders.CoderHelpers.toByteArray(CoderHelpers.java:55)
at 
org.apache.beam.runners.spark.translation.GroupNonMergingWindowsFunctions.lambda$groupByKeyAndWindow$c9b6f5c4$1(GroupNonMergingWindowsFunctions.java:86)
at 
org.apache.beam.runners.spark.translation.GroupNonMergingWindowsFunctions.lambda$bringWindowToKey$0(GroupNonMergingWindowsFunctions.java:129)
at 
org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.Iterators$6.transform(Iterators.java:785)
at 
org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
at 
scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:445)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

From: Chamikara Jayalath 
Reply-To: "user@beam.apache.org" 
Date: Friday, January 29, 2021 at 10:53 AM
To: user 
Subject: Re: Potential bug with ParquetIO.read when reading arrays

Thanks. It might be something good to document in case other users run into 
this as well. Can you file a JIRA with the details ?


On Fri, Jan 29, 2021 at 10:47 AM Tao Li 
mailto:t...@zillow.com>> wrote:
OK, I think this issue is due to an incompatibility between the parquet files 
(created with spark 2.4) and the parquet version pulled in as a dependency of 
ParquetIO 2.25. It seems to work after I switch to the spark runner (from the 
direct runner) and run the beam app in a spark cluster. I assume by doing this 
I am basically using the parquet jars from the spark distribution directly, so 
now everything is compatible.

From: Tao Li mailto:t...@zillow.com>>
Reply-To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Date: Friday, January 29, 2021 at 7:45 AM
To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Subject: Re: Potential bug with ParquetIO.read when reading arrays

Hi community,

Can someone tak

Re: Potential bug with ParquetIO.read when reading arrays

2021-01-29 Thread Tao Li
Thanks @Chamikara Jayalath<mailto:chamik...@google.com> I created this jira: 
https://issues.apache.org/jira/browse/BEAM-11721

From: Chamikara Jayalath 
Date: Friday, January 29, 2021 at 2:47 PM
To: Tao Li 
Cc: "user@beam.apache.org" 
Subject: Re: Potential bug with ParquetIO.read when reading arrays

Sounds like a bug. I think JIRA with a test case will still be helpful.

On Fri, Jan 29, 2021 at 2:33 PM Tao Li 
mailto:t...@zillow.com>> wrote:
@Chamikara Jayalath<mailto:chamik...@google.com> Sorry about the confusion. But 
I did more testing and using the spark runner actually yields the same error:

java.lang.ClassCastException: shaded.org.apache.avro.generic.GenericData$Record 
cannot be cast to java.lang.Number
at 
shaded.org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:130)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:192)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:120)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
at 
org.apache.beam.sdk.coders.AvroCoder.encode(AvroCoder.java:317)
at org.apache.beam.sdk.coders.Coder.encode(Coder.java:136)
at org.apache.beam.sdk.coders.KvCoder.encode(KvCoder.java:73)
at org.apache.beam.sdk.coders.KvCoder.encode(KvCoder.java:37)
at 
org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:591)
at 
org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:582)
at 
org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:542)
at 
org.apache.beam.runners.spark.coders.CoderHelpers.toByteArray(CoderHelpers.java:55)
at 
org.apache.beam.runners.spark.translation.GroupNonMergingWindowsFunctions.lambda$groupByKeyAndWindow$c9b6f5c4$1(GroupNonMergingWindowsFunctions.java:86)
at 
org.apache.beam.runners.spark.translation.GroupNonMergingWindowsFunctions.lambda$bringWindowToKey$0(GroupNonMergingWindowsFunctions.java:129)
at 
org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.Iterators$6.transform(Iterators.java:785)
at 
org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
at 
scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:445)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

From: Chamikara Jayalath mailto:chamik...@google.com>>
Reply-To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Date: Friday, January 29, 2021 at 10:53 AM
To: user mailto:user@beam.apache.org>>
Subject: Re: Potential bug with ParquetIO.read when reading arrays

Thanks. It might be something good to document in case other users run into 
this as well. Can you file a JIRA with the details ?


On Fri, Jan 29, 2021 at 10:47 AM Tao Li 
mailto:t...@zillow.com>> wrote:
OK I think this issue is due to incompatibility between the parquet files 
(created with spark 2.4) and parquet version as a dependency of ParquetIO 2.25. 
It seems working after I switch t

Re: Potential bug with ParquetIO.read when reading arrays

2021-01-30 Thread Tao Li
Hi all,

I have made good progress figuring out the root cause of this issue. The 
details are in BEAM-11721. I asked some questions at the end of the jira and I 
am just duplicating them here for visibility. Thanks so much for the help and 
support from the community!


  1.  It's not quite intuitive to create an avro schema for ParquetIO that 
contains spark-defined fields ("list", "element", etc.) when we are ingesting 
spark-created parquet files. Is it possible to support the standard avro 
definition for the array type, like 
("type":"array","elementType":"integer","containsNull":true)? Could beam do the 
schema translation under the hood to avoid the hassle for users?
  2.  Taking a step back, why does ParquetIO require an avro schema 
specification, while AvroParquetReader does not actually require one? I 
briefly looked at the ParquetIO source code but have not figured it out yet.



From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Friday, January 29, 2021 at 3:37 PM
To: Chamikara Jayalath , "user@beam.apache.org" 

Subject: Re: Potential bug with ParquetIO.read when reading arrays

Thanks @Chamikara Jayalath<mailto:chamik...@google.com> I created this jira: 
https://issues.apache.org/jira/browse/BEAM-11721

From: Chamikara Jayalath 
Date: Friday, January 29, 2021 at 2:47 PM
To: Tao Li 
Cc: "user@beam.apache.org" 
Subject: Re: Potential bug with ParquetIO.read when reading arrays

Sounds like a bug. I think JIRA with a test case will still be helpful.

On Fri, Jan 29, 2021 at 2:33 PM Tao Li 
mailto:t...@zillow.com>> wrote:
@Chamikara Jayalath<mailto:chamik...@google.com> Sorry about the confusion. But 
I did more testing and using the spark runner actually yields the same error:

java.lang.ClassCastException: shaded.org.apache.avro.generic.GenericData$Record 
cannot be cast to java.lang.Number
at 
shaded.org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:130)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:192)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:120)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
at 
org.apache.beam.sdk.coders.AvroCoder.encode(AvroCoder.java:317)
at org.apache.beam.sdk.coders.Coder.encode(Coder.java:136)
at org.apache.beam.sdk.coders.KvCoder.encode(KvCoder.java:73)
at org.apache.beam.sdk.coders.KvCoder.encode(KvCoder.java:37)
at 
org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:591)
at 
org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:582)
at 
org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:542)
at 
org.apache.beam.runners.spark.coders.CoderHelpers.toByteArray(CoderHelpers.java:55)
at 
org.apache.beam.runners.spark.translation.GroupNonMergingWindowsFunctions.lambda$groupByKeyAndWindow$c9b6f5c4$1(GroupNonMergingWindowsFunctions.java:86)
at 
org.apache.beam.runners.spark.translation.GroupNonMergingWindowsFunctions.lambda$bringWindowToKey$0(GroupNonMergingWindowsFunctions.java:129)
at 
org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.Iterators$6.transform(Iterators.java:785)
at 
org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
at 
scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:445)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(

Re: Potential bug with ParquetIO.read when reading arrays

2021-01-30 Thread Tao Li
Please let me rephrase my question. It's understandable that it may be good 
practice to specify an avro schema when reading parquet files (e.g. to support 
schema evolution). But sometimes the overhead outweighs the benefits. Given 
that AvroParquetReader does not require an avro schema, is it possible to make 
the avro schema specification optional for ParquetIO.read? Thanks!

From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Saturday, January 30, 2021 at 1:54 PM
To: "user@beam.apache.org" 
Subject: Re: Potential bug with ParquetIO.read when reading arrays

Hi all,

I have made a good progress to figure out the root cause to this issue. The 
details are in BEAM-11721. I asked some questions at the end of the jira and I 
am just duplicating it here for visibility. Thanks so much for the help and 
support from the community!


  1.  It's not quite intuitive to create a avro schema for ParquetIO, which 
contains spark defined fields ("list", "element" etc), when we are ingesting 
spark created parquet files. Is it possible to support the standard avro 
definition for the array type like 
(“type":"array","elementType":"integer","containsNull":true”)? Can beam do the 
schema translation under the hood to avoid the hassle for the users?
  2.  Taking a step back, why does ParquetIO require an avro schema 
specification, while AvroParquetReader does not actually require the schema? I 
briefly looked at the ParquetIO source code but has not figured it out yet.



From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Friday, January 29, 2021 at 3:37 PM
To: Chamikara Jayalath , "user@beam.apache.org" 

Subject: Re: Potential bug with ParquetIO.read when reading arrays

Thanks @Chamikara Jayalath<mailto:chamik...@google.com> I created this jira: 
https://issues.apache.org/jira/browse/BEAM-11721

From: Chamikara Jayalath 
Date: Friday, January 29, 2021 at 2:47 PM
To: Tao Li 
Cc: "user@beam.apache.org" 
Subject: Re: Potential bug with ParquetIO.read when reading arrays

Sounds like a bug. I think JIRA with a test case will still be helpful.

On Fri, Jan 29, 2021 at 2:33 PM Tao Li 
mailto:t...@zillow.com>> wrote:
@Chamikara Jayalath<mailto:chamik...@google.com> Sorry about the confusion. But 
I did more testing and using the spark runner actually yields the same error:

java.lang.ClassCastException: shaded.org.apache.avro.generic.GenericData$Record 
cannot be cast to java.lang.Number
at 
shaded.org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:130)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:192)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:120)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
shaded.org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
at 
org.apache.beam.sdk.coders.AvroCoder.encode(AvroCoder.java:317)
at org.apache.beam.sdk.coders.Coder.encode(Coder.java:136)
at org.apache.beam.sdk.coders.KvCoder.encode(KvCoder.java:73)
at org.apache.beam.sdk.coders.KvCoder.encode(KvCoder.java:37)
at 
org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:591)
at 
org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:582)
at 
org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:542)
at 
org.apache.beam.runners.spark.coders.CoderHelpers.toByteArray(CoderHelpers.java:55)
at 
org.apache.beam.runners.spark.translation.GroupNonMergingWindowsFunctions.lambda$groupByKeyAndWindow$c9b6f5c4$1(GroupNonMergingWindowsFunctions.java:86)
at 
org.apache.beam.runners.spark.translation.GroupNonMergingWindowsFunctions.lambda$bringWindowToKey$0(GroupNonMergingWindowsFunctions.java:129)
at 
org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.Iterators$6.transform(Iterators.java:785)
at 
org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.Tran

Re: Potential bug with ParquetIO.read when reading arrays

2021-02-03 Thread Tao Li
Hi all,

Thanks for all the discussions so far (including discussions in BEAM-11721 and 
offline discussions). We will use BEAM-11650 to track the request of making 
avro schema optional for ParquetIO.read operation. I can potentially work on 
that ticket later.

There is another issue that I hope to get some help with from the beam 
community. I have posted this question on the beam slack channels but I am 
duplicating it here for visibility. Basically I am using ParquetIO (which uses 
AvroParquetReader) to read spark-created parquet files (please see attached). 
The inspection result is below. You can see the spark schema is very simple: 
just a single field holding an array of integers:

creator: parquet-mr version 1.10.1 (build 
815bcfa4a4aacf66d207b3dc692150d16b5740b9)
extra: org.apache.spark.sql.parquet.row.metadata = 
{"type":"struct","fields":[{"name":"numbers","type":
{"type":"array","elementType":"integer","containsNull":true}
,"nullable":true,"metadata":{}}]}
file schema: spark_schema

numbers: OPTIONAL F:1
.list: REPEATED F:1
..element: OPTIONAL INT32 R:1 D:3


When I use ParquetIO to read this file, the Avro schema for the 
PCollection becomes:
{
  "type": "record",
 "name": "spark_schema",
  "fields": [
{
  "name": "numbers",
  "type": [
"null",
{
  "type": "array",
  "items": {
"type": "record",
"name": "list",
"fields": [
  {
"name": "element",
"type": [
  "null",
  "int"
],
"default": null
  }
]
  }
}
  ],
  "default": null
}
  ]
}


You can see that the array's element type becomes a record type (which contains 
an "element" field). The reason is probably that spark's parquet writer 
internally defines a “list” record type.

The problem is that this avro schema is not the one I want to deal with in the 
downstream beam transforms. Instead I want to retain the original schema defined 
in spark, which is simply an array of integers. Is there an easy way to retain 
the original schema when using ParquetIO to read spark-created files? Did 
anyone run into this need? Please advise. Thanks!



From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Saturday, January 30, 2021 at 11:21 PM
To: "user@beam.apache.org" 
Subject: Re: Potential bug with ParquetIO.read when reading arrays

Please let me rephrase my question. It's understandable that it might be a good 
practice to specify avro schema when reading parquet files (e.g. to support 
schema evolution etc). But sometimes the overhead is more than the benefits. 
Given that AvroParquetReader does not require an avro schema, is it possible to 
make the avro schema specification optional for ParquetIO.read? Thanks!

From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Saturday, January 30, 2021 at 1:54 PM
To: "user@beam.apache.org" 
Subject: Re: Potential bug with ParquetIO.read when reading arrays

Hi all,

I have made a good progress to figure out the root cause to this issue. The 
details are in BEAM-11721. I asked some questions at the end of the jira and I 
am just duplicating it here for visibility. Thanks so much for the help and 
support from the community!


  1.  It's not quite intuitive to create a avro schema for ParquetIO, which 
contains spark defined fields ("list", "element" etc), when we are ingesting 
spark created parquet files. Is it possible to support the standard avro 
definition for the array type like 
(“type":"array","elementType":"integer","containsNull":true”)? Can beam do the 
schema translation under the hood to avoid the hassle for the users?
  2.  Taking a step back, why does ParquetIO require an avro schema 
specification, while AvroParquetReader does not actually require the schema? I 
briefly looked at the ParquetIO source code but has not figured it out yet.



From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Friday, January 29, 2021 at 3:37 PM
To: Chamikara Jayalath , "user@beam.apache.org" 

Subject: Re: Potential bug with ParquetIO.read when reading arrays

Thanks @Chamikara Jayalath<mailto:chamik...@google.com> I created this jira: 
https://issues.apache.org/jira/browse/BEAM-11721

From: Chamikara Jayalath 
Date: Friday, January 29, 2021 at 2:47 PM
To: Tao Li 
Cc: "user

Re: Potential bug with ParquetIO.read when reading arrays

2021-02-03 Thread Tao Li
I am also wondering if leveraging this parquet setting 
"parquet.avro.add-list-element-records" along with 
BEAM-11527<https://issues.apache.org/jira/browse/BEAM-11527> can solve my 
problem...

From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Wednesday, February 3, 2021 at 11:55 AM
To: "user@beam.apache.org" 
Subject: Re: Potential bug with ParquetIO.read when reading arrays

Hi all,

Thanks for all the discussions so far (including discussions in BEAM-11721 and 
offline discussions). We will use BEAM-11650 to track the request of making 
avro schema optional for ParquetIO.read operation. I can potentially work on 
that ticket later.

There is another issue that I hope to get some help with from the beam 
community. I have posted this question on beam slack channels but I am 
duplicating it here for visibility. Basically I am using ParquetIO (which uses 
AvroParquetReader) to read spark created parquet files (please see attached). 
The inspection result is below. You can see the spark schema is very simple, 
which is just a field of an array of integers:

creator: parquet-mr version 1.10.1 (build 
815bcfa4a4aacf66d207b3dc692150d16b5740b9)
extra: org.apache.spark.sql.parquet.row.metadata = 
{"type":"struct","fields":[{"name":"numbers","type":
{"type":"array","elementType":"integer","containsNull":true}
,"nullable":true,"metadata":{}}]}
file schema: spark_schema

numbers: OPTIONAL F:1
.list: REPEATED F:1
..element: OPTIONAL INT32 R:1 D:3


When I use ParquetIO to read this file, the Avro schema for the 
PCollection becomes:
{
  "type": "record",
 "name": "spark_schema",
  "fields": [
{
  "name": "numbers",
  "type": [
"null",
{
  "type": "array",
  "items": {
"type": "record",
"name": "list",
"fields": [
  {
"name": "element",
"type": [
  "null",
  "int"
],
"default": null
  }
]
  }
}
  ],
  "default": null
}
  ]
}


You can see that the schema becomes an array of record type (which contains a 
"element" field). The reason is probably that internally spark parquet is 
defining a “list” record type.

The problem is that this avro schema is not the one I want deal with in the 
following beam transforms. Instead I want to retain the original schema defined 
in spark which is simply an array of integers. Is there an easy way to retain 
the original schema when using ParquetIO to read spark created fields? Did 
anyone run into this need? Please advise. Thanks!



From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Saturday, January 30, 2021 at 11:21 PM
To: "user@beam.apache.org" 
Subject: Re: Potential bug with ParquetIO.read when reading arrays

Please let me rephrase my question. It's understandable that it might be a good 
practice to specify avro schema when reading parquet files (e.g. to support 
schema evolution etc). But sometimes the overhead is more than the benefits. 
Given that AvroParquetReader does not require an avro schema, is it possible to 
make the avro schema specification optional for ParquetIO.read? Thanks!

From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Saturday, January 30, 2021 at 1:54 PM
To: "user@beam.apache.org" 
Subject: Re: Potential bug with ParquetIO.read when reading arrays

Hi all,

I have made a good progress to figure out the root cause to this issue. The 
details are in BEAM-11721. I asked some questions at the end of the jira and I 
am just duplicating it here for visibility. Thanks so much for the help and 
support from the community!


  1.  It's not quite intuitive to create a avro schema for ParquetIO, which 
contains spark defined fields ("list", "element" etc), when we are ingesting 
spark created parquet files. Is it possible to support the standard avro 
definition for the array type like 
(“type":"array","elementType":"integer","containsNull":true”)? Can beam do the 
schema translation under the hood to avoid the hassle for the users?
  2.  Taking a step back, why does ParquetIO require an avro schema 
specification, while AvroParquetReader does not actually require the schema? I 
briefly looked at the ParquetIO source code but has not figured it out yet.



From: Tao Li 
Reply-To: "user@beam.apache.org" 
Da

Regarding Beam 2.28 release timeline

2021-02-17 Thread Tao Li
Hi Beam community,

I am looking forward to the Beam 2.28 release, which will probably include 
BEAM-11527. We will depend on BEAM-11527 for a major work item on my side. Can 
someone please provide an ETA for the 2.28 release? Thanks so much!



Re: Apache Beam's UX Research Findings Readout

2021-02-18 Thread Tao Li
Hi @Carlos Camacho, is there a recording of this meeting? Thanks!

From: Carlos Camacho 
Reply-To: "user@beam.apache.org" 
Date: Thursday, February 11, 2021 at 9:06 AM
To: "user@beam.apache.org" 
Subject: Apache Beam's UX Research Findings Readout

Hi everyone,
This is a friendly reminder to join the UX Research Findings Readout.

We are live now! Join us: https://meet.google.com/xfc-majk-byk

--

Carlos Camacho | WIZELINE

UX Designer

carlos.cama...@wizeline.com

Amado Nervo 2200, Esfera P6, Col. Jardines del Sol, 45050 Zapopan, Jal.



Potential bug with BEAM-11460?

2021-02-23 Thread Tao Li
Hi Beam community,

I cannot log into the Beam jira, so I am asking this question here. I am testing 
this new feature from Beam 2.28 and am seeing the error below:

Exception in thread "main" java.lang.IllegalArgumentException: Unable to infer 
coder for output of parseFn. Specify it explicitly using withCoder().
at 
org.apache.beam.sdk.io.parquet.ParquetIO$ParseFiles.inferCoder(ParquetIO.java:554)
at 
org.apache.beam.sdk.io.parquet.ParquetIO$ParseFiles.expand(ParquetIO.java:521)
at 
org.apache.beam.sdk.io.parquet.ParquetIO$ParseFiles.expand(ParquetIO.java:483)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:547)

However, the ParquetIO builder does not have this withCoder() method. I think 
this error message is mimicking AvroIO: 
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroIO.java#L1010

Should we add this method to ParquetIO? Thanks!


How to specify "fs.s3.enableServerSideEncryption" in Beam

2021-02-24 Thread Tao Li
Hi Beam community,

We need to specify the "fs.s3.enableServerSideEncryption" setting when saving 
parquet files to s3. This doc describes the setting: 
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-emrfs-encryption.html

What would be the recommended way to set that in Beam? Please advise. Thanks!


Re: How to specify "fs.s3.enableServerSideEncryption" in Beam

2021-02-24 Thread Tao Li
Just found this 
https://github.com/apache/beam/blob/master/sdks/java/io/amazon-web-services/src/main/java/org/apache/beam/sdk/io/aws/options/S3Options.java#L70

Is this the right approach? Thanks!
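If it is, a minimal sketch of wiring it up might look like the following (an 
assumption on my part: the SSEAlgorithm option in S3Options exposes a 
setSSEAlgorithm setter, and "AES256" is only an example value):

import org.apache.beam.sdk.io.aws.options.S3Options;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class S3SseExample {
  public static void main(String[] args) {
    // Build pipeline options and request SSE-S3 (AES256) for S3 writes.
    // Equivalently, this could be passed on the command line as --SSEAlgorithm=AES256.
    S3Options options = PipelineOptionsFactory.fromArgs(args).as(S3Options.class);
    options.setSSEAlgorithm("AES256");
    // ... create the Pipeline with these options and write with FileIO/ParquetIO as usual.
  }
}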

From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Wednesday, February 24, 2021 at 2:28 PM
To: "user@beam.apache.org" 
Subject: How to specify "fs.s3.enableServerSideEncryption" in Beam

Hi Beam community,

We need to specify the "fs.s3.enableServerSideEncryption" setting when saving 
parquet files to s3. This doc describes this setting 
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-emrfs-encryption.html

What would be the recommended way to set that? Please advise. Thanks!


Re: Potential bug with BEAM-11460?

2021-02-25 Thread Tao Li
@Brian Hulette<mailto:bhule...@google.com> I think the main issue I am trying 
to report is that I see the error message “Specify it explicitly using 
withCoder().”, but there is no withCoder() API available on ParquetIO. So 
maybe we need to add that method.
Getting back to your ask, here is roughly the code I was running. Hope this 
helps.
PCollection<Row> inputDataTest =
    pipeline.apply(ParquetIO.parseGenericRecords(new SerializableFunction<GenericRecord, Row>() {
        public Row apply(GenericRecord record) {
            return AvroUtils.toBeamRowStrict(record, null);
        }
    })
    .from(path));
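For reference, a minimal sketch of the interim workaround of setting the coder 
explicitly on the output PCollection (the class and method names are 
placeholders, and it assumes the Row schema is known up front):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.RowCoder;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.schemas.utils.AvroUtils;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

class ParseToRowsWorkaround {
  // Parses parquet files matching 'path' into Rows, declaring the coder explicitly
  // instead of relying on coder inference for the parseFn output.
  static PCollection<Row> parseToRows(Pipeline pipeline, Schema avroSchema, String path) {
    org.apache.beam.sdk.schemas.Schema beamSchema = AvroUtils.toBeamSchema(avroSchema);
    return pipeline
        .apply(ParquetIO.parseGenericRecords(
                (GenericRecord record) -> AvroUtils.toBeamRowStrict(record, beamSchema))
            .from(path))
        .setCoder(RowCoder.of(beamSchema));
  }
}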





From: Brian Hulette 
Reply-To: "user@beam.apache.org" 
Date: Thursday, February 25, 2021 at 3:11 PM
To: Anant Damle 
Cc: user 
Subject: Re: Potential bug with BEAM-11460?

Hi Tao,
Thanks for reporting this! Could you share more details about your use-case, 
Anant mentioned that he's having trouble coming up with a test case where 
inferCoder doesn't work [1].

Brian

[1] 
https://github.com/apache/beam/pull/14078#issuecomment-786293576

On Wed, Feb 24, 2021 at 6:49 PM Anant Damle 
mailto:ana...@google.com>> wrote:
Hi Brian,
I think you are right. Created 
BEAM-11861<https://issues.apache.org/jira/browse/BEAM-11861>, 
will send a PR today.
Present workaround is to provide .setCoder directly on the Output PCollection.

On Thu, Feb 25, 2021 at 5:25 AM Brian Hulette 
mailto:bhule...@google.com>> wrote:
+Anant Damle<mailto:ana...@google.com> is this an oversight in 
https://github.com/apache/beam/pull/13616? 
What would be the right way to fix this?

On Tue, Feb 23, 2021 at 5:24 PM Tao Li 
mailto:t...@zillow.com>> wrote:
Hi Beam community,

I cannot log into Beam jira so I am asking this question here. I am testing 
this new feature from Beam 2.28 and see below error:

Exception in thread "main" java.lang.IllegalArgumentException: Unable to infer 
coder for output of parseFn. Specify it explicitly using withCoder().
at 
org.apache.beam.sdk.io.parquet.ParquetIO$ParseFiles.inferCoder(ParquetIO.java:554)
at 
org.apache.beam.sdk.io.parquet.ParquetIO$ParseFiles.expand(ParquetIO.java:521)
at 
org.apache.beam.sdk.io.parquet.ParquetIO$ParseFiles.expand(ParquetIO.java:483)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:547)

However ParquetIO builder does not have this withCoder() method. I think this 
error message is mimicking AvroIO: 
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroIO.java#L1010

Should we add this method to ParquetIO? Thanks!


Re: Potential bug with BEAM-11460?

2021-02-26 Thread Tao Li
Thanks @Anant Damle<mailto:ana...@google.com> for fixing the issues with 
BEAM-11460 and BEAM-11527 so quickly!

From: Anant Damle 
Date: Friday, February 26, 2021 at 6:49 AM
To: Tao Li 
Cc: "user@beam.apache.org" , Brian Hulette 

Subject: Re: Potential bug with BEAM-11460?

@Tao Li, I have added a unit test for your use-case in this 
commit<https://github.com/apache/beam/pull/14078/commits/f5459bb3533194de48712229957a555ef79f17ef>.

On Fri, Feb 26, 2021 at 10:13 PM Anant Damle 
mailto:ana...@google.com>> wrote:
Thanks Tao,
Let me try and put this as a test-case.
I am also looking into 
BEAM-11527<https://issues.apache.org/jira/browse/BEAM-11527>.

Thanks,
Anant

On Fri, Feb 26, 2021 at 9:30 AM Tao Li 
mailto:t...@zillow.com>> wrote:
@Brian Hulette<mailto:bhule...@google.com> I think the main issue I am trying 
to reporting is that I see this error message “Specify it explicitly using 
withCoder().” But I did not find withCoder() API available from ParquetIO. So 
maybe we need to add that method.
Getting back to your ask, here is roughly the code I was running. Hope this 
helps.
PCollection<Row> inputDataTest =
    pipeline.apply(ParquetIO.parseGenericRecords(new SerializableFunction<GenericRecord, Row>() {
        public Row apply(GenericRecord record) {
            return AvroUtils.toBeamRowStrict(record, null);
        }
    })
    .from(path));





From: Brian Hulette mailto:bhule...@google.com>>
Reply-To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Date: Thursday, February 25, 2021 at 3:11 PM
To: Anant Damle mailto:ana...@google.com>>
Cc: user mailto:user@beam.apache.org>>
Subject: Re: Potential bug with BEAM-11460?

Hi Tao,
Thanks for reporting this! Could you share more details about your use-case, 
Anant mentioned that he's having trouble coming up with a test case where 
inferCoder doesn't work [1].

Brian

[1] 
https://github.com/apache/beam/pull/14078#issuecomment-786293576

On Wed, Feb 24, 2021 at 6:49 PM Anant Damle 
mailto:ana...@google.com>> wrote:
Hi Brian,
I think you are right. Created 
BEAM-11861<https://issues.apache.org/jira/browse/BEAM-11861>, 
will send a PR today.
Present workaround is to provide .setCoder directly on the Output PCollection.

On Thu, Feb 25, 2021 at 5:25 AM Brian Hulette 
mailto:bhule...@google.com>> wrote:
+Anant Damle<mailto:ana...@google.com> is this an oversight in 
https://github.com/apache/beam/pull/13616? 
What would be the right way to fix this?

On Tue, Feb 23, 2021 at 5:24 PM Tao Li 
mailto:t...@zillow.com>> wrote:
Hi Beam community,

I cannot log into Beam jira so I am asking this question here. I am testing 
this new feature from Beam 2.28 and see below error:

Exception in thread "main" java.lang.IllegalArgumentException: 

Re: Potential bug with ParquetIO.read when reading arrays

2021-02-26 Thread Tao Li
Hi all,

Just a quick update. With BEAM-11527 (from the Beam 2.28 release), I am now 
able to specify the "parquet.avro.add-list-element-records" setting to address 
this interoperability issue when using beam to read spark-created files. 
Details are tracked in BEAM-4587.

@Anant Damle<mailto:ana...@google.com> @Brian 
Hulette<mailto:bhule...@google.com> thanks so much for your support and help!
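For anyone following along, a rough sketch of what this can look like; it 
assumes the BEAM-11527 hook is the withConfiguration(...) option on ParquetIO's 
read builder, and the schema and path values are placeholders. With the setting 
applied, the Avro schema passed to read() can describe the field as a plain 
array of ints instead of the "list"/"element" wrapper:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.conf.Configuration;

class SparkListCompatRead {
  // Sketch: read Spark-written parquet without surfacing the 3-level list group
  // as an extra "list"/"element" record wrapper in the resulting Avro records.
  static PCollection<GenericRecord> read(Pipeline pipeline, Schema avroSchema, String path) {
    Configuration hadoopConf = new Configuration();
    hadoopConf.set("parquet.avro.add-list-element-records", "false");
    return pipeline.apply(
        ParquetIO.read(avroSchema)
            .from(path)
            // Assumed BEAM-11527 API: forward Hadoop/parquet-avro settings to the reader.
            .withConfiguration(hadoopConf));
  }
}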

From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Wednesday, February 3, 2021 at 10:51 PM
To: "user@beam.apache.org" 
Subject: Re: Potential bug with ParquetIO.read when reading arrays

I am also wondering if leveraging this parquet setting 
"parquet.avro.add-list-element-records" along with 
BEAM-11527<https://issues.apache.org/jira/browse/BEAM-11527> can solve my 
problem...

From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Wednesday, February 3, 2021 at 11:55 AM
To: "user@beam.apache.org" 
Subject: Re: Potential bug with ParquetIO.read when reading arrays

Hi all,

Thanks for all the discussions so far (including discussions in BEAM-11721 and 
offline discussions). We will use BEAM-11650 to track the request of making 
avro schema optional for ParquetIO.read operation. I can potentially work on 
that ticket later.

There is another issue that I hope to get some help with from the beam 
community. I have posted this question on beam slack channels but I am 
duplicating it here for visibility. Basically I am using ParquetIO (which uses 
AvroParquetReader) to read spark created parquet files (please see attached). 
The inspection result is below. You can see the spark schema is very simple, 
which is just a field of an array of integers:

creator: parquet-mr version 1.10.1 (build 
815bcfa4a4aacf66d207b3dc692150d16b5740b9)
extra: org.apache.spark.sql.parquet.row.metadata = 
{"type":"struct","fields":[{"name":"numbers","type":
{"type":"array","elementType":"integer","containsNull":true}
,"nullable":true,"metadata":{}}]}
file schema: spark_schema

numbers: OPTIONAL F:1
.list: REPEATED F:1
..element: OPTIONAL INT32 R:1 D:3


When I use ParquetIO to read this file, the Avro schema for the 
PCollection becomes:
{
  "type": "record",
 "name": "spark_schema",
  "fields": [
{
  "name": "numbers",
  "type": [
"null",
{
  "type": "array",
  "items": {
"type": "record",
"name": "list",
"fields": [
  {
"name": "element",
"type": [
  "null",
  "int"
],
"default": null
  }
]
  }
}
  ],
  "default": null
}
  ]
}


You can see that the schema becomes an array of record type (which contains a 
"element" field). The reason is probably that internally spark parquet is 
defining a “list” record type.

The problem is that this avro schema is not the one I want deal with in the 
following beam transforms. Instead I want to retain the original schema defined 
in spark which is simply an array of integers. Is there an easy way to retain 
the original schema when using ParquetIO to read spark created fields? Did 
anyone run into this need? Please advise. Thanks!



From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Saturday, January 30, 2021 at 11:21 PM
To: "user@beam.apache.org" 
Subject: Re: Potential bug with ParquetIO.read when reading arrays

Please let me rephrase my question. It's understandable that it might be a good 
practice to specify avro schema when reading parquet files (e.g. to support 
schema evolution etc). But sometimes the overhead is more than the benefits. 
Given that AvroParquetReader does not require an avro schema, is it possible to 
make the avro schema specification optional for ParquetIO.read? Thanks!

From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Saturday, January 30, 2021 at 1:54 PM
To: "user@beam.apache.org" 
Subject: Re: Potential bug with ParquetIO.read when reading arrays

Hi all,

I have made a good progress to figure out the root cause to this issue. The 
details are in BEAM-11721. I asked some questions at the end of the jira and I 
am just duplicating it here for visibility. Thanks so much for the help and 
support from the community!


  1.  It's not quite intuitive to create a avro schema for ParquetIO, which 
contains spark defined fields ("list&

Regarding the over window query support from Beam SQL

2021-03-01 Thread Tao Li
Hi Beam community,

Querying over a window for ranking etc. is pretty common in SQL use cases. I 
have found this jira: https://issues.apache.org/jira/browse/BEAM-9198

Do we have a plan to support this? If there is no such plan in the near future, 
are Beam developers supposed to implement this functionality on their own (e.g. 
by using GroupBy)? Thanks!


Re: Regarding the over window query support from Beam SQL

2021-03-02 Thread Tao Li
+ Rui Wang. Looks like Rui has been working on this jira.


From: Tao Li 
Date: Monday, March 1, 2021 at 9:51 PM
To: "user@beam.apache.org" 
Subject: Regarding the over window query support from Beam SQL

Hi Beam community,

Querying over a window for ranking etc is pretty common in SQL use cases. I 
have found this jira https://issues.apache.org/jira/browse/BEAM-9198

Do we have a plan to support this? If there is no such plan in near future, are 
Beam developers supposed to implement this function on their own (e.g. by using 
GroupBy)? Thanks!


A problem with ZetaSQL

2021-03-02 Thread Tao Li
Hi all,

I was following the instructions from this doc to play with ZetaSQL: 
https://beam.apache.org/documentation/dsls/sql/overview/

The query is really simple:

options.as(BeamSqlPipelineOptions.class).setPlannerName("org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner")
input.apply(SqlTransform.query("SELECT * from PCOLLECTION"))

I am seeing this error with ZetaSQL:

Exception in thread "main" java.lang.UnsupportedOperationException: Unknown 
Calcite type: INTEGER
at 
org.apache.beam.sdk.extensions.sql.zetasql.ZetaSqlCalciteTranslationUtils.toZetaSqlType(ZetaSqlCalciteTranslationUtils.java:114)
at 
org.apache.beam.sdk.extensions.sql.zetasql.SqlAnalyzer.addFieldsToTable(SqlAnalyzer.java:359)
at 
org.apache.beam.sdk.extensions.sql.zetasql.SqlAnalyzer.addTableToLeafCatalog(SqlAnalyzer.java:350)
at 
org.apache.beam.sdk.extensions.sql.zetasql.SqlAnalyzer.lambda$createPopulatedCatalog$1(SqlAnalyzer.java:225)
at 
com.google.common.collect.ImmutableList.forEach(ImmutableList.java:406)
at 
org.apache.beam.sdk.extensions.sql.zetasql.SqlAnalyzer.createPopulatedCatalog(SqlAnalyzer.java:225)
at 
org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLPlannerImpl.rel(ZetaSQLPlannerImpl.java:102)
at 
org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner.convertToBeamRelInternal(ZetaSQLQueryPlanner.java:180)
at 
org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner.convertToBeamRel(ZetaSQLQueryPlanner.java:168)
at 
org.apache.beam.sdk.extensions.sql.impl.BeamSqlEnv.parseQuery(BeamSqlEnv.java:114)
at 
org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:140)
at 
org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:86)

This query works fine when using Calcite (by just removing the setPlannerName 
call). Am I missing anything here? For example, I am specifying 
'com.google.guava:guava:23.0' as a dependency.

Thanks!




Re: A problem with ZetaSQL

2021-03-02 Thread Tao Li
Hi Brian,

Here is my code to create the PCollection.

PCollection<FileIO.ReadableFile> files = pipeline
    .apply(FileIO.match().filepattern(path))
    .apply(FileIO.readMatches());

PCollection<Row> input = files
    .apply(ParquetIO.readFiles(avroSchema))
    .apply(MapElements
        .into(TypeDescriptors.rows())
        .via(AvroUtils.getGenericRecordToRowFunction(AvroUtils.toBeamSchema(avroSchema))))
    .setCoder(RowCoder.of(AvroUtils.toBeamSchema(avroSchema)));
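A possible workaround, under the assumption that the Beam ZetaSQL dialect only 
maps 64-bit integers (hence the unknown INTEGER/INT32 type), would be to widen 
the 32-bit fields to INT64 before applying SqlTransform. A rough sketch with a 
hypothetical single-field schema, where input is the PCollection built above:

import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.transforms.Cast;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

// Hypothetical: the original Beam schema has a single INT32 field named "foo".
// Cast.widening produces the same rows with "foo" widened to INT64, which the
// ZetaSQL planner can map.
Schema widened = Schema.builder().addInt64Field("foo").build();

PCollection<Row> zetaSqlReady = input.apply(Cast.widening(widened));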


From: Brian Hulette 
Reply-To: "user@beam.apache.org" 
Date: Tuesday, March 2, 2021 at 10:31 AM
To: user 
Subject: Re: A problem with ZetaSQL

Thanks for reporting this Tao - could you share what the type of your input 
PCollection is?

On Tue, Mar 2, 2021 at 9:33 AM Tao Li mailto:t...@zillow.com>> 
wrote:
Hi all,

I was following the instructions from this doc to play with ZetaSQL  
https://beam.apache.org/documentation/dsls/sql/overview/

The query is really simple:

options.as(BeamSqlPipelineOptions.class).setPlannerName("org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner")
input.apply(SqlTransform.query("SELECT * from PCOLLECTION"))

I am seeing this error with ZetaSQL  :

Exception in thread "main" java.lang.UnsupportedOperationException: Unknown 
Calcite type: INTEGER
at 
org.apache.beam.sdk.extensions.sql.zetasql.ZetaSqlCalciteTranslationUtils.toZetaSqlType(ZetaSqlCalciteTranslationUtils.java:114)
at 
org.apache.beam.sdk.extensions.sql.zetasql.SqlAnalyzer.addFieldsToTable(SqlAnalyzer.java:359)
at 
org.apache.beam.sdk.extensions.sql.zetasql.SqlAnalyzer.addTableToLeafCatalog(SqlAnalyzer.java:350)
at 
org.apache.beam.sdk.extensions.sql.zetasql.SqlAnalyzer.lambda$createPopulatedCatalog$1(SqlAnalyzer.java:225)
at 
com.google.common.collect.ImmutableList.forEach(ImmutableList.java:406)
at 
org.apache.beam.sdk.extensions.sql.zetasql.SqlAnalyzer.createPopulatedCatalog(SqlAnalyzer.java:225)
at 
org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLPlannerImpl.rel(ZetaSQLPlannerImpl.java:102)
at 
org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner.convertToBeamRelInternal(ZetaSQLQueryPlanner.java:180)
at 
org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner.convertToBeamRel(ZetaSQLQueryPlanner.java:168)
at 
org.apache.beam.sdk.extensions.sql.impl.BeamSqlEnv.parseQuery(BeamSqlEnv.java:114)
at 
org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:140)
at 
org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:86)

This query works fine when using Calcite (by just removing setPlannerName 
call). Am I missing anything here? For example I am specifying 
'com.google.guava:guava:23.0' as the dependency.

Thanks!




Re: Regarding the over window query support from Beam SQL

2021-03-02 Thread Tao Li
Hi Rui,

Thanks for this info. It’s good to know we already support window functions. 
But I still have a problem with the schema of the query result.

This is my code (with Beam 2.28):

Schema appSchema = Schema
    .builder()
    .addInt32Field("foo")
    .addInt32Field("bar")
    .build();

Row rowOne = Row.withSchema(appSchema).addValues(1, 1).build();
Row rowTwo = Row.withSchema(appSchema).addValues(1, 2).build();

PCollection<Row> inputRows = executionContext.getPipeline()
    .apply(Create.of(rowOne, rowTwo))
    .setRowSchema(appSchema);

String sql = "SELECT foo, bar, RANK() over (PARTITION BY foo ORDER BY bar) AS agg FROM PCOLLECTION";
PCollection<Row> result = inputRows.apply("sql", SqlTransform.query(sql));

I can see the expected data in the result, but I don’t see the “agg” column in 
the schema. Do you have any ideas regarding this issue? Thanks!


The Beam schema of the result is:

Field{name=foo, description=, type=FieldType{typeName=INT32, nullable=false, 
logicalType=null, collectionElementType=null, mapKeyType=null, 
mapValueType=null, rowSchema=null, metadata={}}, options={{}}}
Field{name=bar, description=, type=FieldType{typeName=INT32, nullable=false, 
logicalType=null, collectionElementType=null, mapKeyType=null, 
mapValueType=null, rowSchema=null, metadata={}}, options={{}}}
Field{name=w0$o0, description=, type=FieldType{typeName=INT64, nullable=false, 
logicalType=null, collectionElementType=null, mapKeyType=null, 
mapValueType=null, rowSchema=null, metadata={}}, options={{}}}


Here are some detailed logs if they are helpful:

[main] INFO org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner - SQL:
SELECT `PCOLLECTION`.`foo`, `PCOLLECTION`.`bar`, RANK() OVER (PARTITION BY 
`PCOLLECTION`.`foo` ORDER BY `PCOLLECTION`.`bar`) AS `agg`
FROM `beam`.`PCOLLECTION` AS `PCOLLECTION`
[main] INFO org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner - 
SQLPlan>
LogicalProject(foo=[$0], bar=[$1], agg=[RANK() OVER (PARTITION BY $0 ORDER BY 
$1 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)])
  BeamIOSourceRel(table=[[beam, PCOLLECTION]])

[main] INFO org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner - 
BEAMPlan>
BeamWindowRel(window#0=[window(partition {0} order by [1] range between 
UNBOUNDED PRECEDING and CURRENT ROW aggs [RANK()])])
  BeamIOSourceRel(table=[[beam, PCOLLECTION]])
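For completeness, one way to get the intended column name back, assuming the 
generated "w0$o0" name (which is implementation-specific) can be referenced 
directly, would be to rename the field after the SQL step:

import org.apache.beam.sdk.schemas.transforms.RenameFields;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

// Rename the generated analytic-function column back to the alias used in the query.
PCollection<Row> renamed =
    result.apply(RenameFields.<Row>create().rename("w0$o0", "agg"));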









From: Rui Wang 
Date: Tuesday, March 2, 2021 at 10:43 AM
To: Tao Li 
Cc: "user@beam.apache.org" 
Subject: Re: Regarding the over window query support from Beam SQL

Hi Tao,

[1] shows which functions work with the OVER clause. RANK is one of the 
supported functions. Can you take a look?


[1]: 
https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamAnalyticFunctionsTest.java

-Rui

On Tue, Mar 2, 2021 at 9:24 AM Tao Li mailto:t...@zillow.com>> 
wrote:
+ Rui Wang. Looks like Rui has been working on this jira.


From: Tao Li mailto:t...@zillow.com>>
Date: Monday, March 1, 2021 at 9:51 PM
To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Subject: Regarding the over window query support from Beam SQL

Hi Beam community,

Querying over a window for ranking etc is pretty common in SQL use cases. I 
have found this jira 
https://issues.apache.org/jira/browse/BEAM-9198

Do we have a plan to support this? If there is no such plan in near future, are 
Beam developers supposed to implement this function on their own (e.g. by using 
GroupBy)? Thanks!


Does writeDynamic() support writing different element groups to different output paths?

2021-03-03 Thread Tao Li
Hi Beam community,

I have a streaming app that writes every hour’s data to a folder named after 
that hour. With Flink (for example), we can leverage the “Bucketing File Sink”: 
https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/connectors/filesystem_sink.html

However, I am not seeing that Beam FileIO’s writeDynamic API supports specifying 
different output paths for different groups: 
https://beam.apache.org/releases/javadoc/2.28.0/index.html?org/apache/beam/sdk/io/FileIO.html

It seems like writeDynamic() only supports specifying a different naming 
strategy per group.

How can I specify hour-based output paths for hourly data with Beam 
writeDynamic? Please advise. Thanks!




Re: Does writeDynamic() support writing different element groups to different output paths?

2021-03-04 Thread Tao Li
Thanks Kobe let me give it a try!

From: Kobe Feng 
Reply-To: "user@beam.apache.org" 
Date: Wednesday, March 3, 2021 at 9:33 PM
To: "user@beam.apache.org" 
Cc: Yuchu Cao 
Subject: Re: Does writeDynamic() support writing different element groups to 
different output paths?

I used the following approach a long time ago for writing into partitions in 
HDFS (others may have better solutions), and I'm not sure whether any of the 
interfaces have changed since, which you'll need to check:

val baseDir = HadoopClient.resolve(basePath, env)
datum.apply("darwin.write.hadoop.parquet." + postfix, 
FileIO.writeDynamic[String, GenericRecord]()
  .by(recordPartition.partitionFunc)
  .withDestinationCoder(StringUtf8Coder.of())
  .via(DarwinParquetIO.sink(...)
  .to(baseDir)
   ...
  .withNaming((partitionFolder: String) => 
relativeFileNaming(StaticValueProvider.of[String](baseDir + Path.SEPARATOR + 
partitionFolder), fileNaming))
   ...

val partitionFunc: T => String



the good practice is auto-switch: using event time field from record value for 
partitioning when event time window, or process time.

and partitionFunc could consider multi partition columns to get subdirectories 
base on ur file system path separator, e.g. S3.
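
As a concrete sketch of the hourly case from the original question (untested; it assumes a 
PCollection<GenericRecord> named records, an Avro schema avroSchema, an output root 
outputPath, and a hypothetical hourFor() helper that formats an element's hour, e.g. 
"2021-03-03-17"):

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.ValueProvider;

// Destination = the hour string; each hour's records land under outputPath/<hour>/.
records.apply(FileIO.<String, GenericRecord>writeDynamic()
    .by(record -> hourFor(record))                 // hypothetical hour extractor
    .withDestinationCoder(StringUtf8Coder.of())
    .via(ParquetIO.sink(avroSchema))
    .to(outputPath)
    .withNumShards(1)
    .withNaming(hour -> FileIO.Write.relativeFileNaming(
        ValueProvider.StaticValueProvider.of(outputPath + "/" + hour),
        FileIO.Write.defaultNaming("part", ".parquet"))));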

On Wed, Mar 3, 2021 at 5:36 PM Tao Li mailto:t...@zillow.com>> 
wrote:
Hi Beam community,

I have a streaming app that writes every hour’s data to a folder named with 
this hour. With Flink (for example), we can leverage “Bucketing File Sink”: 
https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/connectors/filesystem_sink.html

However I am not seeing Beam FileIO’s writeDynamic API supports specifying 
different output paths for different groups: 
https://beam.apache.org/releases/javadoc/2.28.0/index.html?org/apache/beam/sdk/io/FileIO.html

Seems like writeDynamic() only supports specifying different naming strategy.

How can I specify different hourly based output paths for hourly data with Beam 
writeDynamic? Please advise. Thanks!




--
Yours Sincerely
Kobe Feng


Re: Does writeDynamic() support writing different element groups to different output paths?

2021-03-04 Thread Tao Li
I tried below code:

inputData.apply(FileIO.writeDynamic()
.by(record -> "test")
.via(ParquetIO.sink(inputAvroSchema))
.to(outputPath)
.withNaming(new SimpleFunction() {
@Override
public FileIO.Write.FileNaming apply(String input) {
return  FileIO.Write.relativeFileNaming(
ValueProvider.StaticValueProvider.of(outputPath 
+ "/" + input), naming);
}
})
.withDestinationCoder(StringUtf8Coder.of()));

Exception in thread "main" java.lang.IllegalArgumentException: unable to 
deserialize FileBasedSink
at 
org.apache.beam.sdk.util.SerializableUtils.deserializeFromByteArray(SerializableUtils.java:78)
at 
org.apache.beam.repackaged.direct_java.runners.core.construction.WriteFilesTranslation.sinkFromProto(WriteFilesTranslation.java:125)
at 
org.apache.beam.repackaged.direct_java.runners.core.construction.WriteFilesTranslation.getSink(WriteFilesTranslation.java:137)
at 
org.apache.beam.runners.direct.WriteWithShardingFactory.getReplacementTransform(WriteWithShardingFactory.java:69)
at 
org.apache.beam.sdk.Pipeline.applyReplacement(Pipeline.java:564)
at org.apache.beam.sdk.Pipeline.replace(Pipeline.java:299)

When I switch to use write() API as below, it works fine. Does anyone have any 
ideas? Thanks!

inputData.apply(FileIO.write()
.withNumShards(10)
.via(ParquetIO.sink(inputAvroSchema))
.to(outputPath)
.withSuffix(".parquet"));


From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Thursday, March 4, 2021 at 9:36 AM
To: "user@beam.apache.org" , Kobe Feng 

Cc: Yuchu Cao 
Subject: Re: Does writeDynamic() support writing different element groups to 
different output paths?

Thanks Kobe let me give it a try!

From: Kobe Feng 
Reply-To: "user@beam.apache.org" 
Date: Wednesday, March 3, 2021 at 9:33 PM
To: "user@beam.apache.org" 
Cc: Yuchu Cao 
Subject: Re: Does writeDynamic() support writing different element groups to 
different output paths?

I used the following way long time ago for writing into partitions in hdfs 
(maybe better solutions from others), and not sure any interface change which 
you need to check:

val baseDir = HadoopClient.resolve(basePath, env)
datum.apply("darwin.write.hadoop.parquet." + postfix, 
FileIO.writeDynamic[String, GenericRecord]()
  .by(recordPartition.partitionFunc)
  .withDestinationCoder(StringUtf8Coder.of())
  .via(DarwinParquetIO.sink(...)
  .to(baseDir)
   ...
  .withNaming((partitionFolder: String) => 
relativeFileNaming(StaticValueProvider.of[String](baseDir + Path.SEPARATOR + 
partitionFolder), fileNaming))
   ...

val partitionFunc: T => String



the good practice is auto-switch: using event time field from record value for 
partitioning when event time window, or process time.

and partitionFunc could consider multi partition columns to get subdirectories 
base on ur file system path separator, e.g. S3.

On Wed, Mar 3, 2021 at 5:36 PM Tao Li mailto:t...@zillow.com>> 
wrote:
Hi Beam community,

I have a streaming app that writes every hour’s data to a folder named with 
this hour. With Flink (for example), we can leverage “Bucketing File Sink”: 
https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/connectors/filesystem_sink.html

However I am not seeing Beam FileIO’s writeDynamic API supports specifying 
different output paths for different groups: 
https://beam.apache.org/releases/javadoc/2.28.0/index.html?org/apache/beam/sdk/io/FileIO.html

Seems like writeDynamic() only supports specifying different naming strategy.

How can I specify different hourly based output paths for hourly data with Beam 
writeDynamic? Please advise. Thanks!




--
Yours Sincerely
Kobe Feng


Re: Does writeDynamic() support writing different element groups to different output paths?

2021-03-04 Thread Tao Li
I was able to resolve “unable to deserialize FileBasedSink” error by adding 
withNumShards().

inputData.apply(FileIO.writeDynamic()
.by(record -> "test")
.withDestinationCoder(StringUtf8Coder.of())
.via(ParquetIO.sink(inputAvroSchema))
.to(outputPath)
.withNumShards(10)
.withNaming(new SimpleFunction() {
@Override
public FileIO.Write.FileNaming apply(String input) {
return  FileIO.Write.relativeFileNaming(
ValueProvider.StaticValueProvider.of(outputPath 
+ "/" + input), naming);
}
}));

Now I am seeing a new error as below. Is this related to 
https://issues.apache.org/jira/browse/BEAM-9868? I don’t quite understand what 
this error means. Please advise.

Exception in thread "main" java.lang.IllegalArgumentException: unable to 
deserialize Custom DoFn With Execution Info
at 
org.apache.beam.sdk.util.SerializableUtils.deserializeFromByteArray(SerializableUtils.java:78)
at 
org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.doFnWithExecutionInformationFromProto(ParDoTranslation.java:709)
at 
org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.getSchemaInformation(ParDoTranslation.java:392)
at 
org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.getSchemaInformation(ParDoTranslation.java:377)
at 
org.apache.beam.runners.direct.ParDoEvaluatorFactory.forApplication(ParDoEvaluatorFactory.java:87)

From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Thursday, March 4, 2021 at 11:52 AM
To: "user@beam.apache.org" , Kobe Feng 

Cc: Yuchu Cao 
Subject: Re: Does writeDynamic() support writing different element groups to 
different output paths?

I tried below code:

inputData.apply(FileIO.writeDynamic()
.by(record -> "test")
.via(ParquetIO.sink(inputAvroSchema))
.to(outputPath)
.withNaming(new SimpleFunction() {
@Override
public FileIO.Write.FileNaming apply(String input) {
return  FileIO.Write.relativeFileNaming(
ValueProvider.StaticValueProvider.of(outputPath 
+ "/" + input), naming);
}
})
.withDestinationCoder(StringUtf8Coder.of()));

Exception in thread "main" java.lang.IllegalArgumentException: unable to 
deserialize FileBasedSink
at 
org.apache.beam.sdk.util.SerializableUtils.deserializeFromByteArray(SerializableUtils.java:78)
at 
org.apache.beam.repackaged.direct_java.runners.core.construction.WriteFilesTranslation.sinkFromProto(WriteFilesTranslation.java:125)
at 
org.apache.beam.repackaged.direct_java.runners.core.construction.WriteFilesTranslation.getSink(WriteFilesTranslation.java:137)
at 
org.apache.beam.runners.direct.WriteWithShardingFactory.getReplacementTransform(WriteWithShardingFactory.java:69)
at 
org.apache.beam.sdk.Pipeline.applyReplacement(Pipeline.java:564)
at org.apache.beam.sdk.Pipeline.replace(Pipeline.java:299)

When I switch to use write() API as below, it works fine. Does anyone have any 
ideas? Thanks!

inputData.apply(FileIO.write()
.withNumShards(10)
.via(ParquetIO.sink(inputAvroSchema))
    .to(outputPath)
.withSuffix(".parquet"));


From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Thursday, March 4, 2021 at 9:36 AM
To: "user@beam.apache.org" , Kobe Feng 

Cc: Yuchu Cao 
Subject: Re: Does writeDynamic() support writing different element groups to 
different output paths?

Thanks Kobe let me give it a try!

From: Kobe Feng 
Reply-To: "user@beam.apache.org" 
Date: Wednesday, March 3, 2021 at 9:33 PM
To: "user@beam.apache.org" 
Cc: Yuchu Cao 
Subject: Re: Does writeDynamic() support writing different element groups to 
different output paths?

I used the following way long time ago for writing into partitions in hdfs 
(maybe better solutions from others), and not sure any interface change which 
you need to check:

val baseDir = HadoopClient.resolve(basePath, env)
datum.apply("darwin.write.hadoop.parquet." + postfix, 
FileIO.writeDynamic[String, GenericRecord]()
  .by(recordPartition.partitionFunc)
  .withDestinationCoder(StringUtf8Coder.of())
  .via(DarwinParquetIO.sink(...)
  .to(baseDir)
   ...
  .withNaming((partitionFolder: String) => 
relativeFileNaming(StaticValueProvider.of[String](baseDir + Path.SEPARATOR + 
partitionFolder), fileNaming))
   ...

val partitionFunc: T => String

Re: A problem with ZetaSQL

2021-03-04 Thread Tao Li
Brian the schema is really simple. Just 3 primitive type columns:

root
|-- column_1: integer (nullable = true)
|-- column_2: integer (nullable = true)
|-- column_3: string (nullable = true)


From: Brian Hulette 
Date: Thursday, March 4, 2021 at 2:29 PM
To: Tao Li 
Cc: "user@beam.apache.org" 
Subject: Re: A problem with ZetaSQL

Thanks, It would also be helpful to know what avroSchema is, or at least the 
types of its fields, so we can understand what the schema of the PCollection is.

On Tue, Mar 2, 2021 at 11:00 AM Tao Li 
mailto:t...@zillow.com>> wrote:
Hi Brian,

Here is my code to create the PCollection.

PCollection files = pipeline
.apply(FileIO.match().filepattern(path))
.apply(FileIO.readMatches());

PCollection input =  files
.apply(ParquetIO.readFiles(avroSchema))
.apply(MapElements
.into(TypeDescriptors.rows())

.via(AvroUtils.getGenericRecordToRowFunction(AvroUtils.toBeamSchema(avroSchema))))
.setCoder(RowCoder.of(AvroUtils.toBeamSchema(avroSchema)));


From: Brian Hulette mailto:bhule...@google.com>>
Reply-To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Date: Tuesday, March 2, 2021 at 10:31 AM
To: user mailto:user@beam.apache.org>>
Subject: Re: A problem with ZetaSQL

Thanks for reporting this Tao - could you share what the type of your input 
PCollection is?

On Tue, Mar 2, 2021 at 9:33 AM Tao Li mailto:t...@zillow.com>> 
wrote:
Hi all,

I was following the instructions from this doc to play with ZetaSQL  
https://beam.apache.org/documentation/dsls/sql/overview/

The query is really simple:

options.as(BeamSqlPipelineOptions.class).setPlannerName("org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner")
input.apply(SqlTransform.query("SELECT * from PCOLLECTION"))

I am seeing this error with ZetaSQL  :

Exception in thread "main" java.lang.UnsupportedOperationException: Unknown 
Calcite type: INTEGER
at 
org.apache.beam.sdk.extensions.sql.zetasql.ZetaSqlCalciteTranslationUtils.toZetaSqlType(ZetaSqlCalciteTranslationUtils.java:114)
at 
org.apache.beam.sdk.extensions.sql.zetasql.SqlAnalyzer.addFieldsToTable(SqlAnalyzer.java:359)
at 
org.apache.beam.sdk.extensions.sql.zetasql.SqlAnalyzer.addTableToLeafCatalog(SqlAnalyzer.java:350)
at 
org.apache.beam.sdk.extensions.sql.zetasql.SqlAnalyzer.lambda$createPopulatedCatalog$1(SqlAnalyzer.java:225)
at 
com.google.common.collect.ImmutableList.forEach(ImmutableList.java:406)
at 
org.apache.beam.sdk.extensions.sql.zetasql.SqlAnalyzer.createPopulatedCatalog(SqlAnalyzer.java:225)
at 
org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLPlannerImpl.rel(ZetaSQLPlannerImpl.java:102)
at 
org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner.convertToBeamRelInternal(ZetaSQLQueryPlanner.java:180)
at 
org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner.convertToBeamRel(ZetaSQLQueryPlanner.java:168)
at 
org.apache.beam.sdk.extensions.sql.impl.BeamSqlEnv.parseQuery(BeamSqlEnv.java:114)
at 
org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:140)
at 
org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:86)

This query works fine when using Calcite (by just removing setPlannerName 
call). Am I missing anything here? For example I am specifying 
'com.google.guava:guava:23.0' as the dependency.

Thanks!




Re: A problem with ZetaSQL

2021-03-05 Thread Tao Li
Robin/Brian,

I see. Thanks so much for your help!

From: Robin Qiu 
Date: Friday, March 5, 2021 at 12:31 AM
To: Brian Hulette 
Cc: Tao Li , "user@beam.apache.org" 
Subject: Re: A problem with ZetaSQL

Hi Tao,

In ZetaSQL all "integers" are 64 bits. So if your integers in column 1 and 2 
are 32-bit it won't work. In terms of Beam schema it corresponds to INT64 type.

Best,
Robin
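
For anyone hitting the same "Unknown Calcite type: INTEGER" error, a minimal sketch of the
fix (assuming you control the schema; the underlying Avro fields and Row values would need
to be longs as well) is to model the integer columns as INT64:

import org.apache.beam.sdk.schemas.Schema;

// ZetaSQL only understands 64-bit integers, so declare integer columns as INT64.
Schema zetaSqlFriendlySchema = Schema.builder()
    .addNullableField("column_1", Schema.FieldType.INT64)
    .addNullableField("column_2", Schema.FieldType.INT64)
    .addNullableField("column_3", Schema.FieldType.STRING)
    .build();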

On Thu, Mar 4, 2021 at 6:07 PM Brian Hulette 
mailto:bhule...@google.com>> wrote:
Ah, I suspect this is because our ZetaSQL planner only supports 64 bit integers 
(see 
https://beam.apache.org/documentation/dsls/sql/zetasql/data-types/#integer-type).
 +Robin Qiu<mailto:robi...@google.com> maybe we should have a better error 
message for this?

On Thu, Mar 4, 2021 at 5:24 PM Tao Li mailto:t...@zillow.com>> 
wrote:
Brian the schema is really simple. Just 3 primitive type columns:

root
|-- column_1: integer (nullable = true)
|-- column_2: integer (nullable = true)
|-- column_3: string (nullable = true)


From: Brian Hulette mailto:bhule...@google.com>>
Date: Thursday, March 4, 2021 at 2:29 PM
To: Tao Li mailto:t...@zillow.com>>
Cc: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Subject: Re: A problem with ZetaSQL

Thanks, It would also be helpful to know what avroSchema is, or at least the 
types of its fields, so we can understand what the schema of the PCollection is.

On Tue, Mar 2, 2021 at 11:00 AM Tao Li 
mailto:t...@zillow.com>> wrote:
Hi Brian,

Here is my code to create the PCollection.

PCollection files = pipeline
.apply(FileIO.match().filepattern(path))
.apply(FileIO.readMatches());

PCollection input =  files
.apply(ParquetIO.readFiles(avroSchema))
.apply(MapElements
.into(TypeDescriptors.rows())

.via(AvroUtils.getGenericRecordToRowFunction(AvroUtils.toBeamSchema(avroSchema))))
.setCoder(RowCoder.of(AvroUtils.toBeamSchema(avroSchema)));


From: Brian Hulette mailto:bhule...@google.com>>
Reply-To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Date: Tuesday, March 2, 2021 at 10:31 AM
To: user mailto:user@beam.apache.org>>
Subject: Re: A problem with ZetaSQL

Thanks for reporting this Tao - could you share what the type of your input 
PCollection is?

On Tue, Mar 2, 2021 at 9:33 AM Tao Li mailto:t...@zillow.com>> 
wrote:
Hi all,

I was following the instructions from this doc to play with ZetaSQL  
https://beam.apache.org/documentation/dsls/sql/overview/

The query is really simple:

options.as(BeamSqlPipelineOptions.class).setPlannerName("org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner")
input.apply(SqlTransform.query("SELECT * from PCOLLECTION"))

I am seeing this error with ZetaSQL  :

Exception in thread "main" java.lang.UnsupportedOperationException: Unknown 
Calcite type: INTEGER
at 
org.apache.beam.sdk.extensions.sql.zetasql.ZetaSqlCalciteTranslationUtils.toZetaSqlType(ZetaSqlCalciteTranslationUtils.java:114)
at 
org.apache.beam.sdk.extensions.sql.zetasql.SqlAnalyzer.addFieldsToTable(SqlAnalyzer.java:359)
at 
org.apache.beam.sdk.extensions.sql.zetasql.SqlAnalyzer.addTableToLeafCatalog(SqlAnalyzer.java:350)
at 
org.apache.beam.sdk.extensions.sql.zetasql.SqlAnalyzer.lambda$createPopulatedCatalog$1(SqlAnalyzer.java:225)
at 
com.google.common.collect.ImmutableList.forEach(ImmutableList.java:406

Re: Regarding the over window query support from Beam SQL

2021-03-05 Thread Tao Li
Hi Rui,

Just following up on this issue. Do you think this is a bug? Is there a 
workaround? Thanks!

From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Tuesday, March 2, 2021 at 3:37 PM
To: Rui Wang 
Cc: "user@beam.apache.org" 
Subject: Re: Regarding the over window query support from Beam SQL

Hi Rui,

Thanks for this info. It’s good to know we are already supporting the window 
function. But I still have a problem with the schema of the query result.

This is my code (with Beam 2.28):

Schema appSchema = Schema
.builder()
.addInt32Field("foo")
.addInt32Field("bar")
.build();

Row rowOne = Row.withSchema(appSchema).addValues(1, 1).build();
Row rowTwo = Row.withSchema(appSchema).addValues(1, 2).build();

PCollection inputRows = executionContext.getPipeline()
.apply(Create.of(rowOne, rowTwo))
.setRowSchema(appSchema);

String sql = "SELECT foo, bar, RANK() over (PARTITION BY foo ORDER BY 
bar) AS agg FROM PCOLLECTION";
PCollection result  = inputRows.apply("sql", 
SqlTransform.query(sql));

I can see the expected data from result, but I don’t see “agg” column in the 
schema. Do you have any ideas regarding this issue? Thanks!


The Beam schema of the result is:

Field{name=foo, description=, type=FieldType{typeName=INT32, nullable=false, 
logicalType=null, collectionElementType=null, mapKeyType=null, 
mapValueType=null, rowSchema=null, metadata={}}, options={{}}}
Field{name=bar, description=, type=FieldType{typeName=INT32, nullable=false, 
logicalType=null, collectionElementType=null, mapKeyType=null, 
mapValueType=null, rowSchema=null, metadata={}}, options={{}}}
Field{name=w0$o0, description=, type=FieldType{typeName=INT64, nullable=false, 
logicalType=null, collectionElementType=null, mapKeyType=null, 
mapValueType=null, rowSchema=null, metadata={}}, options={{}}}


Here are some detailed logs if they are helpful:

[main] INFO org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner - SQL:
SELECT `PCOLLECTION`.`foo`, `PCOLLECTION`.`bar`, RANK() OVER (PARTITION BY 
`PCOLLECTION`.`foo` ORDER BY `PCOLLECTION`.`bar`) AS `agg`
FROM `beam`.`PCOLLECTION` AS `PCOLLECTION`
[main] INFO org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner - 
SQLPlan>
LogicalProject(foo=[$0], bar=[$1], agg=[RANK() OVER (PARTITION BY $0 ORDER BY 
$1 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)])
  BeamIOSourceRel(table=[[beam, PCOLLECTION]])

[main] INFO org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner - 
BEAMPlan>
BeamWindowRel(window#0=[window(partition {0} order by [1] range between 
UNBOUNDED PRECEDING and CURRENT ROW aggs [RANK()])])
  BeamIOSourceRel(table=[[beam, PCOLLECTION]])









From: Rui Wang 
Date: Tuesday, March 2, 2021 at 10:43 AM
To: Tao Li 
Cc: "user@beam.apache.org" 
Subject: Re: Regarding the over window query support from Beam SQL

Hi Tao,

[1] contains what functions are working with OVER clause. Rank is one of the 
functions that is supported. Can you take a look?


[1]: 
https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamAnalyticFunctionsTest.java

-Rui

On Tue, Mar 2, 2021 at 9:24 AM Tao Li mailto:t...@zillow.com>> 
wrote:
+ Rui Wang. Looks like Rui has been working on this jira.


From: Tao Li mailto:t...@zillow.com>>
Date: Monday, March 1, 2021 at 9:51 PM
To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Subject: Regarding the over window query support from Beam SQL

Hi Beam community,

Querying over a window for ranking etc is pretty common in SQL use cases. I 
have found this jira 
https://issues.apache.org/jira/browse/BEAM-9198

Do we have a plan to support this? If there is no such plan in near future, are 
Beam developers supposed to implement this function on their own (e.g. by using 
GroupBy)? Thanks!


Re: Regarding the over window query support from Beam SQL

2021-03-05 Thread Tao Li
Hi Rui,

Yes that’s the problem. The alias is not propagated to the final schema.

Created https://issues.apache.org/jira/browse/BEAM-11930

Thanks!
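
Until that is fixed, one possible workaround (an untested sketch, assuming the synthesized
column keeps the name w0$o0 shown in the logs below) is to rename the field on the result:

import org.apache.beam.sdk.schemas.transforms.RenameFields;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

// Rename the planner-generated analytic column back to the intended alias.
PCollection<Row> withAlias =
    result.apply(RenameFields.<Row>create().rename("w0$o0", "agg"));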

From: Rui Wang 
Reply-To: "user@beam.apache.org" 
Date: Friday, March 5, 2021 at 11:31 AM
To: user 
Subject: Re: Regarding the over window query support from Beam SQL

I see. So the problem is the alias does appear in the output schema?

Based on your log: the logical plan contains the "agg" as alias but the 
physical plan (the BeamWindowRel) seems not showing the alias.

I think it's worth opening a JIRA now to further investigate why the alias did 
not correctly pass through. The entry point is to investigate from 
BeamWindowRel.

-Rui

On Fri, Mar 5, 2021 at 10:20 AM Tao Li 
mailto:t...@zillow.com>> wrote:
Hi Rui,

Just following up on this issue. Do you think this is a bug? Is there a 
workaround? Thanks!

From: Tao Li mailto:t...@zillow.com>>
Reply-To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Date: Tuesday, March 2, 2021 at 3:37 PM
To: Rui Wang mailto:amaliu...@apache.org>>
Cc: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Subject: Re: Regarding the over window query support from Beam SQL

Hi Rui,

Thanks for this info. It’s good to know we are already supporting the window 
function. But I still have a problem with the schema of the query result.

This is my code (with Beam 2.28):

Schema appSchema = Schema
.builder()
.addInt32Field("foo")
.addInt32Field("bar")
.build();

Row rowOne = Row.withSchema(appSchema).addValues(1, 1).build();
Row rowTwo = Row.withSchema(appSchema).addValues(1, 2).build();

PCollection inputRows = executionContext.getPipeline()
.apply(Create.of(rowOne, rowTwo))
.setRowSchema(appSchema);

String sql = "SELECT foo, bar, RANK() over (PARTITION BY foo ORDER BY 
bar) AS agg FROM PCOLLECTION";
PCollection result  = inputRows.apply("sql", 
SqlTransform.query(sql));

I can see the expected data from result, but I don’t see “agg” column in the 
schema. Do you have any ideas regarding this issue? Thanks!


The Beam schema of the result is:

Field{name=foo, description=, type=FieldType{typeName=INT32, nullable=false, 
logicalType=null, collectionElementType=null, mapKeyType=null, 
mapValueType=null, rowSchema=null, metadata={}}, options={{}}}
Field{name=bar, description=, type=FieldType{typeName=INT32, nullable=false, 
logicalType=null, collectionElementType=null, mapKeyType=null, 
mapValueType=null, rowSchema=null, metadata={}}, options={{}}}
Field{name=w0$o0, description=, type=FieldType{typeName=INT64, nullable=false, 
logicalType=null, collectionElementType=null, mapKeyType=null, 
mapValueType=null, rowSchema=null, metadata={}}, options={{}}}


Here are some detailed logs if they are helpful:

[main] INFO org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner - SQL:
SELECT `PCOLLECTION`.`foo`, `PCOLLECTION`.`bar`, RANK() OVER (PARTITION BY 
`PCOLLECTION`.`foo` ORDER BY `PCOLLECTION`.`bar`) AS `agg`
FROM `beam`.`PCOLLECTION` AS `PCOLLECTION`
[main] INFO org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner - 
SQLPlan>
LogicalProject(foo=[$0], bar=[$1], agg=[RANK() OVER (PARTITION BY $0 ORDER BY 
$1 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)])
  BeamIOSourceRel(table=[[beam, PCOLLECTION]])

[main] INFO org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner - 
BEAMPlan>
BeamWindowRel(window#0=[window(partition {0} order by [1] range between 
UNBOUNDED PRECEDING and CURRENT ROW aggs [RANK()])])
  BeamIOSourceRel(table=[[beam, PCOLLECTION]])









From: Rui Wang mailto:amaliu...@apache.org>>
Date: Tuesday, March 2, 2021 at 10:43 AM
To: Tao Li mailto:t...@zillow.com>>
Cc: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Subject: Re: Regarding the over window query support from Beam SQL

Hi Tao,

[1] contains what functions are working with OVER clause. Rank is one of the 
functions that is supported. Can you take a look?


[1]: 
https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamAnalyticFunctionsTest.java

Re: Regarding the over window query support from Beam SQL

2021-03-11 Thread Tao Li
Rui,

I think I found another potential bug with rank().

+++
|column_1|column_2|
+++
|1   |100 |
|1   |200 |
+++
Query using Beam SQL:

SELECT *, RANK() over (PARTITION BY column_1 ORDER BY column_2 DESC) AS agg 
FROM PCOLLECTION

Result:

[1, 200, 2]
[1, 100, 1]

While I expect the result to be:

[1, 200, 1]
[1, 100, 2]

So basically the rank result (by using desc order) seems incorrect to me. Can 
you please take a look at this issue? Thanks!


From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Friday, March 5, 2021 at 1:37 PM
To: "user@beam.apache.org" , Rui Wang 
Subject: Re: Regarding the over window query support from Beam SQL

Hi Rui,

Yes that’s the problem. The alias is not propagated to the final schema.

Created https://issues.apache.org/jira/browse/BEAM-11930

Thanks!

From: Rui Wang 
Reply-To: "user@beam.apache.org" 
Date: Friday, March 5, 2021 at 11:31 AM
To: user 
Subject: Re: Regarding the over window query support from Beam SQL

I see. So the problem is the alias does appear in the output schema?

Based on your log: the logical plan contains the "agg" as alias but the 
physical plan (the BeamWindowRel) seems not showing the alias.

I think it's worth opening a JIRA now to further investigate why the alias did 
not correctly pass through. The entry point is to investigate from 
BeamWindowRel.

-Rui

On Fri, Mar 5, 2021 at 10:20 AM Tao Li 
mailto:t...@zillow.com>> wrote:
Hi Rui,

Just following up on this issue. Do you think this is a bug? Is there a 
workaround? Thanks!

From: Tao Li mailto:t...@zillow.com>>
Reply-To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Date: Tuesday, March 2, 2021 at 3:37 PM
To: Rui Wang mailto:amaliu...@apache.org>>
Cc: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Subject: Re: Regarding the over window query support from Beam SQL

Hi Rui,

Thanks for this info. It’s good to know we are already supporting the window 
function. But I still have a problem with the schema of the query result.

This is my code (with Beam 2.28):

Schema appSchema = Schema
.builder()
.addInt32Field("foo")
.addInt32Field("bar")
.build();

Row rowOne = Row.withSchema(appSchema).addValues(1, 1).build();
Row rowTwo = Row.withSchema(appSchema).addValues(1, 2).build();

PCollection inputRows = executionContext.getPipeline()
.apply(Create.of(rowOne, rowTwo))
.setRowSchema(appSchema);

String sql = "SELECT foo, bar, RANK() over (PARTITION BY foo ORDER BY 
bar) AS agg FROM PCOLLECTION";
PCollection result  = inputRows.apply("sql", 
SqlTransform.query(sql));

I can see the expected data from result, but I don’t see “agg” column in the 
schema. Do you have any ideas regarding this issue? Thanks!


The Beam schema of the result is:

Field{name=foo, description=, type=FieldType{typeName=INT32, nullable=false, 
logicalType=null, collectionElementType=null, mapKeyType=null, 
mapValueType=null, rowSchema=null, metadata={}}, options={{}}}
Field{name=bar, description=, type=FieldType{typeName=INT32, nullable=false, 
logicalType=null, collectionElementType=null, mapKeyType=null, 
mapValueType=null, rowSchema=null, metadata={}}, options={{}}}
Field{name=w0$o0, description=, type=FieldType{typeName=INT64, nullable=false, 
logicalType=null, collectionElementType=null, mapKeyType=null, 
mapValueType=null, rowSchema=null, metadata={}}, options={{}}}


Here are some detailed logs if they are helpful:

[main] INFO org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner - SQL:
SELECT `PCOLLECTION`.`foo`, `PCOLLECTION`.`bar`, RANK() OVER (PARTITION BY 
`PCOLLECTION`.`foo` ORDER BY `PCOLLECTION`.`bar`) AS `agg`
FROM `beam`.`PCOLLECTION` AS `PCOLLECTION`
[main] INFO org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner - 
SQLPlan>
LogicalProject(foo=[$0], bar=[$1], agg=[RANK() OVER (PARTITION BY $0 ORDER BY 
$1 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)])
  BeamIOSourceRel(table=[[beam, PCOLLECTION]])

[main] INFO org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner - 
BEAMPlan>
BeamWindowRel(window#0=[window(partition {0} order by [1] range between 
UNBOUNDED PRECEDING and CURRENT ROW aggs [RANK()])])
  BeamIOSourceRel(table=[[beam, PCOLLECTION]])









From: Rui Wang mailto:amaliu...@apache.org>>
Date: Tuesday, March 2, 2021 at 10:43 AM
To: Tao Li mailto:t...@zillow.com>>
Cc: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Subject: Re: Regarding the over window query support from Beam SQL

Hi Tao,

[1] contains what functions are working with OVER clause. Rank is

How to add/alter/drop a Hive partition from a Beam app

2021-03-12 Thread Tao Li
Hi Beam community,

I am wondering how we can use some Beam APIs or Beam SQL to perform some Hive 
DDL operations such as add/alter/drop a partition. I guess I might need to use 
HCatalogIO, however I am not sure about what syntax to use. Please advise. 
Thanks!


Is there a perf comparison between Beam (on spark) and native Spark?

2021-03-22 Thread Tao Li
Hi Beam community,

I am wondering if there is a doc to compare perf of Beam (on Spark) and native 
Spark for batch processing? For example, using the TPC-DS benchmark.

I did find some relevant links like 
this
 but it’s old and it mostly covers the streaming scenarios.

Thanks!


Re: Is there a perf comparison between Beam (on spark) and native Spark?

2021-03-25 Thread Tao Li
Thanks @Alexey Romanenko<mailto:aromanenko@gmail.com> for this info. Do we 
have a rough idea of how Beam (on Spark) compares with native Spark using TPC-DS 
or any other benchmark? I am just wondering whether running Beam SQL with the 
Spark runner will have a similar processing time to Spark SQL. Thanks!

From: Alexey Romanenko 
Reply-To: "user@beam.apache.org" 
Date: Tuesday, March 23, 2021 at 12:58 PM
To: "user@beam.apache.org" 
Subject: Re: Is there a perf comparison between Beam (on spark) and native 
Spark?

There is an extension in Beam to support TPC-DS benchmark [1] that basically 
runs TPC-DS SQL queries via Beam SQL. Though, I’m not sure if it runs regularly 
and, IIRC (when I took a look on this last time, maybe I’m mistaken), it 
requires some adjustments to run on any other runners than Dataflow. Also, when 
I tried to run it on SparkRunner many queries failed because of different 
reasons [2].

I believe that if we will manage to make it running for most of the queries on 
any runner then it will be a good addition to Nexmark benchmark that we have 
for now since TPC-DS results can be used to compare with other data processing 
systems as well.

[1] 
https://github.com/apache/beam/tree/master/sdks/java/testing/tpcds
[2] 
https://issues.apache.org/jira/browse/BEAM-9891


On 22 Mar 2021, at 18:00, Tao Li mailto:t...@zillow.com>> 
wrote:

Hi Beam community,

I am wondering if there is a doc to compare perf of Beam (on Spark) and native 
Spark for batch processing? For example, using the TPC-DS benchmark.

I did find some relevant links like 
this<https://archive.fosdem.org/2018/schedule/event/nexmark_benchmarking_suite/attachments/slides/2494/export/events/attachments/nexmark_benchmarking_suite/slides/2494/Nexmark_Suite_for_Apache_Beam_(FOSDEM18).pdf>
 but it’s old and it mostly covers the streaming scenarios.

Thanks!



Does Beam DynamoDBIO support DynamoDB Streams?

2021-04-03 Thread Tao Li
Hi Beam community,

Does Beam DynamoDBIO support ingesting DynamoDB 
Streams?
 Thanks!


Any ETA of flink 1.12.2 support?

2021-04-12 Thread Tao Li
Hi Beam community,

Beam 2.28.0 supports flink 1.12.1 for flink runner. We are expecting some bug 
fixes from flink 1.12.2. Will flink version be upgraded to 1.12.2 with Beam 
2.29? And is there an ETA for that? Thanks!


Any easy way to extract values from PCollection?

2021-04-21 Thread Tao Li
Hi Beam community,

This is the question I am asking: 
https://stackoverflow.com/questions/28015924/how-to-extract-contents-from-pcollection-in-cloud-dataflow

Thanks!


Re: Any easy way to extract values from PCollection?

2021-04-22 Thread Tao Li
Thanks everyone for your suggestions!

From: Ning Kang 
Reply-To: "user@beam.apache.org" 
Date: Thursday, April 22, 2021 at 10:51 AM
To: "user@beam.apache.org" 
Cc: Yuan Feng 
Subject: Re: Any easy way to extract values from PCollection?

+1 to Brian's answer.

In Java, you can

singleValuedPcollection.apply("Write single value",
    TextIO.write().to(options.getSomeGcsPath()));
as the last step of your pipeline.

Then in your program, after executing the pipeline (wait until finish), use the 
Cloud Storage Java client 
library<https://cloud.google.com/storage/docs/reference/libraries#client-libraries-install-java>
 to read the file and extract the value.
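
A minimal sketch of that flow (untested; the bucket and object names are placeholders, it
assumes the google-cloud-storage client dependency, and writing with .withoutSharding()
keeps the output object name predictable):

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.nio.charset.StandardCharsets;

// Run the pipeline to completion first, then read the single output object back.
pipeline.run().waitUntilFinish();

Storage storage = StorageOptions.getDefaultInstance().getService();
Blob blob = storage.get("my-bucket", "output/single-value.txt");   // placeholder names
String value = new String(blob.getContent(), StandardCharsets.UTF_8);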

On Thu, Apr 22, 2021 at 10:45 AM Brian Hulette 
mailto:bhule...@google.com>> wrote:
I don't think there's an easy answer to this question, in general all you can 
do with a PCollection is indicate you'd like to write it out to an IO. There 
has been some work in the Python SDK on "Interactive Beam" which is designed 
for using the Python SDK interactively in a notebook environment. It will let 
you collect() a PCollection - meaning it runs the pipeline and materializes the 
result. There's no such capability for the other SDKs though.

On Wed, Apr 21, 2021 at 8:24 PM Tao Li 
mailto:t...@zillow.com>> wrote:
Hi Beam community,

This is the question I am asking: 
https://stackoverflow.com/questions/28015924/how-to-extract-contents-from-pcollection-in-cloud-dataflow

Thanks!


Question on late data handling in Beam streaming mode

2021-04-22 Thread Tao Li
Hi Beam community,

I am wondering if there is a risk of losing late data from a Beam stream app 
due to watermarking?

I just went through this design doc and noticed the “droppable” definition 
there: 
https://docs.google.com/document/d/12r7frmxNickxB5tbpuEh_n35_IJeVZn1peOrBrhhP6Y/edit#

Can you please confirm if it’s possible for us to lose some data in a stream 
app in practice? If that’s possible, what would be the best practice to avoid 
data loss? Thanks!



Re: Question on late data handling in Beam streaming mode

2021-04-23 Thread Tao Li
Thanks @Kenneth Knowles<mailto:k...@apache.org>. I understand we need to 
specify a window for groupby so that the app knows when processing is “done” 
to output the result.

Is it possible to specify an event arrival/processing-time based window for 
groupby? The purpose is to avoid dropping late events. With an event 
processing-time based window, the app will periodically output the result based 
on all events that arrived in that window, and a late-arriving event will fall 
into whatever window covers its arrival time, so that late data will not be 
lost.

Does Beam support this kind of mechanism? Thanks.

From: Kenneth Knowles 
Reply-To: "user@beam.apache.org" 
Date: Thursday, April 22, 2021 at 1:49 PM
To: user 
Cc: Kelly Smith , Lian Jiang 
Subject: Re: Question on late data handling in Beam streaming mode

Hello!

In a streaming app, you have two choices: wait forever and never have any 
output OR use some method to decide that aggregation is "done".

In Beam, the way you decide that aggregation is "done" is the watermark. When 
the watermark predicts no more data for an aggregation, then the aggregation is 
done. For example GROUP BY  is "done" when no more data will arrive for 
that minute. At this point, your result is produced. More data may arrive, and 
it is ignored. The watermark is determined by the IO connector to be the best 
heuristic available. You can configure "allowed lateness" for an aggregation to 
allow out of order data.

Kenn
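
A minimal, untested sketch of the "allowed lateness" knob (the input PCollection<String>
named events, the one-minute window, and the two-hour bound are all assumptions to adapt):

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Elements arriving up to 2 hours after the watermark passes the end of their
// one-minute window are still counted instead of being dropped.
PCollection<KV<String, Long>> counts = events
    .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
        .withAllowedLateness(Duration.standardHours(2))
        .accumulatingFiredPanes())
    .apply(Count.perElement());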

On Thu, Apr 22, 2021 at 1:26 PM Tao Li 
mailto:t...@zillow.com>> wrote:
Hi Beam community,

I am wondering if there is a risk of losing late data from a Beam stream app 
due to watermarking?

I just went through this design doc and noticed the “droppable” definition 
there: 
https://docs.google.com/document/d/12r7frmxNickxB5tbpuEh_n35_IJeVZn1peOrBrhhP6Y/edit#

Can you please confirm if it’s possible for us to lose some data in a stream 
app in practice? If that’s possible, what would be the best practice to avoid 
data loss? Thanks!



Re: Question on late data handling in Beam streaming mode

2021-04-26 Thread Tao Li
Thanks folks. This is really informative!

From: Kenneth Knowles 
Reply-To: "user@beam.apache.org" 
Date: Friday, April 23, 2021 at 9:34 AM
To: Reuven Lax 
Cc: user , Kenneth Knowles , Kelly Smith 
, Lian Jiang 
Subject: Re: Question on late data handling in Beam streaming mode

Reuven's answer will result in a group by key (but not window) where no data is 
dropped and you get deltas for each key. Downstream consumers can recombine the 
deltas to get per-key aggregation. So instead of putting the time interval into 
the window, you put it into the key, and then you get the same grouped 
aggregation.

There are (at least) two other ways to do this:

1. You can set allowed lateness to a high value.
2. You can use a ParDo and outputWithTimestamp [1] to set the timestamps to 
arrival time. I illustrated this in some older talks [2].

Kenn

[1] 
https://github.com/apache/beam/blob/dc636be57900c8ad9b6b9e50b08dad64be8aee40/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L184
[2] 
https://docs.google.com/presentation/d/1smGXb-0GGX_Fid1z3WWzZJWtyBjBA3Mo3t4oeRjJoZI/present?slide=id.g142c2fd96f_0_134
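
A rough sketch of option 2 (untested): reassign each element's timestamp to its arrival
(processing) time in a ParDo. Shifting timestamps forward is always allowed, so no extra
timestamp-skew configuration is assumed here.

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.joda.time.Instant;

// Stamp every element with its arrival time so that downstream event-time
// windows effectively become arrival-time windows.
class ToArrivalTime<T> extends DoFn<T, T> {
  @ProcessElement
  public void processElement(@Element T element, OutputReceiver<T> out) {
    out.outputWithTimestamp(element, Instant.now());
  }
}

// Usage, assuming a PCollection<MyEvent> named events:
// events.apply(ParDo.of(new ToArrivalTime<MyEvent>()));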

On Fri, Apr 23, 2021 at 8:32 AM Reuven Lax 
mailto:re...@google.com>> wrote:
You can definitely group by processing time. The way to do this in Beam is as 
follows

Window.into(new GlobalWindows())
    .triggering(AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardSeconds(30))))
    .discardingFiredPanes();

The syntax is a bit unfortunately wordy, but the idea is that you are creating 
a single event-time window that encompasses all time, and "triggering" an 
aggregation every 30 seconds based on processing time.

On Fri, Apr 23, 2021 at 8:14 AM Tao Li 
mailto:t...@zillow.com>> wrote:
Thanks @Kenneth Knowles<mailto:k...@apache.org>. I understand we need to 
specify a window for groupby so that the app knowns when processing is “done” 
to output result.

Is it possible to specify a event arrival/processing time based window for 
groupby? The purpose is to avoid dropping of late events. With a event 
processing time based window, the app will periodically output the result based 
on all events that arrived in that window, and a late arriving event will fall 
into whatever window covers its arrival time and thus that late data will not 
get lost.

Does Beam support this kind of mechanism? Thanks.

From: Kenneth Knowles mailto:k...@apache.org>>
Reply-To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Date: Thursday, April 22, 2021 at 1:49 PM
To: user mailto:user@beam.apache.org>>
Cc: Kelly Smith mailto:kell...@zillowgroup.com>>, Lian 
Jiang mailto:li...@zillowgroup.com>>
Subject: Re: Question on late data handling in Beam streaming mode

Hello!

In a streaming app, you have two choices: wait forever and never have any 
output OR use some method to decide that aggregation is "done".

In Beam, the way you decide that aggregation is "done" is the watermark. When 
the watermark predicts no more data for an aggregation, then the aggregation is 
done. For example GROUP BY  is "done" when no more data will arrive for 
that minute. At this point, your result is produced. More data may arrive, and 
it is ignored. The watermark is determined by the IO connector to be the best 
heuristic available. You can configure "allowed lateness" for an aggregation to 
allow out of order data.

Kenn

On Thu, Apr 22, 2021 at 1:26 PM Tao Li 
mailto:t...@zillow.com>> wrote:
Hi Beam community,

I am wondering if there is a risk of losing late data from a Beam stream app 
due to watermarking?

I just went through this design doc and noticed the “droppable” definition 
there: 
https://docs.google.com/document/d/12r7frmxNickxB5tbpuEh_n

Question on printing out a PCollection

2021-04-29 Thread Tao Li
Hi Beam community,

The notebook console from Google Cloud defines a show() API to display a 
PCollection which is very neat: 
https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development

If we are using a regular jupyter notebook to run beam app, how can we print 
out a PCollection easily? What’s the best practice? Thanks!



Re: Question on printing out a PCollection

2021-04-30 Thread Tao Li
Thanks @Ning Kang.

@Robert Bradshaw I assume you are referring to 
https://beam.apache.org/releases/pydoc/2.22.0/apache_beam.runners.interactive.interactive_beam.html.
 Is there a java version for it?



On 4/30/21, 11:00 AM, "Robert Bradshaw"  wrote:

You can also use interactive Beam's collect, to get the PCollection as
a Dataframe, and then print it or do whatever else with it as you
like.

On Fri, Apr 30, 2021 at 10:24 AM Ning Kang  wrote:
>
> Hi Tao,
>
> The `show()` API works with any IPython notebook runtimes, including 
Colab, Jupyter Lab and pre-lab Jupyter Notebooks, as long as you have `%pip 
install apache-beam[interactive]`.
>
> Additionally, the `show_graph()` API needs GraphViz binary installed, 
details can be found in the README.
>
> If you've created an Apache Beam notebook instance on Google Cloud, there 
is an example notebook "Examples/Visualize_Data.ipynb" demonstrating how to 
visualize data of PCollections with different libraries:
>
> Native Interactive Beam Visualization
> Pandas DataFrame
> Matplotlib
> Seaborn
> Bokeh
> D3.js
>
> Hope this helps!
>
> Ning
>
> On Fri, Apr 30, 2021 at 9:24 AM Brian Hulette  wrote:
>>
>> +Ning Kang +Sam Rohde
>>
>> On Thu, Apr 29, 2021 at 6:13 PM Tao Li  wrote:
>>>
>>> Hi Beam community,
>>>
>>>
>>>
>>> The notebook console from Google Cloud defines a show() API to display 
a PCollection which is very neat: 
https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development
>>>
>>>
>>>
>>> If we are using a regular jupyter notebook to run beam app, how can we 
print out a PCollection easily? What’s the best practice? Thanks!
>>>
>>>



Does SnowflakeIO support spark runner

2021-05-06 Thread Tao Li
Hi Beam community,

Does SnowflakeIO support spark runner? Seems like only direct runner and 
dataflow runner are supported..

Thanks!


Re: Does SnowflakeIO support spark runner

2021-05-06 Thread Tao Li
Hi @Kyle Weaver<mailto:kcwea...@google.com>

According to this 
doc<https://beam.apache.org/documentation/io/built-in/snowflake/>: 
--runner=

From: Kyle Weaver 
Reply-To: "user@beam.apache.org" 
Date: Thursday, May 6, 2021 at 12:01 PM
To: "user@beam.apache.org" 
Cc: Anuj Gandhi 
Subject: Re: Does SnowflakeIO support spark runner

As far as I know, it should be supported (Beam's abstract model means IOs 
usually "just work" on all runners). What makes you think it isn't supported?

On Thu, May 6, 2021 at 11:52 AM Tao Li 
mailto:t...@zillow.com>> wrote:
Hi Beam community,

Does SnowflakeIO support spark runner? Seems like only direct runner and 
dataflow runner are supported..

Thanks!


Re: Does SnowflakeIO support spark runner

2021-05-06 Thread Tao Li
Thanks Kyle!

From: Kyle Weaver 
Date: Thursday, May 6, 2021 at 12:19 PM
To: Tao Li 
Cc: "user@beam.apache.org" , Anuj Gandhi 

Subject: Re: Does SnowflakeIO support spark runner

Yeah, I'm pretty sure that documentation is just misleading. All of the options 
from --runner onward are runner-specific and don't have anything to do with 
Snowflake, so they should probably be removed from the doc.

On Thu, May 6, 2021 at 12:06 PM Tao Li 
mailto:t...@zillow.com>> wrote:
Hi @Kyle Weaver<mailto:kcwea...@google.com>

According to this 
doc<https://beam.apache.org/documentation/io/built-in/snowflake/>:
 --runner=

From: Kyle Weaver mailto:kcwea...@google.com>>
Reply-To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Date: Thursday, May 6, 2021 at 12:01 PM
To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Cc: Anuj Gandhi mailto:an...@zillowgroup.com>>
Subject: Re: Does SnowflakeIO support spark runner

As far as I know, it should be supported (Beam's abstract model means IOs 
usually "just work" on all runners). What makes you think it isn't supported?

On Thu, May 6, 2021 at 11:52 AM Tao Li 
mailto:t...@zillow.com>> wrote:
Hi Beam community,

Does SnowflakeIO support spark runner? Seems like only direct runner and 
dataflow runner are supported..

Thanks!


A problem with calcite sql

2021-05-10 Thread Tao Li
Hi Beam community,

I am seeing a weird issue when using Calcite SQL. I don’t understand why it’s 
complaining that my query is not valid. Once I removed “user AS user”, it worked 
fine. Please advise. Thanks.

Exception in thread "main" 
org.apache.beam.sdk.extensions.sql.impl.ParseException: Unable to parse query 
SELECT id AS id, user AS user, market_name AS market_name, 
market_transactionManagement_transactionManagers.email AS 
market_transactionManagement_transactionManagers_email, 
market_transactionManagement_transactionManagers.name AS 
market_transactionManagement_transactionManagers_name, 
market_transactionManagement_transactionProfileId AS 
market_transactionManagement_transactionProfileId FROM PCOLLECTION
at 
org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner.convertToBeamRel(CalciteQueryPlanner.java:214)
at 
org.apache.beam.sdk.extensions.sql.impl.BeamSqlEnv.parseQuery(BeamSqlEnv.java:111)
at 
org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:171)
at 
org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:109)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:547)
at 
org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:498)
at 
org.apache.beam.sdk.values.PCollection.apply(PCollection.java:370)
at 
com.zillow.pipeler.core.transform.DatasetFlattenerCore.updateSchemaBasedOnAvroSchema(DatasetFlattenerCore.java:85)
at 
com.zillow.pipeler.core.transform.DatasetFlattenerCore.execute(DatasetFlattenerCore.java:61)
at 
com.zillow.pipeler.core.transform.DatasetFlattenerCore.execute(DatasetFlattenerCore.java:29)
at 
com.zillow.pipeler.orchestrator.BaseOrchestrator.run(BaseOrchestrator.java:61)
at 
com.zillow.pipeler.orchestrator.transform.DatasetFlattenerOrchestrator.main(DatasetFlattenerOrchestrator.java:71)
Caused by: 
org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.parser.SqlParseException:
 Encountered "AS user" at line 1, column 23.
Was expecting one of:

"ORDER" ...
"LIMIT" ...


Re: A problem with calcite sql

2021-05-10 Thread Tao Li
Never mind. Looks like “user” is a reserved name.
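
For the archives: Beam's Calcite planner logs queries with backtick-quoted identifiers, so
escaping the reserved word the same way should also work; an untested sketch:

import org.apache.beam.sdk.extensions.sql.SqlTransform;

// Backtick-quote the reserved word so Calcite treats it as an ordinary identifier.
String sql = "SELECT id AS id, `user` AS `user`, market_name AS market_name "
    + "FROM PCOLLECTION";
input.apply(SqlTransform.query(sql));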

From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Monday, May 10, 2021 at 7:10 PM
To: "user@beam.apache.org" 
Cc: Yuan Feng 
Subject: A problem with calcite sql

Hi Beam community,

I am seeing a weird issue by using calcite sql. I don’t understand why it’s 
complaining my query is not valid. Once I removed “user AS user”, it worked 
fine. Please advise. Thanks.

Exception in thread "main" 
org.apache.beam.sdk.extensions.sql.impl.ParseException: Unable to parse query 
SELECT id AS id, user AS user, market_name AS market_name, 
market_transactionManagement_transactionManagers.email AS 
market_transactionManagement_transactionManagers_email, 
market_transactionManagement_transactionManagers.name AS 
market_transactionManagement_transactionManagers_name, 
market_transactionManagement_transactionProfileId AS 
market_transactionManagement_transactionProfileId FROM PCOLLECTION
at 
org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner.convertToBeamRel(CalciteQueryPlanner.java:214)
at 
org.apache.beam.sdk.extensions.sql.impl.BeamSqlEnv.parseQuery(BeamSqlEnv.java:111)
at 
org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:171)
at 
org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:109)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:547)
at 
org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:498)
at 
org.apache.beam.sdk.values.PCollection.apply(PCollection.java:370)
at 
com.zillow.pipeler.core.transform.DatasetFlattenerCore.updateSchemaBasedOnAvroSchema(DatasetFlattenerCore.java:85)
at 
com.zillow.pipeler.core.transform.DatasetFlattenerCore.execute(DatasetFlattenerCore.java:61)
at 
com.zillow.pipeler.core.transform.DatasetFlattenerCore.execute(DatasetFlattenerCore.java:29)
at 
com.zillow.pipeler.orchestrator.BaseOrchestrator.run(BaseOrchestrator.java:61)
at 
com.zillow.pipeler.orchestrator.transform.DatasetFlattenerOrchestrator.main(DatasetFlattenerOrchestrator.java:71)
Caused by: 
org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.parser.SqlParseException:
 Encountered "AS user" at line 1, column 23.
Was expecting one of:

"ORDER" ...
"LIMIT" ...


Re: A problem with calcite sql

2021-05-10 Thread Tao Li
Sorry to bug you with another question. I was saving a dataset with the below
schema (the dataset comes from a SQL query) and hit the SqlCharType issue shown
below. Has anyone seen this issue before?

[main] INFO com.zillow.pipeler.core.transform.DatasetFlattenerCore - Fields:
Field{name=id, description=, type=LOGICAL_TYPE NOT NULL, options={{}}}
Field{name=user_tmp, description=, type=LOGICAL_TYPE NOT NULL, options={{}}}
Field{name=market_name, description=, type=LOGICAL_TYPE NOT NULL, options={{}}}
Field{name=market_transactionManagement_transactionManagers_email, 
description=, type=LOGICAL_TYPE NOT NULL, options={{}}}
Field{name=market_transactionManagement_transactionManagers_name, description=, 
type=LOGICAL_TYPE NOT NULL, options={{}}}
Field{name=market_transactionManagement_transactionProfileId, description=, 
type=LOGICAL_TYPE NOT NULL, options={{}}}
Options:{{}}
Exception in thread "main" java.lang.RuntimeException: Unhandled logical type 
SqlCharType
at 
org.apache.beam.sdk.schemas.utils.AvroUtils.getFieldSchema(AvroUtils.java:911)
at 
org.apache.beam.sdk.schemas.utils.AvroUtils.toAvroField(AvroUtils.java:306)
at 
org.apache.beam.sdk.schemas.utils.AvroUtils.toAvroSchema(AvroUtils.java:341)
at 
org.apache.beam.sdk.schemas.utils.AvroUtils.toAvroSchema(AvroUtils.java:348)


From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Monday, May 10, 2021 at 7:19 PM
To: "user@beam.apache.org" 
Cc: Yuan Feng 
Subject: Re: A problem with calcite sql

Never mind. Looks like “user” is a reserved name.

From: Tao Li 
Reply-To: "user@beam.apache.org" 
Date: Monday, May 10, 2021 at 7:10 PM
To: "user@beam.apache.org" 
Cc: Yuan Feng 
Subject: A problem with calcite sql

Hi Beam community,

I am seeing a weird issue by using calcite sql. I don’t understand why it’s 
complaining my query is not valid. Once I removed “user AS user”, it worked 
fine. Please advise. Thanks.

Exception in thread "main" 
org.apache.beam.sdk.extensions.sql.impl.ParseException: Unable to parse query 
SELECT id AS id, user AS user, market_name AS market_name, 
market_transactionManagement_transactionManagers.email AS 
market_transactionManagement_transactionManagers_email, 
market_transactionManagement_transactionManagers.name AS 
market_transactionManagement_transactionManagers_name, 
market_transactionManagement_transactionProfileId AS 
market_transactionManagement_transactionProfileId FROM PCOLLECTION
at 
org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner.convertToBeamRel(CalciteQueryPlanner.java:214)
at 
org.apache.beam.sdk.extensions.sql.impl.BeamSqlEnv.parseQuery(BeamSqlEnv.java:111)
at 
org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:171)
at 
org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:109)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:547)
at 
org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:498)
at 
org.apache.beam.sdk.values.PCollection.apply(PCollection.java:370)
at 
com.zillow.pipeler.core.transform.DatasetFlattenerCore.updateSchemaBasedOnAvroSchema(DatasetFlattenerCore.java:85)
at 
com.zillow.pipeler.core.transform.DatasetFlattenerCore.execute(DatasetFlattenerCore.java:61)
at 
com.zillow.pipeler.core.transform.DatasetFlattenerCore.execute(DatasetFlattenerCore.java:29)
at 
com.zillow.pipeler.orchestrator.BaseOrchestrator.run(BaseOrchestrator.java:61)
at 
com.zillow.pipeler.orchestrator.transform.DatasetFlattenerOrchestrator.main(DatasetFlattenerOrchestrator.java:71)
Caused by: 
org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.parser.SqlParseException:
 Encountered "AS user" at line 1, column 23.
Was expecting one of:

"ORDER" ...
"LIMIT" ...


Re: A problem with calcite sql

2021-05-11 Thread Tao Li
@Andrew Pilloud<mailto:apill...@google.com> thanks for your suggestions. I
tried CAST and TRIM, but they did not work:

SQL statement I am using: SELECT 'CAST(id AS VARCHAR)' FROM PCOLLECTION

Logs:

[main] INFO org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner - SQL:
SELECT 'CAST(id AS VARCHAR)'
FROM `beam`.`PCOLLECTION` AS `PCOLLECTION`
[main] INFO org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner - 
SQLPlan>
LogicalProject(EXPR$0=['CAST(id AS VARCHAR)'])
  BeamIOSourceRel(table=[[beam, PCOLLECTION]])

[main] INFO org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner - 
BEAMPlan>
BeamCalcRel(expr#0..44=[{inputs}], expr#45=['CAST(id AS VARCHAR)'], 
EXPR$0=[$t45])
  BeamIOSourceRel(table=[[beam, PCOLLECTION]])

Exception in thread "main" java.lang.RuntimeException: Unhandled logical type 
SqlCharType
at 
org.apache.beam.sdk.schemas.utils.AvroUtils.getFieldSchema(AvroUtils.java:911)
at 
org.apache.beam.sdk.schemas.utils.AvroUtils.toAvroField(AvroUtils.java:306)
at 
org.apache.beam.sdk.schemas.utils.AvroUtils.toAvroSchema(AvroUtils.java:341)
at 
org.apache.beam.sdk.schemas.utils.AvroUtils.toAvroSchema(AvroUtils.java:348)

From: Andrew Pilloud 
Reply-To: "user@beam.apache.org" 
Date: Monday, May 10, 2021 at 7:46 PM
To: user 
Cc: Yuan Feng 
Subject: Re: A problem with calcite sql

For the first one you have 
https://issues.apache.org/jira/browse/BEAM-5251
For the second, I opened a new issue for you: 
https://issues.apache.org/jira/browse/BEAM-12323

Your second issue is because our Avro conversion library doesn't know how to
handle fixed-length strings. These normally show up in SQL when you are
outputting a constant. I'm not sure exactly how to work around it; if you can
get the output type to be a VARCHAR (instead of CHAR), this problem will go
away. You might be able to do something like 'CAST("Your String Literal" AS
VARCHAR)' , 'TRIM("Your String Literal")' or ' "Your String Literal" || "" '.
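To spell out that workaround: the single quotes above are just quoting the suggested expressions, so in the actual statement only the string literal is quoted, and the CAST (or TRIM) wraps it so the planner produces a VARCHAR rather than the fixed-length CHAR that the Avro conversion rejects. A small, hedged sketch; the literal and alias are illustrative:

import org.apache.beam.sdk.extensions.sql.SqlTransform;

// Sketch: cast the constant to VARCHAR so no CHAR logical type reaches AvroUtils.
SqlTransform castToVarchar =
    SqlTransform.query(
        "SELECT id, CAST('some constant' AS VARCHAR) AS label FROM PCOLLECTION");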

On Mon, May 10, 2021 at 7:25 PM Tao Li 
mailto:t...@zillow.com>> wrote:
Sorry to bug with another question. I was saving a data set with below schema 
(this dataset comes from sql query). Saw the SqlCharType issue. Did anyone see 
this issue before?

[main] INFO com.zillow.pipeler.core.transform.DatasetFlattenerCore - Fields:
Field{name=id, description=, type=LOGICAL_TYPE NOT NULL, options={{}}}
Field{name=user_tmp, description=, type=LOGICAL_TYPE NOT NULL, options={{}}}
Field{name=market_name, description=, type=LOGICAL_TYPE NOT NULL, options={{}}}
Field{name=market_transactionManagement_transactionManagers_email, 
description=, type=LOGICAL_TYPE NOT NULL, options={{}}}
Field{name=market_transactionManagement_transactionManagers_name, description=, 
type=LOGICAL_TYPE NOT NULL, options={{}}}
Field{name=market_transactionManagement_transactionProfileId, description=, 
type=LOGICAL_TYPE NOT NULL, options={{}}}
Options:{{}}
Exception in thread "main" java.lang.RuntimeException: Unhandled logical type 
SqlCharType
at 
org.apache.beam.sdk.schemas.utils.AvroUtils.getFieldSchema(AvroUtils.java:911)
at 
org.apache.beam.sdk.schemas.utils.AvroUtils.toAvroField(AvroUtils.java:306)
    at 
org.apache.beam.sdk.schemas.utils.AvroUtils.toAvroSchema(AvroUtils.java:341)
at 
org.apache.beam.sdk.schemas.utils.AvroUtils.toAvroSchema(AvroUtils.java:348)


From: Tao Li mailto:t...@zillow.com>>
Reply-To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Date: Monday, May 10, 2021 at 7:19 PM
To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Cc: Yuan Feng mailto:yua...@zillowgroup.com>>
Subject: Re: A problem with calcite sql

Never mind. Looks like “user” is a reserved name.

From: Tao Li mailto:t...@zillow.com>>
Reply-To: "user@beam.apache.org<ma

Re: A problem with calcite sql

2021-05-11 Thread Tao Li
Thanks Andrew. With the `id` syntax I am no longer seeing the “Unhandled logical
type SqlCharType” error. This is great progress!

However, I am still seeing an issue when querying a composite field. Below is
the schema of the array-type field:

Field{name=market_transactionManagement_transactionManagers, description=, 
type=ARRAY>, options={{}}}

My SQL query is selecting a nested field: SELECT
`market_transactionManagement_transactionManagers.email` FROM PCOLLECTION

Error:

Caused by: 
org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.validate.SqlValidatorException:
 Column 'market_transactionManagement_transactionManagers.email' not found in 
any table

So what would be the right syntax? Thanks!

From: Andrew Pilloud 
Date: Tuesday, May 11, 2021 at 11:51 AM
To: Tao Li 
Cc: "user@beam.apache.org" , Yuan Feng 

Subject: Re: A problem with calcite sql

SELECT CAST('CAST(id AS VARCHAR)' AS VARCHAR) FROM PCOLLECTION works for me, 
but I don't think that is what you wanted. Note that ' is for string literals 
and ` is for escaping names in Beam SQL's default dialect config.

Try:
SELECT `id` FROM PCOLLECTION

On Tue, May 11, 2021 at 10:58 AM Tao Li 
mailto:t...@zillow.com>> wrote:
@Andrew Pilloud<mailto:apill...@google.com> thanks for your suggestions. I 
tried CAST and TRIM but it did not work:

Sql Stmt I am using: SELECT 'CAST(id AS VARCHAR)' FROM PCOLLECTION

Logs:

[main] INFO org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner - SQL:
SELECT 'CAST(id AS VARCHAR)'
FROM `beam`.`PCOLLECTION` AS `PCOLLECTION`
[main] INFO org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner - 
SQLPlan>
LogicalProject(EXPR$0=['CAST(id AS VARCHAR)'])
  BeamIOSourceRel(table=[[beam, PCOLLECTION]])

[main] INFO org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner - 
BEAMPlan>
BeamCalcRel(expr#0..44=[{inputs}], expr#45=['CAST(id AS VARCHAR)'], 
EXPR$0=[$t45])
  BeamIOSourceRel(table=[[beam, PCOLLECTION]])

Exception in thread "main" java.lang.RuntimeException: Unhandled logical type 
SqlCharType
at 
org.apache.beam.sdk.schemas.utils.AvroUtils.getFieldSchema(AvroUtils.java:911)
at 
org.apache.beam.sdk.schemas.utils.AvroUtils.toAvroField(AvroUtils.java:306)
at 
org.apache.beam.sdk.schemas.utils.AvroUtils.toAvroSchema(AvroUtils.java:341)
at 
org.apache.beam.sdk.schemas.utils.AvroUtils.toAvroSchema(AvroUtils.java:348)

From: Andrew Pilloud mailto:apill...@google.com>>
Reply-To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Date: Monday, May 10, 2021 at 7:46 PM
To: user mailto:user@beam.apache.org>>
Cc: Yuan Feng mailto:yua...@zillowgroup.com>>
Subject: Re: A problem with calcite sql

For the first one you have 
https://issues.apache.org/jira/browse/BEAM-5251
For the second, I opened a new issue for you: 
https://issues.apache.org/jira/browse/BEAM-12323

Your second issue is because our Avro conversion library doesn't know how to 
handle fixed length strings. These normally show up in SQL when you are 
outputting a constant. I'm not sure exactly how to work around it, if you can 
get the output type to be a VARCHAR (instead of CHAR) this problem will go 
away. You might be able to do something like 'CAST("Your String Literal" AS 
VARCHAR)' , 'TRIM("Your String Literal")' or ' "Your String Literal" || "" '.

On Mon, May 10, 2021 at 7:25 PM Tao Li 
mailto:t...@zillow.com>> wrote:
Sorry to bug with another question. I was saving a data set with below schema 
(this dataset comes from sql query). Saw the SqlCharType issue. Did anyone see 
this issue before?

[main] INFO com.zillow.pipeler.core.transform.DatasetFlattenerCore - Fields:
Field{name=id, description=, type=LOGICAL_TYPE NOT NULL, options={{}}}
Field{name=user_tmp, description=, type=LOGICAL_TYPE NOT NULL, options={{}}}
Field{name=market_name, description=, type=LOGICAL_TYP

Re: A problem with calcite sql

2021-05-12 Thread Tao Li
Andrew,

I tried the last query you recommended and am seeing this error:

Caused by: 
org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.tools.ValidationException:
 
org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.runtime.CalciteContextException:
 From line 1, column 34 to line 1, column 44: Table 'PCOLLECTION' not found
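The error suggests that PCOLLECTION is not in scope when it is referenced inside a bare FROM UNNEST(...). Calcite usually wants the array unnested in a join against the table instead, so the array column reference is lateral. An untested sketch, reusing the field names from this thread; whether Beam SQL accepts this form for an array of rows may depend on the Beam version:

import org.apache.beam.sdk.extensions.sql.SqlTransform;

// Untested sketch: unnest the array in a cross join so PCOLLECTION stays in scope.
SqlTransform unnestManagers =
    SqlTransform.query(
        "SELECT mgr.email "
            + "FROM PCOLLECTION "
            + "CROSS JOIN UNNEST(PCOLLECTION.market_transactionManagement_transactionManagers) AS mgr");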



From: Andrew Pilloud 
Date: Tuesday, May 11, 2021 at 10:38 PM
To: Tao Li 
Cc: "user@beam.apache.org" , Yuan Feng 

Subject: Re: A problem with calcite sql

If the type were just a nested row, this would work:
SELECT `market_transactionManagement_transactionManagers`.`email` FROM 
PCOLLECTION
or this:
SELECT market_transactionManagement_transactionManagers.email FROM PCOLLECTION

If you have exactly one element in the array something like this should work:
SELECT market_transactionManagement_transactionManagers[1].email FROM 
PCOLLECTION

If you want to extract the array, try something like this:
SELECT manager.email FROM 
UNNEST(PCOLLECTION.market_transactionManagement_transactionManagers) AS manager

On Tue, May 11, 2021 at 10:22 PM Tao Li 
mailto:t...@zillow.com>> wrote:
Thanks Andrew. With `id` syntax I am not seeing “Unhandled logical type 
SqlCharType” error any more. This is great progress!

However I am still seeing an issue by querying a composite field. Below is the 
schema of the array type field:

Field{name=market_transactionManagement_transactionManagers, description=, 
type=ARRAY>, options={{}}}

My sql query is selecting a nested field: SELECT 
`market_transactionManagement_transactionManagers.email` FROM PCOLLECTION

Error:

Caused by: 
org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.validate.SqlValidatorException:
 Column 'market_transactionManagement_transactionManagers.email' not found in 
any table

So what would be the right syntax? Thanks!

From: Andrew Pilloud mailto:apill...@google.com>>
Date: Tuesday, May 11, 2021 at 11:51 AM
To: Tao Li mailto:t...@zillow.com>>
Cc: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>, Yuan Feng 
mailto:yua...@zillowgroup.com>>
Subject: Re: A problem with calcite sql

SELECT CAST('CAST(id AS VARCHAR)' AS VARCHAR) FROM PCOLLECTION works for me, 
but I don't think that is what you wanted. Note that ' is for string literals 
and ` is for escaping names in Beam SQL's default dialect config.

Try:
SELECT `id` FROM PCOLLECTION

On Tue, May 11, 2021 at 10:58 AM Tao Li 
mailto:t...@zillow.com>> wrote:
@Andrew Pilloud<mailto:apill...@google.com> thanks for your suggestions. I 
tried CAST and TRIM but it did not work:

Sql Stmt I am using: SELECT 'CAST(id AS VARCHAR)' FROM PCOLLECTION

Logs:

[main] INFO org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner - SQL:
SELECT 'CAST(id AS VARCHAR)'
FROM `beam`.`PCOLLECTION` AS `PCOLLECTION`
[main] INFO org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner - 
SQLPlan>
LogicalProject(EXPR$0=['CAST(id AS VARCHAR)'])
  BeamIOSourceRel(table=[[beam, PCOLLECTION]])

[main] INFO org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner - 
BEAMPlan>
BeamCalcRel(expr#0..44=[{inputs}], expr#45=['CAST(id AS VARCHAR)'], 
EXPR$0=[$t45])
  BeamIOSourceRel(table=[[beam, PCOLLECTION]])

Exception in thread "main" java.lang.RuntimeException: Unhandled logical type 
SqlCharType
at 
org.apache.beam.sdk.schemas.utils.AvroUtils.getFieldSchema(AvroUtils.java:911)
at 
org.apache.beam.sdk.schemas.utils.AvroUtils.toAvroField(AvroUtils.java:306)
at 
org.apache.beam.sdk.schemas.utils.AvroUtils.toAvroSchema(AvroUtils.java:341)
at 
org.apache.beam.sdk.schemas.utils.AvroUtils.toAvroSchema(AvroUtils.java:348)

From: Andrew Pilloud mailto:apill...@google.com>>
Reply-To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Date: Monday, May 10, 2021 at 7:46 PM
To: user mailto:user@beam.apache.org>>
Cc: Yuan Feng mailto:yua...@zillowgroup.com>>
Subject: Re: A problem with calcite sql

For the first one you have 
https://issues.apache.org/jira/browse/BEAM-5251
For the second, I opened a new issue for you: 
https://issues.apache.org/jira/browse/BEAM-12323

A problem with nexmark build

2021-05-12 Thread Tao Li
Hi Beam community,

I have been following this nexmark doc: 
https://beam.apache.org/documentation/sdks/java/testing/nexmark/

I ran into a problem with the “Running query 0 on a Spark cluster with Apache
Hadoop YARN” section.

I followed the instructions and ran the “./gradlew
:sdks:java:testing:nexmark:assemble” command, but I did not find the uber jar
“build/libs/beam-sdks-java-nexmark-2.29.0-spark.jar” that should have been built
locally (the Nexmark doc references that jar).

Can someone provide some guidance and help? Thanks.




Why is GroupBy involved in the file save operation?

2021-05-21 Thread Tao Li
Hi Beam community,

I wonder why a GroupBy operation is involved in WriteFiles: 
https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/io/WriteFiles.html

This doc mentioned: “The exact parallelism of the write stage can be controlled
using withNumShards(int), typically used to control how many files are produced
or to globally limit the number of workers connecting to an external service.
However, this option can often hurt performance: it adds an additional
GroupByKey to the pipeline.”

When we are saving the PCollection into multiple files, why can’t we simply 
split the PCollection without a key and save each split as a file?

Thanks!
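For concreteness, the two modes that javadoc describes look roughly like this; a minimal sketch using TextIO as a stand-in for any file-based sink, where lines is assumed to be an existing PCollection<String> and the output prefix is illustrative:

import org.apache.beam.sdk.io.TextIO;

// Fixed sharding: exactly 10 output files, at the cost of the extra GroupByKey
// the javadoc warns about.
lines.apply(TextIO.write().to("/tmp/output/part").withNumShards(10));

// Runner-determined sharding: leave withNumShards unset (or pass 0) and the
// runner decides, roughly one file per bundle.
lines.apply(TextIO.write().to("/tmp/output/part"));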


Re: Why is GroupBy involved in the file save operation?

2021-05-21 Thread Tao Li
Reuven, thanks for your response. GroupBy is not involved if we are not
specifying a fixed number of files, correct?

And what is the implication of not specifying the shard number? Is the
parallelism determined by the number of Spark executors that hold data to save?
This is assuming we are using the Spark runner.

What would be the best practice: specifying a fixed shard number, or letting
Beam figure it out for us?

From: Reuven Lax 
Reply-To: "user@beam.apache.org" 
Date: Friday, May 21, 2021 at 4:27 PM
To: user 
Cc: Lian Jiang 
Subject: Re: Why is GroupBy involved in the file save operation?

What you describe is what happens (at least in the Dataflow runner) if auto
sharding is specified in batch. This mechanism tries to split the PCollection
to fully utilize every worker, so it is not appropriate when a fixed number of
shards is desired. A GroupByKey is also necessary in streaming in order to
split an unbounded PCollection using windows/triggers, as windows and triggers
are applied during GroupByKey.

On Fri, May 21, 2021 at 4:16 PM Tao Li 
mailto:t...@zillow.com>> wrote:
Hi Beam community,

I wonder why a GroupBy operation is involved in WriteFiles: 
https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/io/WriteFiles.html

This doc mentioned: “The exact parallelism of the write stage can be controlled
using withNumShards(int), typically used to control how many files are produced
or to globally limit the number of workers connecting to an external service.
However, this option can often hurt performance: it adds an additional
GroupByKey to the pipeline.”

When we are saving the PCollection into multiple files, why can’t we simply 
split the PCollection without a key and save each split as a file?

Thanks!


Re: Why is GroupBy involved in the file save operation?

2021-05-21 Thread Tao Li
Thanks Reuven. Do you know how the bundle size is determined (e.g., with the
Spark runner)? If we are not specifying a shard number, will the number of files
be total_size/bundle_size?

From: Reuven Lax 
Reply-To: "user@beam.apache.org" 
Date: Friday, May 21, 2021 at 4:46 PM
To: user 
Cc: Lian Jiang 
Subject: Re: Why is GroupBy involved in the file save operation?



On Fri, May 21, 2021 at 4:35 PM Tao Li 
mailto:t...@zillow.com>> wrote:
Reuven thanks for your response.  GroupBy is not involved if we are not 
specifying fixed number of files, correct?

Correct.

And what’s the implication of not specifying the shard number? Is the 
parallelism determined by the number of spark executors that hold data to save? 
This is assuming we are using spark runner.

Each bundle becomes a file. I'm not entirely sure how the spark runner 
determines what the bundles should be.


What would be the best practice? Specifying the fixed shard number or asking 
beam to figure it out for us?

From: Reuven Lax mailto:re...@google.com>>
Reply-To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Date: Friday, May 21, 2021 at 4:27 PM
To: user mailto:user@beam.apache.org>>
Cc: Lian Jiang mailto:li...@zillowgroup.com>>
Subject: Re: Why is GroupBy involved in the file save operation?

What you describe is what happens (at least in the Dataflow runner) if auto
sharding is specified in batch. This mechanism tries to split the PCollection
to fully utilize every worker, so it is not appropriate when a fixed number of
shards is desired. A GroupByKey is also necessary in streaming in order to
split an unbounded PCollection using windows/triggers, as windows and triggers
are applied during GroupByKey.

On Fri, May 21, 2021 at 4:16 PM Tao Li 
mailto:t...@zillow.com>> wrote:
Hi Beam community,

I wonder why a GroupBy operation is involved in WriteFiles: 
https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/io/WriteFiles.html

This doc mentioned: “The exact parallelism of the write stage can be controlled
using withNumShards(int), typically used to control how many files are produced
or to globally limit the number of workers connecting to an external service.
However, this option can often hurt performance: it adds an additional
GroupByKey to the pipeline.”

When we are saving the PCollection into multiple files, why can’t we simply 
split the PCollection without a key and save each split as a file?

Thanks!


How to specify a spark config with Beam spark runner

2021-06-09 Thread Tao Li
Hi Beam community,

We are trying to specify a Spark config,
“spark.hadoop.fs.s3a.canned.acl=BucketOwnerFullControl”, in the spark-submit
command for a Beam app. I only see a limited set of Spark options supported
according to this doc: https://beam.apache.org/documentation/runners/spark/

How can we specify an arbitrary Spark config? Please advise. Thanks!




Re: How to specify a spark config with Beam spark runner

2021-06-17 Thread Tao Li
Hi Alexey,

Thanks we will give it a try.

From: Alexey Romanenko 
Reply-To: "user@beam.apache.org" 
Date: Thursday, June 10, 2021 at 5:14 AM
To: "user@beam.apache.org" 
Subject: Re: How to specify a spark config with Beam spark runner

Hi Tao,

"Limited spark options”, that you mentioned, are Beam's application arguments 
and if you run your job via "spark-submit" you should still be able to 
configure Spark application via normal spark-submit “--conf key=value” CLI 
option.
Doesn’t it work for you?

—
Alexey
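For example, the config from the original question can be passed straight to spark-submit, while the Beam pipeline options follow the application jar. A sketch; the class and jar names are placeholders:

spark-submit \
  --class com.example.MyBeamApp \
  --conf spark.hadoop.fs.s3a.canned.acl=BucketOwnerFullControl \
  my-beam-app-bundled.jar \
  --runner=SparkRunner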


On 10 Jun 2021, at 01:29, Tao Li mailto:t...@zillow.com>> 
wrote:

Hi Beam community,

We are trying to specify a spark config 
“spark.hadoop.fs.s3a.canned.acl=BucketOwnerFullControl” in the spark-submit 
command for a beam app. I only see limited spark options supported according to 
this doc: 
https://beam.apache.org/documentation/runners/spark/

How can we specify an arbitrary spark config? Please advise. Thanks!



Re: Beam Calcite SQL SparkRunner Performance

2021-07-08 Thread Tao Li
That makes sense. Thanks Alexey!

From: Alexey Romanenko 
Date: Tuesday, July 6, 2021 at 10:33 AM
To: Tao Li 
Cc: Yuchu Cao , "user@beam.apache.org" 
Subject: Re: Beam Calcite SQL SparkRunner Performance

I think it’s quite expected, since Spark may push down the SQL query (or some
parts of the query) to the IO and/or RDD level and apply different types of
optimisations there, whereas Beam SQL translates an SQL query into a general
Beam pipeline, which is then translated by the SparkRunner into a Spark
pipeline (in your case).

So, potentially we can also have some push-downs here, like the schema
projection that we already have for ParquetIO. I believe that “filters” can be
the next step, but joins could be tricky since they are currently based on
other Beam PTransforms.

—
Alexey
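To make the ParquetIO schema-projection point concrete: narrowing the read to just the columns the query touches is the manual equivalent of the projection push-down that native Spark applies automatically for this query. A rough sketch; pipeline, the schemas, and the path are placeholders, and the exact meaning of the two withProjection arguments is described in the ParquetIO javadoc:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.values.PCollection;

// Placeholders: fullSchema is the Avro schema of the Parquet files,
// projectionSchema keeps only the 'auction' and 'price' fields, and
// encoderSchema is the schema used to encode the projected records.
PCollection<GenericRecord> bids =
    pipeline.apply(
        ParquetIO.read(fullSchema)
            .from("s3://bucket/nexmark_bid/*.parquet")
            .withProjection(projectionSchema, encoderSchema));

The SELECT auction, price query then only has to carry the two columns it actually uses through the rest of the Beam pipeline.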


On 6 Jul 2021, at 04:39, Tao Li mailto:t...@zillow.com>> wrote:

@Alexey Romanenko<mailto:aromanenko@gmail.com> do you have any thoughts on
this issue? It looks like the DAG compiled by “Beam on Spark” has many more
stages than native Spark, which results in more shuffling and thus longer
processing time.

From: Yuchu Cao mailto:yuc...@trulia.com>>
Date: Monday, June 28, 2021 at 8:09 PM
To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Cc: Tao Li mailto:t...@zillow.com>>
Subject: Beam Calcite SQL SparkRunner Performance

Hi Beam community,

We are trying to compare the performance of Beam SQL on Spark with native
Spark. The query used for the comparison is below. The nexmark_bid dataset is
in Parquet format and the file size is about 35 GB.
SELECT auction, price FROM nexmark_bid WHERE auction = 1007 OR auction = 1020 
OR auction = 2001 OR auction = 2019 OR auction = 1087

We noticed that the Beam Spark job execution had 16 stages in total, while the
native Spark job had only 2 stages, and the native Spark job was 7 times faster
than the Beam Spark job with the same resource allocation settings in the
spark-submit commands.

Any reason why the Beam Spark job execution created more stages and
mapPartitionRDDs than native Spark? Can the performance of such a query be
improved in any way? Thank you!

Beam Spark job stages and stage 11 DAG: [screenshot omitted]






Native Spark job stages and stage 1 DAG: [screenshot omitted]





Re: [Question] Snowflake IO cross account s3 write

2021-07-20 Thread Tao Li
Can someone help with this issue? It’s a blocker for our use of Beam for
Snowflake IO.

Thanks so much!

From: Anuj Gandhi 
Reply-To: "user@beam.apache.org" 
Date: Friday, July 16, 2021 at 12:07 PM
To: "user@beam.apache.org" 
Cc: "Tao Li (@taol)" <_git...@zillowgroup.com>
Subject: [Question] Snowflake IO cross account s3 write

Hi team,

I’m using the Snowflake IO plugin to write to Snowflake on the Spark runner.
I’m using an S3 bucket as the staging bucket, and the bucket is set up in a
different account. I want to set the S3 object ACL to bucket-owner-full-control
while writing.

  1.  Do you have a status update on ticket [1]? Is it possible to prioritize
it?
  2.  Is there a way to force Snowflake IO to use the Hadoop S3 connector
instead of using S3FileSystem? We have ACL settings set up in the Hadoop
configs on the Spark cluster.

[1]
https://issues.apache.org/jira/browse/BEAM-10850


Re: Perf issue with Beam on spark (spark runner)

2021-08-05 Thread Tao Li
Hi Alexey,

It was a great presentation!

Regarding my perf testing, I was not doing any aggregation, filtering,
projection, or joining. I was simply reading all the fields of the Parquet
files and then immediately saving the PCollection back to Parquet.

Regarding SDF translation, is it enabled by default?

I will check out ParquetIO splittable. Thanks!

From: Alexey Romanenko 
Date: Thursday, August 5, 2021 at 6:40 AM
To: Tao Li 
Cc: "user@beam.apache.org" , Andrew Pilloud 
, Ismaël Mejía , Kyle Weaver 
, Yuchu Cao 
Subject: Re: Perf issue with Beam on spark (spark runner)

It’s very likely that Spark SQL may have much better performance because of SQL 
push-downs and avoiding additional ser/deser operations.

At the same time, did you try to leverage “withProjection()” in ParquetIO to
project only the fields that you need?

Did you use the splittable ParquetIO read (it is not enabled by default; fixed
in [1])?

Also, using the SDF translation for Read on the Spark runner can cause
performance degradation as well (we noticed that in our experiments). Try the
non-SDF read if you have not yet [2].
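A rough sketch of the splittable read and the non-SDF toggle mentioned above; pipeline, schema, and the path are placeholders:

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.values.PCollection;

// withSplit() opts in to the splittable Parquet read tracked in [1]; it is not
// the default in the Beam versions discussed in this thread.
PCollection<GenericRecord> records =
    pipeline.apply(
        ParquetIO.read(schema)
            .from("s3://bucket/input/*.parquet")
            .withSplit());

// For [2], the non-SDF read translation is usually selected with the pipeline
// option --experiments=use_deprecated_read.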


PS: Yesterday at Beam Summit we (Ismaël and I) gave a related talk. I’m not
sure if a recording is available yet, but you can find the slides here [3];
they may be helpful.


—
Alexey

[1] 
https://issues.apache.org/jira/browse/BEAM-12070
[2] 
https://issues.apache.org/jira/browse/BEAM-10670
[3] 
https://drive.google.com/file/d/17rJC0BkxpFFL1abVL01c-D0oHvRRmQ-O/view?usp=sharing



On 5 Aug 2021, at 03:07, Tao Li mailto:t...@zillow.com>> wrote:

@Alexey Romanenko<mailto:aromanenko@gmail.com> @Ismaël
Mejía<mailto:ieme...@gmail.com> I assume you are experts on the Spark runner.
Can you please take a look at this thread and confirm whether this JIRA covers
the causes: https://issues.apache.org/jira/browse/BEAM-12646 ?

This perf issue is currently a blocker for me.

Thanks so much!

From: Tao Li mailto:t...@zillow.com>>
Reply-To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Date: Friday, July 30, 2021 at 3:53 PM
To: Andrew Pilloud mailto:apill...@google.com>>, 
"user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Cc: Kyle Weaver mailto:kcwea...@google.com>>, Yuchu Cao 
mailto:yuc...@trulia.com>>
Subject: Re: Perf issue with Beam on spark (spark runner)

Thanks everyone for your help.

We actually did another round of perf comparison between Beam (on Spark) and
native Spark, without any projection/filtering in the query (to rule out the
“predicate pushdown” factor).

Beam with the Spark runner is still taking 3-5x as long as native Spark, and
the cause is https://issues.apache.org/jira/browse/BEAM-12646