Potential issue with the flink runner in streaming mode

2020-11-24 Thread Tao Li
Hi Beam community, I am running into a problem with “org.apache.beam:beam-runners-flink-1.11:2.25.0” and “org.apache.beam:beam-runners-flink-1.10:2.25.0”. I am doing some local testing with the flink runners in embedded mode. The problem is that I cannot save data into local files using those

Re: Potential issue with the flink runner in streaming mode

2020-11-24 Thread Tao Li
On Tue, Nov 24, 2020 at 10:19 AM Tao Li <t...@zillow.com> wrote: Hi Beam community, I am running into a problem with “org.apache.beam:beam-runners-flink-1.11:2.25.0” and “org.apache.beam:beam-runners-flink-1.10:2.25.0”. I am doing some local test

Re: Potential issue with the flink runner in streaming mode

2020-11-24 Thread Tao Li
On Tue, Nov 24, 2020 at 2:07 PM Tao Li <t...@zillow.com> wrote: Yep it works with “--experiments=use_deprecated_read”. Is this a regression? From:

Re: Potential issue with the flink runner in streaming mode

2020-11-24 Thread Tao Li
Correction: And as discussed with Kyle, adding “--experiments=use_deprecated_read” worked with 2.25. From: Tao Li Date: Tuesday, November 24, 2020 at 3:19 PM To: "user@beam.apache.org" , Kyle Weaver , Boyuan Zhang Subject: Re: Potential issue with the flink runner in streaming mo

Re: Potential issue with the flink runner in streaming mode

2020-11-25 Thread Tao Li
Zhang Date: Tuesday, November 24, 2020 at 3:27 PM To: Tao Li Cc: "user@beam.apache.org" , Kyle Weaver Subject: Re: Potential issue with the flink runner in streaming mode And is it a batch pipeline or a streaming pipeline? On Tue, Nov 24, 2020 at 3:25 PM Tao Li <t...@zillow.c

Re: Potential issue with the flink runner in streaming mode

2020-11-30 Thread Tao Li
Thanks Boyuan! From: Boyuan Zhang Date: Wednesday, November 25, 2020 at 10:52 AM To: Tao Li Cc: "user@beam.apache.org" , Kyle Weaver Subject: Re: Potential issue with the flink runner in streaming mode Thanks for reporting this issue, Tao. That's all I need for debugging

Quick question regarding production readiness of ParquetIO

2020-11-30 Thread Tao Li
Hi Beam community, According to this link the ParquetIO is still considered experimental: https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/parquet/ParquetIO.html Does it mean it’s not yet ready for prod usage? If that’s the case, when will it be ready? Also, is there any

Quick question about KafkaIO.Write

2020-12-08 Thread Tao Li
Hi Beam community, I got a quick question about withValueSerializer() method of KafkaIO.Write class: https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/kafka/KafkaIO.Write.html The withValueSerializer method does not support passing in a serializer provider. The problem wit

Re: Quick question about KafkaIO.Write

2020-12-10 Thread Tao Li
Schema Registry in advance by yourself with SchemaRegistryClient to create an Avro record for write (e.g. GenericRecord) and then set KafkaAvroSerializer as ValueSerializer and specify “schema.registry.url” in producer properties. On 8 Dec 2020, at 20:59, Tao Li <t...@zillow.com> wrote: H
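The approach described in this reply — registering the schema yourself and handing KafkaIO a Confluent KafkaAvroSerializer plus a `schema.registry.url` producer property — can be sketched roughly as follows. The broker address, topic name, registry URL, and the raw class cast are illustrative assumptions, not details from the thread:

```java
import java.util.Map;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.common.serialization.StringSerializer;
import io.confluent.kafka.serializers.KafkaAvroSerializer;

public class KafkaAvroWriteSketch {
  @SuppressWarnings({"unchecked", "rawtypes"})
  static KafkaIO.Write<String, GenericRecord> buildWrite() {
    return KafkaIO.<String, GenericRecord>write()
        .withBootstrapServers("broker:9092")   // assumed broker address
        .withTopic("events")                   // assumed topic name
        .withKeySerializer(StringSerializer.class)
        // KafkaAvroSerializer implements Serializer<Object>, so a raw cast is
        // needed to satisfy withValueSerializer's Class<? extends Serializer<V>>.
        .withValueSerializer((Class) KafkaAvroSerializer.class)
        // Producer-level config carries the Schema Registry location.
        .withProducerConfigUpdates(
            Map.<String, Object>of("schema.registry.url", "http://registry:8081"));
  }
}
```

The raw-class cast is the usual workaround for the generics mismatch the original question runs into, since `withValueSerializer` does not accept a serializer provider.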

Question regarding GroupByKey operator on unbounded data

2020-12-10 Thread Tao Li
Hi Beam community, I got a quick question about the GroupByKey operator. According to this doc, if we are using an unbounded PCollection, it’s required to specify either non-global windowing
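The requirement this question refers to — a non-global windowing (or triggering) strategy before grouping an unbounded PCollection — looks roughly like the sketch below; the key/value types and the five-minute window are illustrative choices, not from the thread:

```java
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowedGroupByKeySketch {
  static PCollection<KV<String, Iterable<Long>>> group(
      PCollection<KV<String, Long>> input) {
    return input
        // Without a non-global window (or a trigger), GroupByKey on an
        // unbounded input would wait forever for the global window to close.
        .apply(Window.<KV<String, Long>>into(
            FixedWindows.of(Duration.standardMinutes(5))))
        .apply(GroupByKey.create());
  }
}
```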

Re: Question regarding GroupByKey operator on unbounded data

2020-12-11 Thread Tao Li
Ying-Chang Cheng Subject: Re: Question regarding GroupByKey operator on unbounded data Can you explain more about what exactly you are trying to do? On Thu, Dec 10, 2020 at 2:51 PM Tao Li <t...@zillow.com> wrote: Hi Beam community, I got a quick question about the GroupByKey operator.

Re: Question regarding GroupByKey operator on unbounded data

2020-12-11 Thread Tao Li
. Thanks! From: Tao Li Reply-To: "user@beam.apache.org" Date: Friday, December 11, 2020 at 10:29 AM To: "user@beam.apache.org" , Reuven Lax Cc: Mehmet Emre Sahin , Ying-Chang Cheng Subject: Re: Question regarding GroupByKey operator on unbounded data Hi @Reuven Lax<

Re: Question regarding GroupByKey operator on unbounded data

2020-12-12 Thread Tao Li
Sorry I think I had some misunderstanding on keyBy API from Flink. It’s not exactly equivalent to GroupByKey from Beam. So please ignore my question and this email thread. Thanks for help though 😊 From: Tao Li Date: Friday, December 11, 2020 at 7:29 PM To: "user@beam.apache.org" ,

Re: Question regarding GroupByKey operator on unbounded data

2020-12-16 Thread Tao Li
On 12/13/20 6:27 AM, Tao Li wrote: Sorry I think I had some misundersta

Quick question regarding ParquetIO

2021-01-06 Thread Tao Li
Hi beam community, Quick question about ParquetIO. Is there a way to avoid specifying the avro schema when reading parquet files? The reason is that we may not know the parquet schema until we read th

Re: Quick question regarding ParquetIO

2021-01-06 Thread Tao Li
Regards, Alexey On 6 Jan 2021, at 18:57, Tao Li <t...@zillow.com>

Re: Quick question regarding ParquetIO

2021-01-06 Thread Tao Li
Regards, Alexey On 6 Jan 2021, at 18:57, Tao Li <t...@zillow.com> wrote: Hi beam community, Quick question about ParquetIO

Re: Quick question regarding ParquetIO

2021-01-07 Thread Tao Li
instance and then get the schema attached to it? Thanks! From: Brian Hulette Date: Thursday, January 7, 2021 at 9:38 AM To: Tao Li Cc: "user@beam.apache.org" Subject: Re: Quick question regarding ParquetIO On Wed, Jan 6, 2021 at 11:07 AM Tao Li <t...@zillow.com> wrote: H

Re: Quick question regarding ParquetIO

2021-01-07 Thread Tao Li
: Thursday, January 7, 2021 at 9:56 AM To: "user@beam.apache.org" Subject: Re: Quick question regarding ParquetIO If you want to get just a PCollection as output then you would still need to set AvroCoder, but which schema to use in this case? On 6 Jan 2021, at 19:53, Tao Li <t...@zil

Re: Quick question regarding ParquetIO

2021-01-08 Thread Tao Li
ion as output PCollection since it can’t infer a Coder based on the provided “parseFn”. I guess it was done intentionally in this way and I doubt that we can have a proper coder for PCollection without knowing a schema. Maybe some Avro experts here can add more on this if we can somehow overcome

Is there an array explode function/transform?

2021-01-12 Thread Tao Li
Hi community, Is there a beam function to explode an array (similarly to spark sql’s explode())? I did some research but did not find anything. BTW I think we can potentially use FlatMap to implement the explode functionality, but a Beam provided function would be very handy. Thanks a lot!
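As the replies in this thread point out, Beam's built-in Flatten.iterables covers the explode use case. A minimal runnable sketch (element values are illustrative; assumes the direct runner is on the classpath):

```java
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.IterableCoder;
import org.apache.beam.sdk.coders.VarIntCoder;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.values.PCollection;

public class ExplodeSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    // Each input element is an array-like iterable; Flatten.iterables emits
    // its members as individual elements, like SQL explode().
    PCollection<Iterable<Integer>> arrays =
        p.apply(Create.<Iterable<Integer>>of(Arrays.asList(1, 2), Arrays.asList(3))
            .withCoder(IterableCoder.of(VarIntCoder.of())));
    PCollection<Integer> exploded = arrays.apply(Flatten.iterables());
    PAssert.that(exploded).containsInAnyOrder(1, 2, 3);
    p.run().waitUntilFinish();
  }
}
```

For explode-with-other-fields semantics (keeping sibling columns alongside each array element), a FlatMapElements that emits one copy of the row per array member is the closest equivalent.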

Re: Is there an array explode function/transform?

2021-01-12 Thread Tao Li
12, 2021 at 2:04 PM To: user Subject: Re: Is there an array explode function/transform? Have you tried Flatten.iterables On Tue, Jan 12, 2021, 2:02 PM Tao Li <t...@zillow.com> wrote: Hi community, Is there a beam function to explode an array (similarly to spark sql’s explode())? I

Re: Quick question regarding ParquetIO

2021-01-13 Thread Tao Li
APIs for python SDK here I think. Kobe On Fri, Jan 8, 2021 at 9:34 AM Tao Li <t...@zillow.com> wrote: Thanks Alexey for your explanation. That’s also what I was thinking. Parquet files already have the schema built in, so it might be feasible to infer a coder automatically (like

Re: Is there an array explode function/transform?

2021-01-13 Thread Tao Li
How is it different? It'd help if you could provide the signature (input and output PCollection types) of the transform you have in mind. On Tue, Jan 12, 2021 at 4:49 PM Tao Li <t...@zillow.com> wrote: @Reuven

Regarding the field ordering after Select.Flattened transform

2021-01-19 Thread Tao Li
Hi community, I have been experimenting with Select.Flattened transform and noticed that the field order in the flattened schema is not consistent with the order of the top level fields from the original schema. For example, in the original schema, we have field “foo” as the first field and it

Re: Quick question regarding ParquetIO

2021-01-19 Thread Tao Li
https://issues.apache.org/jira/browse/BEAM-11650

Re: Regarding the field ordering after Select.Flattened transform

2021-01-20 Thread Tao Li
recommend against relying on it. Instead fields should be addressed by name, either with Row.getValue(String), or by mapping to a user type. Is there a reason you want to rely on a particular field order? Maybe when writing to certain IOs field order could be important. On Tue, Jan 19, 2021 at 1:36

Overwrite support from ParquetIO

2021-01-25 Thread Tao Li
Hi Beam community, Does ParquetIO support an overwrite behavior when saving files? More specifically, I would like to wipe out all existing parquet files before a write operation. Is there a ParquetIO API to support that? Thanks!
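ParquetIO itself exposes no overwrite mode, so one pragmatic pattern is to delete matching files with Beam's FileSystems API before launching the pipeline. This is a hedged sketch, not a confirmed ParquetIO feature; the glob pattern is an assumption, and for non-local filesystems (e.g. S3) `FileSystems.setDefaultPipelineOptions` must have been called first so the scheme is registered:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.stream.Collectors;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.io.fs.MoveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;

public class WipeOutputSketch {
  // Deletes every file matching the glob; intended to run before the pipeline
  // starts, since FileIO/ParquetIO do not clean the output directory.
  static void deleteExisting(String glob) {
    try {
      List<ResourceId> ids = FileSystems.match(glob).metadata().stream()
          .map(MatchResult.Metadata::resourceId)
          .collect(Collectors.toList());
      if (!ids.isEmpty()) {
        FileSystems.delete(ids, MoveOptions.StandardMoveOptions.IGNORE_MISSING_FILES);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Usage would be `WipeOutputSketch.deleteExisting(outputDir + "/*.parquet")` before `pipeline.run()`.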

Re: Overwrite support from ParquetIO

2021-01-27 Thread Tao Li
es. On 25 Jan 2021, at 19:10, Tao Li <t...@zillow.com> wrote: Hi Beam community, Does ParquetIO support an overwrite behavior when saving files? More specifically, I would like to wipe out all existing parquet files before a write operation. Is there a ParquetIO API to support that? Thanks!

Re: Overwrite support from ParquetIO

2021-01-27 Thread Tao Li
Date: Wednesday, January 27, 2021 at 3:45 PM To: user Cc: Alexey Romanenko Subject: Re: Overwrite support from ParquetIO On Wed, Jan 27, 2021 at 12:06 PM Tao Li <t...@zillow.com> wrote: @Alexey Romanenko <aromanenko@gmail.com> thanks for your response. Regarding your

Potential bug with ParquetIO.read when reading arrays

2021-01-28 Thread Tao Li
Hi Beam community, I am seeing an error when reading an array field using ParquetIO. I was using beam 2.25 and the direct runner for testing. Is this a bug or a known issue? Am I missing anything here? Please help me root cause this issue. Thanks so much! Attached are the avro schema and the pa

Re: Potential bug with ParquetIO.read when reading arrays

2021-01-28 Thread Tao Li
BTW I tried avro 1.8 and 1.9 and both have the same error. So we can probably rule out any avro issue. From: Tao Li Reply-To: "user@beam.apache.org" Date: Thursday, January 28, 2021 at 9:07 AM To: "user@beam.apache.org" Subject: Potential bug with ParquetIO.read when rea

Re: Overwrite support from ParquetIO

2021-01-28 Thread Tao Li
I don’t think it will clean up the output directory before write. Though, if there are name collisions between existing and new output files (it depends on the naming strategy used) then I think the old files will be overwritten by new ones. On 25 Jan 2021, at 19:10, Tao Li <t...@zillow.c

Re: Potential bug with ParquetIO.read when reading arrays

2021-01-29 Thread Tao Li
Hi community, Can someone take a look at this issue? It is kind of a blocker to me right now. Really appreciate your help! From: Tao Li Reply-To: "user@beam.apache.org" Date: Thursday, January 28, 2021 at 6:13 PM To: "user@beam.apache.org" Subject: Re: Potential bug with

Re: Potential bug with ParquetIO.read when reading arrays

2021-01-29 Thread Tao Li
basically using parquet jars from spark distributable directly and now everything is compatible. From: Tao Li Reply-To: "user@beam.apache.org" Date: Friday, January 29, 2021 at 7:45 AM To: "user@beam.apache.org" Subject: Re: Potential bug with ParquetIO.read when reading arrays

Re: Potential bug with ParquetIO.read when reading arrays

2021-01-29 Thread Tao Li
1 at 10:53 AM To: user Subject: Re: Potential bug with ParquetIO.read when reading arrays Thanks. It might be something good to document in case other users run into this as well. Can you file a JIRA with the details? On Fri, Jan 29, 2021 at 10:47 AM Tao Li <t...@zillow.com> wrote: OK

Re: Potential bug with ParquetIO.read when reading arrays

2021-01-29 Thread Tao Li
Thanks @Chamikara Jayalath <chamik...@google.com> I created this jira: https://issues.apache.org/jira/browse/BEAM-11721 From: Chamikara Jayalath Date: Friday, January 29, 2021 at 2:47 PM To: Tao Li Cc: "user@beam.apache.org" Subject: Re: Potential bug with ParquetIO.r

Re: Potential bug with ParquetIO.read when reading arrays

2021-01-30 Thread Tao Li
the schema? I briefly looked at the ParquetIO source code but has not figured it out yet. From: Tao Li Reply-To: "user@beam.apache.org" Date: Friday, January 29, 2021 at 3:37 PM To: Chamikara Jayalath , "user@beam.apache.org" Subject: Re: Potential bug with ParquetIO.read when r

Re: Potential bug with ParquetIO.read when reading arrays

2021-01-30 Thread Tao Li
schema, is it possible to make the avro schema specification optional for ParquetIO.read? Thanks! From: Tao Li Reply-To: "user@beam.apache.org" Date: Saturday, January 30, 2021 at 1:54 PM To: "user@beam.apache.org" Subject: Re: Potential bug with ParquetIO.read when reading arr

Re: Potential bug with ParquetIO.read when reading arrays

2021-02-03 Thread Tao Li
} ], "default": null } ] } You can see that the schema becomes an array of record type (which contains a "element" field). The reason is probably that internally spark parquet is defining a “list” record type. The problem is that this avro schema i

Re: Potential bug with ParquetIO.read when reading arrays

2021-02-03 Thread Tao Li
I am also wondering if leveraging this parquet setting "parquet.avro.add-list-element-records" along with BEAM-11527<https://issues.apache.org/jira/browse/BEAM-11527> can solve my problem... From: Tao Li Reply-To: "user@beam.apache.org" Date: Wednesday, February 3,

Regarding Beam 2.28 release timeline

2021-02-17 Thread Tao Li
Hi Beam community, I am looking forward to the Beam 2.28 release, which will probably include BEAM-11527. We will depend on BEAM-11527 for a major work item on my side. Can someone please provide an ETA

Re: Apache Beam's UX Research Findings Readout

2021-02-18 Thread Tao Li
Hi @Carlos Camacho is there a recording of this meeting? Thanks! From: Carlos Camacho Reply-To: "user@beam.apache.org" Date: Thursday, February 11, 2021 at 9:06 AM To: "user@beam.apache.org" Subject: Apache Beam's UX Research Findings Readout Hi everyone, T

Potential bug with BEAM-11460?

2021-02-23 Thread Tao Li
Hi Beam community, I cannot log into Beam jira so I am asking this question here. I am testing this new feature from Beam 2.28 and see below error: Exception in thread "main" java.lang.IllegalArgumentException: Unable to infer coder for output of parseFn. Specify it explicitly using withCoder()

How to specify "fs.s3.enableServerSideEncryption" in Beam

2021-02-24 Thread Tao Li
Hi Beam community, We need to specify the "fs.s3.enableServerSideEncryption" setting when saving parquet files to s3. This doc describes this setting https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-emrfs-encryption.html What would be the recommended way to set that? Please advise. Th

Re: How to specify "fs.s3.enableServerSideEncryption" in Beam

2021-02-24 Thread Tao Li
Just found this https://github.com/apache/beam/blob/master/sdks/java/io/amazon-web-services/src/main/java/org/apache/beam/sdk/io/aws/options/S3Options.java#L70 Is this the right approach? Thanks! From: Tao Li Reply-To: "user@beam.apache.org" Date: Wednesday, February 24, 2021 at
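S3Options (linked above) is indeed the knob Beam exposes for S3 server-side encryption. A hedged sketch of setting it programmatically — the `aws:kms` algorithm string is a placeholder choice, and the mapping to EMR's `fs.s3.enableServerSideEncryption` behavior is an assumption:

```java
import org.apache.beam.sdk.io.aws.options.S3Options;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class S3SseSketch {
  static S3Options configure(String[] args) {
    PipelineOptions base = PipelineOptionsFactory.fromArgs(args).create();
    S3Options s3 = base.as(S3Options.class);
    // Selects SSE for objects Beam writes to S3; SSE-S3 vs SSE-KMS is chosen
    // via the algorithm string ("AES256" or "aws:kms").
    s3.setSSEAlgorithm("aws:kms");
    return s3;
  }
}
```

The same option can also be passed on the command line as `--sseAlgorithm=aws:kms`, since S3Options is a regular PipelineOptions interface.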

Re: Potential bug with BEAM-11460?

2021-02-25 Thread Tao Li
What would be the right way to fix this? On Tue, Feb 23, 2021 at 5:24 PM Tao Li

Re: Potential bug with BEAM-11460?

2021-02-26 Thread Tao Li
Thanks @Anant Damle <ana...@google.com> for fixing the issue with BEAM-11460 and BEAM-11527 so quickly! From: Anant Damle Date: Friday, February 26, 2021 at 6:49 AM To: Tao Li Cc: "user@beam.apache.org" , Brian Hulette Subject: Re: Potential bug with BEAM-11460? @Tao

Re: Potential bug with ParquetIO.read when reading arrays

2021-02-26 Thread Tao Li
@Anant Damle <ana...@google.com> @Brian Hulette <bhule...@google.com> thanks so much for your support and help! From: Tao Li Reply-To: "user@beam.apache.org" Date: Wednesday, February 3, 2021 at 10:51 PM To: "user@beam.apache.org" Subject: Re: Potenti

Regarding the over window query support from Beam SQL

2021-03-01 Thread Tao Li
Hi Beam community, Querying over a window for ranking etc is pretty common in SQL use cases. I have found this jira https://issues.apache.org/jira/browse/BEAM-9198 Do we have a plan to support this? If there is no such plan in near future, are Beam developers supposed to implement this function

Re: Regarding the over window query support from Beam SQL

2021-03-02 Thread Tao Li
+ Rui Wang. Looks like Rui has been working on this jira. From: Tao Li Date: Monday, March 1, 2021 at 9:51 PM To: "user@beam.apache.org" Subject: Regarding the over window query support from Beam SQL Hi Beam community, Querying over a window for ranking etc is pretty common in SQL

A problem with ZetaSQL

2021-03-02 Thread Tao Li
Hi all, I was following the instructions from this doc to play with ZetaSQL https://beam.apache.org/documentation/dsls/sql/overview/ The query is really simple: options.as(BeamSqlPipelineOptions.class).setPlannerName("org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner") input.apply

Re: A problem with ZetaSQL

2021-03-02 Thread Tao Li
e.org" Date: Tuesday, March 2, 2021 at 10:31 AM To: user Subject: Re: A problem with ZetaSQL Thanks for reporting this Tao - could you share what the type of your input PCollection is? On Tue, Mar 2, 2021 at 9:33 AM Tao Li <t...@zillow.com> wrote: Hi all, I was following

Re: Regarding the over window query support from Beam SQL

2021-03-02 Thread Tao Li
rder by [1] range between UNBOUNDED PRECEDING and CURRENT ROW aggs [RANK()])]) BeamIOSourceRel(table=[[beam, PCOLLECTION]]) From: Rui Wang Date: Tuesday, March 2, 2021 at 10:43 AM To: Tao Li Cc: "user@beam.apache.org" Subject: Re: Regarding the over window query support

Does writeDynamic() support writing different element groups to different output paths?

2021-03-03 Thread Tao Li
Hi Beam community, I have a streaming app that writes every hour’s data to a folder named with this hour. With Flink (for example), we can leverage “Bucketing File Sink”: https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/connectors/filesystem_sink.html However I am not seeing Bea
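FileIO.writeDynamic can reproduce this bucketing-sink pattern by routing each element to a per-hour destination. A hedged text-file sketch under stated assumptions — each record is a String that already begins with a `yyyy-MM-ddTHH` prefix, and the naming scheme is illustrative:

```java
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;

public class HourlyBucketsSketch {
  static FileIO.Write<String, String> hourlyWrite(String baseDir) {
    return FileIO.<String, String>writeDynamic()
        // Destination = the hour prefix of the record ("2021-03-03T17").
        .by(line -> line.substring(0, 13))
        .withDestinationCoder(StringUtf8Coder.of())
        .via(TextIO.sink())
        .to(baseDir)
        // One subfolder per hour, like Flink's bucketing sink.
        .withNaming(hour -> FileIO.Write.defaultNaming(hour + "/part", ".txt"))
        .withNumShards(1);
  }
}
```

For Parquet output the `.via(TextIO.sink())` line would become a `ParquetIO.sink(schema)` with a GenericRecord element type instead of String.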

Re: Does writeDynamic() support writing different element groups to different output paths?

2021-03-04 Thread Tao Li
system path separator, e.g. S3. On Wed, Mar 3, 2021 at 5:36 PM Tao Li <t...@zillow.com> wrote: Hi Beam community, I have a streaming app that writes every hour’s data to a folder named with this hour. With Flink (for example), we can leverage “Bucketing File Sink”: https://ci.ap

Re: Does writeDynamic() support writing different element groups to different output paths?

2021-03-04 Thread Tao Li
, it works fine. Does anyone have any ideas? Thanks! inputData.apply(FileIO.write() .withNumShards(10) .via(ParquetIO.sink(inputAvroSchema)) .to(outputPath) .withSuffix(".parquet")); From: Tao Li Reply-To: "user@beam

Re: Does writeDynamic() support writing different element groups to different output paths?

2021-03-04 Thread Tao Li
tion.java:392) at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.getSchemaInformation(ParDoTranslation.java:377) at org.apache.beam.runners.direct.ParDoEvaluatorFactory.forApplication(ParDoEvaluatorFactory.java:87) From: Tao Li Reply-To: "user@beam.apache.org

Re: A problem with ZetaSQL

2021-03-04 Thread Tao Li
Brian the schema is really simple. Just 3 primitive type columns: root |-- column_1: integer (nullable = true) |-- column_2: integer (nullable = true) |-- column_3: string (nullable = true) From: Brian Hulette Date: Thursday, March 4, 2021 at 2:29 PM To: Tao Li Cc: "user@beam.apach

Re: A problem with ZetaSQL

2021-03-05 Thread Tao Li
Robin/Brian, I see. Thanks so much for your help! From: Robin Qiu Date: Friday, March 5, 2021 at 12:31 AM To: Brian Hulette Cc: Tao Li , "user@beam.apache.org" Subject: Re: A problem with ZetaSQL Hi Tao, In ZetaSQL all "integers" are 64 bits. So if your integers in co

Re: Regarding the over window query support from Beam SQL

2021-03-05 Thread Tao Li
Hi Rui, Just following up on this issue. Do you think this is a bug? Is there a workaround? Thanks! From: Tao Li Reply-To: "user@beam.apache.org" Date: Tuesday, March 2, 2021 at 3:37 PM To: Rui Wang Cc: "user@beam.apache.org" Subject: Re: Regarding the over window que

Re: Regarding the over window query support from Beam SQL

2021-03-05 Thread Tao Li
further investigate why the alias did not correctly pass through. The entry point is to investigate from BeamWindowRel. -Rui On Fri, Mar 5, 2021 at 10:20 AM Tao Li <t...@zillow.com> wrote: Hi Rui, Just following up on this issue. Do you think this is a bug? Is there a workaround?

Re: Regarding the over window query support from Beam SQL

2021-03-11 Thread Tao Li
Result: [1, 200, 2] [1, 100, 1] While I expect the result to be: [1, 200, 1] [1, 100, 2] So basically the rank result (by using desc order) seems incorrect to me. Can you please take a look at this issue? Thanks! From: Tao Li Reply-To: "user@beam.apache.org" Date: Friday, March 5,

How to add/alter/drop a Hive partition from a Beam app

2021-03-12 Thread Tao Li
Hi Beam community, I am wondering how we can use some Beam APIs or Beam SQL to perform some Hive DDL operations such as add/alter/drop a partition. I guess I might need to use HCatalogIO, however I am not sure about what syntax to use. Please advise. Thanks!

Is there a perf comparison between Beam (on spark) and native Spark?

2021-03-22 Thread Tao Li
Hi Beam community, I am wondering if there is a doc comparing the perf of Beam (on Spark) and native Spark for batch processing? For example using the TPC-DS benchmark. I did find some relevant links like this

Re: Is there a perf comparison between Beam (on spark) and native Spark?

2021-03-25 Thread Tao Li
On 22 Mar 2021, at 18:00, Tao Li <t...@zillow.com> wrote: Hi Beam community, I am wondering if there is a doc to compare perf of Beam (on Spark) and native spark for

Does Beam DynamoDBIO support DynamoDB Streams?

2021-04-03 Thread Tao Li
Hi Beam community, Does Beam DynamoDBIO support ingesting DynamoDB Streams? Thanks!

Any ETA of flink 1.12.2 support?

2021-04-12 Thread Tao Li
Hi Beam community, Beam 2.28.0 supports flink 1.12.1 for flink runner. We are expecting some bug fixes from flink 1.12.2. Will flink version be upgraded to 1.12.2 with Beam 2.29? And is there an ETA for that? Thanks!

Any easy way to extract values from PCollection?

2021-04-21 Thread Tao Li
Hi Beam community, This is the question I am asking: https://stackoverflow.com/questions/28015924/how-to-extract-contents-from-pcollection-in-cloud-dataflow Thanks!

Re: Any easy way to extract values from PCollection?

2021-04-22 Thread Tao Li
will let you collect() a PCollection - meaning it runs the pipeline and materializes the result. There's no such capability for the other SDKs though. On Wed, Apr 21, 2021 at 8:24 PM Tao Li <t...@zillow.com> wrote: Hi Beam community, This is the question I am asking: https://

Question on late data handling in Beam streaming mode

2021-04-22 Thread Tao Li
Hi Beam community, I am wondering if there is a risk of losing late data from a Beam stream app due to watermarking? I just went through this design doc and noticed the “droppable” definition there: https://docs.google.com/document/d/12r7frmxNickxB5tbpuEh_n35_IJeVZn1peOrBrhhP6Y/edit# Can you
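Whether late records become “droppable” is governed by the window's allowed lateness. A sketch of a strategy that keeps data arriving up to 30 minutes behind the watermark — window size, lateness, trigger, and accumulation mode are all illustrative choices, not recommendations from the thread:

```java
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class AllowedLatenessSketch {
  // Records arriving within 30 minutes of the watermark passing the window end
  // re-fire the pane; anything later than that becomes "droppable".
  static <T> Window<T> windowWithLateness() {
    return Window.<T>into(FixedWindows.of(Duration.standardMinutes(5)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardMinutes(30))
        .accumulatingFiredPanes();
  }
}
```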

Re: Question on late data handling in Beam streaming mode

2021-04-23 Thread Tao Li
ation to allow out of order data. Kenn On Thu, Apr 22, 2021 at 1:26 PM Tao Li <t...@zillow.com> wrote: Hi Beam community, I am wondering if there is a risk of losing late data from a Beam stream app due to watermarking? I just went through this design doc and noticed the “dro

Re: Question on late data handling in Beam streaming mode

2021-04-26 Thread Tao Li
a single event-time window that encompasses all time, and "triggering" an aggregation every 30 seconds based on processing time. On Fri, Apr 23, 2021 at 8:14 AM Tao Li <t...@zillow.com> wrote: Thanks @Kenneth Knowles <k...@apache.org>. I understand we need to specify a

Question on printing out a PCollection

2021-04-29 Thread Tao Li
Hi Beam community, The notebook console from Google Cloud defines a show() API to display a PCollection which is very neat: https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development If we are using a regular jupyter notebook to run beam app, how can we print out a PCollect

Re: Question on printing out a PCollection

2021-04-30 Thread Tao Li
Bokeh, D3.js. Hope this helps! Ning On Fri, Apr 30, 2021 at 9:24 AM Brian Hulette wrote: +Ning Kang +Sam Rohde On Thu, Apr 29, 2021 at 6:13 PM Tao Li wrote:

Does SnowflakeIO support spark runner

2021-05-06 Thread Tao Li
Hi Beam community, Does SnowflakeIO support spark runner? Seems like only direct runner and dataflow runner are supported.. Thanks!

Re: Does SnowflakeIO support spark runner

2021-05-06 Thread Tao Li
: Anuj Gandhi Subject: Re: Does SnowflakeIO support spark runner As far as I know, it should be supported (Beam's abstract model means IOs usually "just work" on all runners). What makes you think it isn't supported? On Thu, May 6, 2021 at 11:52 AM Tao Li <t...@zill

Re: Does SnowflakeIO support spark runner

2021-05-06 Thread Tao Li
Thanks Kyle! From: Kyle Weaver Date: Thursday, May 6, 2021 at 12:19 PM To: Tao Li Cc: "user@beam.apache.org" , Anuj Gandhi Subject: Re: Does SnowflakeIO support spark runner Yeah, I'm pretty sure that documentation is just misleading. All of the options from --runner on

A problem with calcite sql

2021-05-10 Thread Tao Li
Hi Beam community, I am seeing a weird issue by using calcite sql. I don’t understand why it’s complaining my query is not valid. Once I removed “user AS user”, it worked fine. Please advise. Thanks. Exception in thread "main" org.apache.beam.sdk.extensions.sql.impl.ParseException: Unable to p

Re: A problem with calcite sql

2021-05-10 Thread Tao Li
Never mind. Looks like “user” is a reserved name. From: Tao Li Reply-To: "user@beam.apache.org" Date: Monday, May 10, 2021 at 7:10 PM To: "user@beam.apache.org" Cc: Yuan Feng Subject: A problem with calcite sql Hi Beam community, I am seeing a weird issue by using

Re: A problem with calcite sql

2021-05-10 Thread Tao Li
at org.apache.beam.sdk.schemas.utils.AvroUtils.toAvroSchema(AvroUtils.java:341) at org.apache.beam.sdk.schemas.utils.AvroUtils.toAvroSchema(AvroUtils.java:348) From: Tao Li Reply-To: "user@beam.apache.org" Date: Monday, May 10, 2021 at 7:19 PM To: "user@beam.apache.org" Cc: Yua

Re: A problem with calcite sql

2021-05-11 Thread Tao Li
how to work around it, if you can get the output type to be a VARCHAR (instead of CHAR) this problem will go away. You might be able to do something like 'CAST("Your String Literal" AS VARCHAR)', 'TRIM("Your String Literal")' or ' "Your String Literal

Re: A problem with calcite sql

2021-05-11 Thread Tao Li
column 'market_transactionManagement_transactionManagers.email' not found in any table So what would be the right syntax? Thanks! From: Andrew Pilloud Date: Tuesday, May 11, 2021 at 11:51 AM To: Tao Li Cc: "user@beam.apache.org" , Yuan Feng Subject: Re: A problem with calcite sql

Re: A problem with calcite sql

2021-05-12 Thread Tao Li
, column 44: Table 'PCOLLECTION' not found From: Andrew Pilloud Date: Tuesday, May 11, 2021 at 10:38 PM To: Tao Li Cc: "user@beam.apache.org" , Yuan Feng Subject: Re: A problem with calcite sql If the type was just a nested row this shoul

A problem with nexmark build

2021-05-12 Thread Tao Li
Hi Beam community, I have been following this nexmark doc: https://beam.apache.org/documentation/sdks/java/testing/nexmark/ I ran into a problem with “Running query 0 on a Spark cluster with Apache Hadoop YARN” section. I was following the instruction by running “./gradlew :sdks:java:testing:

Why is GroupBy involved in the file save operation?

2021-05-21 Thread Tao Li
Hi Beam community, I wonder why a GroupBy operation is involved in WriteFiles: https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/io/WriteFiles.html This doc mentioned “ The exact parallelism of the write stage can be controlled using withNumShards(int)

Re: Why is GroupBy involved in the file save operation?

2021-05-21 Thread Tao Li
necessary in streaming in order to split an unbounded PCollection using windows/triggers, as windows and triggers are applied during GroupByKey. On Fri, May 21, 2021 at 4:16 PM Tao Li <t...@zillow.com> wrote: Hi Beam community, I wonder why a GroupBy operation is involved in Wr

Re: Why is GroupBy involved in the file save operation?

2021-05-21 Thread Tao Li
g Subject: Re: Why is GroupBy involved in the file save operation? On Fri, May 21, 2021 at 4:35 PM Tao Li <t...@zillow.com> wrote: Reuven thanks for your response. GroupBy is not involved if we are not specifying a fixed number of files, correct? Correct. And what’s the im

How to specify a spark config with Beam spark runner

2021-06-09 Thread Tao Li
Hi Beam community, We are trying to specify a spark config “spark.hadoop.fs.s3a.canned.acl=BucketOwnerFullControl” in the spark-submit command for a beam app. I only see limited spark options supported according to this doc: https://beam.apache.org/documentation/runners/spark/ How can we speci
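Arbitrary Spark settings like this one go through spark-submit's own `--conf` flag rather than Beam's pipeline options, which only cover the runner-level knobs in the doc linked above. A sketch of the command shape — the main class and jar names are placeholders:

```shell
# --conf flags are consumed by Spark itself; arguments after the jar are
# passed to the Beam app and parsed as pipeline options.
spark-submit \
  --class com.example.MyBeamApp \
  --master yarn \
  --conf spark.hadoop.fs.s3a.canned.acl=BucketOwnerFullControl \
  my-beam-app-bundled.jar \
  --runner=SparkRunner
```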

Re: How to specify a spark config with Beam spark runner

2021-06-17 Thread Tao Li
, that you mentioned, are Beam's application arguments, and if you run your job via "spark-submit" you should still be able to configure the Spark application via the normal spark-submit “--conf key=value” CLI option. Doesn’t it work for you? — Alexey On 10 Jun 2021, at 01:29, Tao Li

Re: Beam Calcite SQL SparkRunner Performance

2021-07-08 Thread Tao Li
That makes sense. Thanks Alexey! From: Alexey Romanenko Date: Tuesday, July 6, 2021 at 10:33 AM To: Tao Li Cc: Yuchu Cao , "user@beam.apache.org" Subject: Re: Beam Calcite SQL SparkRunner Performance I think it’s quiet expected since Spark may push down the SQL query (or some pa

Re: [Question] Snowflake IO cross account s3 write

2021-07-20 Thread Tao Li
Can someone help with this issue? It’s a blocker for us to use Beam for snowflake IO. Thanks so much! From: Anuj Gandhi Reply-To: "user@beam.apache.org" Date: Friday, July 16, 2021 at 12:07 PM To: "user@beam.apache.org" Cc: "Tao Li (@taol)" <_git...@zi

Re: Perf issue with Beam on spark (spark runner)

2021-08-05 Thread Tao Li
check out ParquetIO splittable. Thanks! From: Alexey Romanenko Date: Thursday, August 5, 2021 at 6:40 AM To: Tao Li Cc: "user@beam.apache.org" , Andrew Pilloud , Ismaël Mejía , Kyle Weaver , Yuchu Cao Subject: Re: Perf issue with Beam on spark (spark runner) It’s very likely that