Calcite/BeamSql

2022-01-10 Thread Yushu Yao
Hi Folks, a question from a newbie to both Calcite and Beam: I understand that Calcite can build an execution-plan tree using relational algebra and push certain operations down to a "data source", while also allowing source-specific optimizations. I also understand that Beam SQL can run Sql

Re: Calcite/BeamSql

2022-01-10 Thread Yushu Yao
> [2] https://beam.apache.org/documentation/dsls/sql/extensions/create-external-table/#kafka
> [3] https://beam.apache.org/releases/javadoc/2.35.0/org/apache/beam/sdk/extensions/sql/SqlTransform.html
> [4] https://beam.apache.org/documentation/dsls/sql/extensions/create-external-table/#generic-payload-handling

Re: Calcite/BeamSql

2022-01-12 Thread Yushu Yao
> …PCollectionTuple, which, in turn, already contains another PCollection from your Kafka source. So, once you have a PCollectionTuple with two TupleTags (from Kafka and MySql), you can apply SqlTransform over it.
>
> — Alexey
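A minimal sketch of the shape Alexey describes, assuming both sources have already been read into Row PCollections (the tag names, field names, and join query are made up for illustration):

    import org.apache.beam.sdk.extensions.sql.SqlTransform;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionTuple;
    import org.apache.beam.sdk.values.Row;
    import org.apache.beam.sdk.values.TupleTag;

    // Tag each source; the tag ids become table names in the SQL.
    PCollectionTuple inputs = PCollectionTuple
        .of(new TupleTag<Row>("kafka_events"), kafkaRows)
        .and(new TupleTag<Row>("mysql_dim"), mysqlRows);

    // One SqlTransform over both tagged inputs.
    PCollection<Row> joined = inputs.apply(
        SqlTransform.query(
            "SELECT e.f_id, d.f_name "
                + "FROM kafka_events e JOIN mysql_dim d ON e.f_id = d.f_id"));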

Write S3 File with CannedACL

2022-03-08 Thread Yushu Yao
Hi Team, we have a use case that needs `--acl bucket-owner-full-control` added whenever we upload a file to S3. With the AWS CLI it would be: aws s3 cp myfile s3://bucket/path/file --acl bucket-owner-full-control. To do the same in Java code, we use (assuming aws s3 sdk v1): Ini
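For reference, a minimal sketch of the equivalent direct upload with the AWS S3 SDK v1 (the bucket, key, and file names are placeholders; this is the plain SDK call, not the Beam filesystem write the thread is about):

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.CannedAccessControlList;
    import com.amazonaws.services.s3.model.PutObjectRequest;
    import java.io.File;

    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    // Equivalent of: aws s3 cp myfile s3://bucket/path/file --acl bucket-owner-full-control
    s3.putObject(
        new PutObjectRequest("bucket", "path/file", new File("myfile"))
            .withCannedAcl(CannedAccessControlList.BucketOwnerFullControl));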

Re: Write S3 File with CannedACL

2022-03-09 Thread Yushu Yao
> …workaround for that, since the related Jira issue [1] is still open.
>
> Side question: are you interested only in the multipart version or both?
>
> — Alexey
>
> [1] https://issues.apache.org/jira/browse/BEAM-10850

Changing SQL within same Beam Job?

2022-03-31 Thread Yushu Yao
Hi Experts, we have a dynamically configurable list of transformations to be performed on an input stream. Each input element will be transformed by one of those transformations, and each transformation can be expressed as SQL. Can this be achieved with Beam SQL? Say, for each input, it will
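One sketch of how a fixed set of SQLs could be wired up at pipeline-construction time (a Beam pipeline graph cannot change while the job is running; the routing field and queries below are made up):

    import java.util.Arrays;
    import java.util.List;
    import org.apache.beam.sdk.extensions.sql.SqlTransform;
    import org.apache.beam.sdk.transforms.Flatten;
    import org.apache.beam.sdk.transforms.Partition;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionList;
    import org.apache.beam.sdk.values.Row;

    List<String> queries = Arrays.asList(
        "SELECT f_id, f_value * 2 AS f_value FROM PCOLLECTION",
        "SELECT f_id, f_value + 1 AS f_value FROM PCOLLECTION");

    // Route each element to the index of the SQL that should handle it
    // (here via a hypothetical integer field "f_route").
    PCollectionList<Row> routed = input.apply(
        Partition.of(queries.size(),
            (Row row, int n) -> row.getInt32("f_route") % n));

    // Apply one SqlTransform per branch, then flatten the results together.
    PCollectionList<Row> branches = PCollectionList.empty(input.getPipeline());
    for (int i = 0; i < queries.size(); i++) {
      branches = branches.and(
          routed.get(i).apply("Sql" + i, SqlTransform.query(queries.get(i))));
    }
    PCollection<Row> output = branches.apply(Flatten.pCollections());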

RDD (Spark dataframe) into a PCollection?

2022-05-23 Thread Yushu Yao
Hi Folks, I know this is not the optimal way to use Beam :-) But assume I only use the Spark runner. I have a Spark library (very complex) that emits a Spark dataframe (or RDD). I also have an existing complex Beam pipeline that can do post-processing on the data inside the dataframe. However, t

Re: RDD (Spark dataframe) into a PCollection?

2022-05-23 Thread Yushu Yao
On Mon, May 23, 2022 at 11:35 AM Robert Bradshaw wrote:
>
> The easiest way to do this would be to write the RDD somewhere then
> read it from Beam.
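A minimal sketch of that hand-off, assuming the dataframe can round-trip through files (the path, format, and parsing step are placeholders):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.values.PCollection;

    // Spark side (separate step): persist the DataFrame, e.g. as JSON lines.
    //   df.write().json("s3://bucket/handoff/");

    // Beam side: read the same files back into a PCollection.
    Pipeline p = Pipeline.create(options);
    PCollection<String> lines =
        p.apply(TextIO.read().from("s3://bucket/handoff/part-*"));
    // ...parse each line into the pipeline's element type and continue
    // with the existing post-processing transforms...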

Re: RDD (Spark dataframe) into a PCollection?

2022-05-24 Thread Yushu Yao
> …be supported). And dreaming a bit more, for those who need to have a mixed pipeline (e.g. Spark + Beam), such connectors could support the push-down of pure Spark pipelines and then use the result downstream in Beam.
>
> — Alexey

Metrics in Beam+Spark

2022-07-14 Thread Yushu Yao
Hi Team, does anyone have a working example of a Beam job running on top of Spark, so that I can use the Beam metrics syntax and have the metrics shipped out via Spark's metrics infrastructure? The only thing I achieved is to be able to queryMetrics() every half second and copy all the metrics into the Spark met
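A minimal sketch of the two halves involved (the namespace and counter name are made up; the polling loop mirrors the queryMetrics() workaround the poster describes, not a recommended setup):

    import org.apache.beam.sdk.PipelineResult;
    import org.apache.beam.sdk.metrics.Counter;
    import org.apache.beam.sdk.metrics.MetricQueryResults;
    import org.apache.beam.sdk.metrics.MetricResult;
    import org.apache.beam.sdk.metrics.Metrics;
    import org.apache.beam.sdk.metrics.MetricsFilter;

    // Inside a DoFn: declare and increment a counter with the Beam metrics API.
    private final Counter processed = Metrics.counter("my-namespace", "processed");
    // ...in @ProcessElement: processed.inc();

    // On the driver: poll a metrics snapshot from the PipelineResult.
    PipelineResult result = pipeline.run();
    MetricQueryResults snapshot =
        result.metrics().queryMetrics(MetricsFilter.builder().build());
    for (MetricResult<Long> counter : snapshot.getCounters()) {
      System.out.println(counter.getName() + " = " + counter.getAttempted());
    }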

Re: Metrics in Beam+Spark

2022-07-14 Thread Yushu Yao
> …le yet. Typically, that means distributing such code by other means.
>
> /Moritz

Re: Metrics in Beam+Spark

2022-07-15 Thread Yushu Yao
…cannot be instantiated
Caused by: java.lang.ClassNotFoundException: com.salesforce.einstein.data.platform.connectors.JmxSink
I guess this is the classpath issue you were referring to above. Any hints on how to fix it will be greatly appreciated. Thanks! -Yushu
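One common fix, sketched under assumptions: Spark instantiates metrics sinks by reflection early in driver/executor startup, generally before jars shipped via --jars are on the classpath, so the jar containing the custom sink usually has to be placed on the classpath directly (all paths below are placeholders):

    # Sketch: put the sink jar on the driver and executor classpaths
    # and point the metrics config at the sink class.
    spark-submit \
      --conf spark.driver.extraClassPath=/path/to/connectors.jar \
      --conf spark.executor.extraClassPath=/path/to/connectors.jar \
      --conf "spark.metrics.conf.*.sink.jmx.class=com.salesforce.einstein.data.platform.connectors.JmxSink" \
      ...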