Re: Count(distinct) not working in beam sql

2023-11-11 Thread Talat Uyarer via user
Hi, I saw this a little bit late. I implemented a custom count distinct for our streaming use case. If you are looking for something close enough but not exact, you can use my UDF. It uses the HyperLogLogPlus algorithm, which is an efficient and scalable way to estimate cardinality with a controlled
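A minimal sketch of what such a UDAF could look like, assuming the stream-lib HyperLogLogPlus implementation and registration through SqlTransform.registerUdaf. The names HllCountDistinctFn and HLL_COUNT_DISTINCT are illustrative, not the actual UDF from the thread, and a custom accumulator coder would likely be needed in practice:

```java
import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;
import org.apache.beam.sdk.transforms.Combine;

// Approximate COUNT(DISTINCT) as a Beam SQL UDAF backed by HyperLogLogPlus.
public class HllCountDistinctFn extends Combine.CombineFn<String, HyperLogLogPlus, Long> {
  @Override
  public HyperLogLogPlus createAccumulator() {
    return new HyperLogLogPlus(14); // precision controls the error bound
  }

  @Override
  public HyperLogLogPlus addInput(HyperLogLogPlus acc, String input) {
    acc.offer(input);
    return acc;
  }

  @Override
  public HyperLogLogPlus mergeAccumulators(Iterable<HyperLogLogPlus> accs) {
    HyperLogLogPlus merged = createAccumulator();
    try {
      for (HyperLogLogPlus acc : accs) {
        merged.addAll(acc);
      }
    } catch (CardinalityMergeException e) {
      throw new RuntimeException(e);
    }
    return merged;
  }

  @Override
  public Long extractOutput(HyperLogLogPlus acc) {
    return acc.cardinality();
  }
}

// Registration sketch:
// SqlTransform.query("SELECT HLL_COUNT_DISTINCT(user_id) FROM PCOLLECTION")
//     .registerUdaf("HLL_COUNT_DISTINCT", new HllCountDistinctFn());
```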

Watermark Alignment on Flink Runner's UnboundedSourceWrapper

2023-05-19 Thread Talat Uyarer via user
Hi All, I have a stream aggregation job which reads from Kafka and writes to some sinks. When I submit my job, the Flink checkpoint size keeps increasing if I use unaligned checkpoint settings, and it does not emit any window results. If I use an aligned checkpoint, the size is somewhat under control (still bi

Local Combiner for GroupByKey on Flink Streaming jobs

2023-05-19 Thread Talat Uyarer via user
Hi, I have a stream aggregation job which is running on Flink 1.13. I generate the DAG by using Beam SQL. My SQL query has a TUMBLE window. Basically my pipeline reads from Kafka, aggregates (counts/sums some values) by streaming aggregation, and writes to a sink. Beam SQL uses GroupByKey for the aggregation p
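For context, a pipeline of this shape might look roughly like the sketch below (column names and window size are placeholders); the question in the thread is whether the GroupByKey produced for the TUMBLE aggregation gets a local, pre-shuffle combiner on Flink streaming:

```java
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class TumbleAggregation {
  // Rows are assumed to already carry an event_timestamp column.
  static PCollection<Row> aggregate(PCollection<Row> rows) {
    return rows.apply(
        SqlTransform.query(
            "SELECT customer_id, COUNT(*) AS cnt, SUM(bytes_total) AS bytes_total "
                + "FROM PCOLLECTION "
                + "GROUP BY customer_id, TUMBLE(event_timestamp, INTERVAL '1' MINUTE)"));
  }
}
```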

Re: Beam SQL Alias issue while using With Clause

2023-03-02 Thread Talat Uyarer via user
> I'm experimenting with ideas on how to work around this but a fix will > likely require a Calcite upgrade, which is not something I'd have time > to help with. (I'm not on the Google Beam team anymore.) >

Re: Beam SQL Alias issue while using With Clause

2023-02-22 Thread Talat Uyarer via user
ovider); > +PCollection stream = > +BeamSqlRelUtils.toPCollection( > +pipeline, sqlEnv.parseQuery("WITH tempTable AS (SELECT * FROM > basicRowTestTable WHERE basicRowTestTable.col.string_field = 'innerStr') > SELECT * FROM tempTable")); > +Schema outputS

Re: Beam SQL Alias issue while using With Clause

2023-02-03 Thread Talat Uyarer via user
>>> One minor issue I encountered. It took me a while to get your test case >>> running, it doesn't appear there are any calcite gradle rules to run >>> CoreQuidemTest and constructing the classpath manually was tedious. Did I >>> miss something?

Re: How to submit beam python pipeline to GKE flink cluster

2023-02-03 Thread Talat Uyarer via user
Hi, Do you use the Flink operator or a manually deployed session cluster? Thanks On Fri, Feb 3, 2023, 4:32 AM P Singh wrote: > Hi Team, > > I have set up a flink cluster on GKE and am trying to submit a beam > pipeline with below options. I was able to run this on a local machine but > I don't unde

Beam Portable Framework Question for Same SDK and Runner Language

2023-02-02 Thread Talat Uyarer via user
Hi, I know we use the portability framework when the SDK language (Python) is different from the runner's language (Java). If my runner is Java-based and I want to use the portability framework for the Java SDK, is there any optimization on the Beam side rather than running two separate docker images

Re: Beam SQL Alias issue while using With Clause

2023-01-27 Thread Talat Uyarer via user
suspect the problem lies there. I spent some time looking but wasn't able > to find it by code inspection, it looks like this code path is doing the > right thing with names. I'll spend some time tomorrow trying to reproduce > this on pure Calcite. > > Andrew > > > O

Re: Beam SQL Alias issue while using With Clause

2023-01-24 Thread Talat Uyarer via user
; > Schema.builder().addInt32Field("fout_int").addStringField("fout_string").build(); > assertThat(result.getSchema(), equalTo(output_schema)); > > Row output = Row.withSchema(output_schema).addValues(1, > "strstr").build(); > PAssert.that(

Re: Beam SQL Alias issue while using With Clause

2023-01-24 Thread Talat Uyarer via user
tes Java bytecode to directly execute the DSL. >> >> Problem: it looks like the CalcRel has output columns with aliases "id" >> and "v" where it should have output columns with aliases "id" and "value". >> >> Kenn >> >&

Beam SQL Alias issue while using With Clause

2023-01-12 Thread Talat Uyarer via user
Hi All, I am using Beam 2.43 with Calcite SQL with Java. I have a query with a WITH clause and some aliasing. It looks like the Beam query optimizer drops the Select statement's aliases after optimizing my query. Can you help me identify where the problem is? This is my query INFO: SQL: WITH `tempT
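A minimal sketch of the kind of query involved, assuming illustrative column names (the actual query and schema from the thread are truncated above). The expectation is that the output schema keeps the declared aliases; the reported behavior is that the optimized plan replaces one of them with a generated name:

```java
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class WithClauseAlias {
  // Expected output schema fields: id, value. Reported: the alias is dropped
  // after optimization (e.g. the column comes back as `v`).
  static PCollection<Row> query(PCollection<Row> input) {
    return input.apply(
        SqlTransform.query(
            "WITH tempTable AS (SELECT id, val AS `value` FROM PCOLLECTION) "
                + "SELECT id, `value` FROM tempTable"));
  }
}
```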

Re: Beam slowness compared to flink-native

2022-05-10 Thread Talat Uyarer
Hi Ifat, Did you enable the fasterCopy parameter? Please look at this issue: https://issues.apache.org/jira/browse/BEAM-11146 Thanks On Mon, May 2, 2022 at 12:57 AM Afek, Ifat (Nokia - IL/Kfar Sava) < ifat.a...@nokia.com> wrote: > Hi, > > > > I’m investigating a slowness of our beam pipelines, an
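A minimal sketch of enabling the option, assuming a Beam version that includes the fasterCopy flag from BEAM-11146 on FlinkPipelineOptions:

```java
import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class FasterCopyExample {
  public static void main(String[] args) {
    FlinkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(FlinkPipelineOptions.class);
    // Skips the defensive copy between chained Flink operators. Only safe if
    // your DoFns do not mutate input elements or previously emitted outputs.
    options.setFasterCopy(true);
    // Equivalent command-line flag: --fasterCopy=true
  }
}
```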

Re: Calculating Top K element per window with Beam SQL

2021-07-29 Thread Talat Uyarer
l/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamSortRel.java#L200> >

Calculating Top K element per window with Beam SQL

2021-07-09 Thread Talat Uyarer
Hi, I am trying to calculate the top k elements based on a streaming aggregation per window. Do you know if there is any support for this in Beam SQL, or how I can achieve this goal with Beam SQL on a stream? Sample Query SELECT customer_id, app, sum(bytes_total) as bytes_total FROM PCOLLECTION GROUP BY c
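One possible workaround, sketched under the assumption that the aggregation stays in Beam SQL and the top-K step uses the core Top transform afterwards (field names and window size are placeholders, not the thread's actual solution):

```java
import java.io.Serializable;
import java.util.Comparator;
import java.util.List;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.transforms.Top;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class TopKPerWindow {
  static class ByBytesTotal implements Comparator<Row>, Serializable {
    @Override
    public int compare(Row a, Row b) {
      return Long.compare(a.getInt64("bytes_total"), b.getInt64("bytes_total"));
    }
  }

  static PCollection<List<Row>> topK(PCollection<Row> input, int k) {
    // The TUMBLE clause windows the aggregated rows, so Top fires once per window.
    PCollection<Row> aggregated =
        input.apply(
            SqlTransform.query(
                "SELECT customer_id, app, SUM(bytes_total) AS bytes_total "
                    + "FROM PCOLLECTION "
                    + "GROUP BY customer_id, app, TUMBLE(event_timestamp, INTERVAL '1' MINUTE)"));
    return aggregated.apply(Top.of(k, new ByBytesTotal()).withoutDefaults());
  }
}
```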

How Beam SQL Side Input refresh/update

2021-05-07 Thread Talat Uyarer
Hi, Based on the Join documentation, if I have a Join with Unbounded and Bounded: > For this type of JOIN bounded input is treated as a side-input by the > implementation. This means that window/trigger is inherited from upstreams. In my pipeline I don't have any triggering or window. I use a global
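For reference, the usual way to get a periodically refreshed side input is the documented "slowly updating side input" pattern; a minimal sketch follows, with placeholder data and a five-minute refresh interval as assumptions:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Latest;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollectionView;
import org.joda.time.Duration;

public class RefreshingSideInput {
  static PCollectionView<Map<String, String>> create(Pipeline p) {
    return p.apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(5)))
        .apply(
            ParDo.of(
                new DoFn<Long, Map<String, String>>() {
                  @ProcessElement
                  public void process(ProcessContext c) {
                    // Placeholder: re-load the bounded lookup data here on every tick.
                    Map<String, String> lookup = new HashMap<>();
                    lookup.put("placeholder-key", "placeholder-value");
                    c.output(lookup);
                  }
                }))
        .apply(
            Window.<Map<String, String>>into(new GlobalWindows())
                .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
                .discardingFiredPanes())
        .apply(Latest.globally())
        .apply(View.asSingleton());
  }
}
```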

Re: Apache Beam SQL and UDF

2021-02-10 Thread Talat Uyarer
t update data with any mechanism. Thanks On Wed, Feb 10, 2021 at 12:41 PM Talat Uyarer wrote: > Does beam create the UDF function for every bundle or in the setup of the pipeline? > > I will keep internal state in memory. The async thread will update that in > memory state based on an interva

Re: Apache Beam SQL and UDF

2021-02-10 Thread Talat Uyarer
:37 PM Rui Wang wrote: > The problem that I can think of is maybe before the async call is > completed, the UDF life cycle has reached to the end. > > > -Rui > > On Wed, Feb 10, 2021 at 12:34 PM Talat Uyarer < > tuya...@paloaltonetworks.com> wrote: > >> Hi,

Apache Beam SQL and UDF

2021-02-10 Thread Talat Uyarer
Hi, We plan to use a UDF in our SQL. We want to achieve some kind of filtering based on internal state. We want to update that internal state with a separate async thread in the UDF. Before implementing this I want to get your opinions. Is there any limitation for a UDF to have a multi-thread implemen
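A hypothetical sketch of the idea, assuming a Beam SQL scalar UDF implemented via BeamSqlUdf with a static allow-list refreshed by a background thread; the class, function name, and refresh logic are all illustrative, and the thread's concern about the UDF lifecycle ending before the async call completes still applies:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.beam.sdk.extensions.sql.BeamSqlUdf;

// Hypothetical filter UDF whose in-memory allow-list is refreshed by a daemon thread.
public class AllowListUdf implements BeamSqlUdf {
  private static final Set<String> ALLOWED = ConcurrentHashMap.newKeySet();
  private static final ScheduledExecutorService REFRESHER =
      Executors.newSingleThreadScheduledExecutor(
          r -> {
            Thread t = new Thread(r, "allow-list-refresher");
            t.setDaemon(true);
            return t;
          });

  static {
    REFRESHER.scheduleAtFixedRate(AllowListUdf::refresh, 0, 5, TimeUnit.MINUTES);
  }

  private static void refresh() {
    // Placeholder: reload the allow-list from a file, database, or config service.
    ALLOWED.add("example-tenant");
  }

  public static Boolean eval(String tenantId) {
    return ALLOWED.contains(tenantId);
  }
}

// Registration sketch:
// SqlTransform.query("SELECT * FROM PCOLLECTION WHERE IS_ALLOWED(tenant_id)")
//     .registerUdf("IS_ALLOWED", AllowListUdf.class);
```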

Re: About Beam SQL Schema Changes and Code generation

2020-12-08 Thread Talat Uyarer
>>> >>>>>> Reuven >>>>>> >>>>>> On Tue, Dec 8, 2020 at 9:33 AM Brian Hulette >>>>>> wrote: >>>>>> >>>>>>> Reuven, could you clarify what you have in mind? I know multiple >>>>>>

Re: About Beam SQL Schema Changes and Code generation

2020-12-08 Thread Talat Uyarer
compatibility >>>>>> support to SchemaCoder, including support for certain schema changes >>>>>> (field >>>>>> additions/deletions) - I think the most recent discussion was here [1]. >>>>>> >>>>>> But it sounds

Re: About Beam SQL Schema Changes and Code generation

2020-12-08 Thread Talat Uyarer
might be theoretically possible to do this (at least for the > case where existing fields do not change). Whether anyone currently has > available time to do this is a different question, but it's something that > can be looked into. > > On Mon, Dec 7, 2020 at 9:29 PM Talat Uyar

Re: About Beam SQL Schema Changes and Code generation

2020-12-07 Thread Talat Uyarer
ges, are these new fields being added to the > schema? Or are you making changes to the existing fields? > > On Mon, Dec 7, 2020 at 9:02 PM Talat Uyarer > wrote: > >> Hi, >> For sure let me explain a little bit about my pipeline. >> My Pipeline is actually simple &g

Re: About Beam SQL Schema Changes and Code generation

2020-12-07 Thread Talat Uyarer
m unless you are > also changing the SQL statement? > > Is this a case where you have a SELECT *, and just want to make sure those > fields are included? > > Reuven > > On Mon, Dec 7, 2020 at 6:31 PM Talat Uyarer > wrote: > >> Hi Andrew, >> >> I assume

Re: About Beam SQL Schema Changes and Code generation

2020-12-07 Thread Talat Uyarer
QL to try to produce > the same plan for an updated SQL query. > > Andrew > > On Mon, Dec 7, 2020 at 5:44 PM Talat Uyarer > wrote: > >> Hi, >> >> We are using Beamsql on our pipeline. Our Data is written in Avro format. >> We generate our rows bas

About Beam SQL Schema Changes and Code generation

2020-12-07 Thread Talat Uyarer
Hi, We are using Beam SQL in our pipeline. Our data is written in Avro format. We generate our rows based on our Avro schema. Over time the schema changes. I believe Beam SQL generates Java code based on what we define as the Beam schema while submitting the pipeline. Do you have any idea how we
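For reference, a minimal sketch of how the Row schema is typically derived from the Avro schema (using AvroUtils as it was packaged at the time); the point of the thread is that this schema is fixed at pipeline-construction time, so a changed Avro schema means resubmitting or updating the job:

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.utils.AvroUtils;
import org.apache.beam.sdk.values.Row;

public class AvroRowConversion {
  // Derive the Beam schema once, at pipeline construction.
  static Schema beamSchemaFor(org.apache.avro.Schema avroSchema) {
    return AvroUtils.toBeamSchema(avroSchema);
  }

  // Convert each incoming record to a Row with that fixed schema.
  static Row toRow(GenericRecord record, Schema beamSchema) {
    return AvroUtils.toBeamRowStrict(record, beamSchema);
  }
}
```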

Re: OOM issue on Dataflow Worker by doing string manipulation

2020-09-03 Thread Talat Uyarer
GC thrashing. > Memory is used/total/max = 4112/5994/5994 MB, GC last/max = 97.36/97.36 %, > #pushbacks=3, gc thrashing=true. Heap dump not written. And also my process rate is 4kps per instance. I would like to hear your suggestions if you have any. Thanks On Wed, Sep 2, 2020 at 6:22 PM Tal

Re: OOM issue on Dataflow Worker by doing string manipulation

2020-09-02 Thread Talat Uyarer
I also tried Brian's suggestion to clear the StringBuilder by calling delete with the StringBuilder's length. No luck. I am still getting the same error message. Do you have any suggestions? Thanks On Wed, Sep 2, 2020 at 3:33 PM Talat Uyarer wrote: > If I'm understanding Talat's log

Re: OOM issue on Dataflow Worker by doing string manipulation

2020-09-02 Thread Talat Uyarer
to-reuse-a-stringbuilder-in-a-loop >> <https://stackoverflow.com/questions/242438/is-it-better-to-reuse-a-stringbuilder-in-a-loop>

Re: OOM issue on Dataflow Worker by doing string manipulation

2020-09-02 Thread Talat Uyarer
> On Wed, Sep 2, 2020 at 3:02 PM Talat Uyarer > wrote: >> Sorry for the wrong import. You can see on the code I am using >> Stri

Re: OOM issue on Dataflow Worker by doing string manipulation

2020-09-02 Thread Talat Uyarer
> Could you try using StringBuilder instead since the usage is not > appropriate for a StringWriter? > > On Wed, Sep 2, 2020 at 2:49 PM Talat Uyarer > wrote: >> Hi, >> >> I have an issue with String Co

OOM issue on Dataflow Worker by doing string manipulation

2020-09-02 Thread Talat Uyarer
Hi, I have an issue with string concatenation. You can see my code below. [1] I have a step in my DF job which concatenates strings. But somehow, when I use that step, my job starts getting JVM restart errors: Shutting down JVM after 8 consecutive periods of measured GC thrashing. > Memory is u
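A minimal sketch of the direction suggested later in the thread (use StringBuilder rather than StringWriter, and don't hold the builder across elements); the element type and newline separator are assumptions, since the original code is not shown here:

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

public class ConcatFn extends DoFn<KV<String, Iterable<String>>, String> {
  @ProcessElement
  public void process(
      @Element KV<String, Iterable<String>> grouped, OutputReceiver<String> out) {
    // A fresh builder per element keeps nothing referenced across elements,
    // so the concatenated buffers can be garbage collected promptly.
    StringBuilder sb = new StringBuilder();
    for (String part : grouped.getValue()) {
      sb.append(part).append('\n');
    }
    out.output(sb.toString());
  }
}
```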

Re: Resource Consumption increase With TupleTag

2020-08-20 Thread Talat Uyarer
? (Even if MessageExtractor seems simple it isn't free, You > have to now write to two GCS locations instead of one for each work item > that you process so your doing more network calls) > > > On Wed, Aug 19, 2020 at 8:36 PM Talat Uyarer > wrote: > >> Filter

Re: Resource Consumption increase With TupleTag

2020-08-19 Thread Talat Uyarer
you can have a single Pipeline and do as > many reads as you want and the'll all get executed in the same job. > > On Wed, Aug 19, 2020 at 4:14 PM Talat Uyarer > wrote: > > > > Hi Robert, > > > > I calculated process speed based on worker count. When I have sep

Re: Resource Consumption increase With TupleTag

2020-08-19 Thread Talat Uyarer
> ). > > - Robert > > On Wed, Aug 19, 2020 at 3:37 PM Talat Uyarer > wrote: > > > > Hi, > > > > I have a very simple DAG on my dataflow job. > (KafkaIO->Filter-&g

Resource Consumption increase With TupleTag

2020-08-19 Thread Talat Uyarer
Hi, I have a very simple DAG in my Dataflow job (KafkaIO->Filter->WriteGCS). When I submit this Dataflow job per topic, it has 4kps per-instance processing speed. However, I want to consume two different topics in one DF job. I used TupleTag. I created TupleTags per message type. Each topic has dif
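A minimal sketch of the multi-output pattern described, with placeholder tag names and a placeholder routing predicate; each tagged output can then go to its own GCS write:

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

public class RouteByType {
  static final TupleTag<String> TYPE_A = new TupleTag<String>() {};
  static final TupleTag<String> TYPE_B = new TupleTag<String>() {};

  // Route each message to one of two outputs; the startsWith check is a placeholder.
  static PCollectionTuple route(PCollection<String> messages) {
    return messages.apply(
        ParDo.of(
                new DoFn<String, String>() {
                  @ProcessElement
                  public void process(@Element String msg, MultiOutputReceiver out) {
                    if (msg.startsWith("A")) {
                      out.get(TYPE_A).output(msg);
                    } else {
                      out.get(TYPE_B).output(msg);
                    }
                  }
                })
            .withOutputTags(TYPE_A, TupleTagList.of(TYPE_B)));
  }
}

// Usage sketch: PCollectionTuple routed = RouteByType.route(messages);
// routed.get(RouteByType.TYPE_A).apply(...); routed.get(RouteByType.TYPE_B).apply(...);
```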

Re: KafkaIO does not support add or remove topics

2020-06-29 Thread Talat Uyarer
> [3] https://issues.apache.org/jira/browse/BEAM-9977

KafkaIO does not support add or remove topics

2020-06-29 Thread Talat Uyarer
Hi, I am using Dataflow. When I update my DF job with a new topic or update the partition count on the same topic, I cannot use DF's update function. Could you help me understand why I cannot use the update function for that? I checked the Beam source code but I could not find the right place t

Re: Building Dataflow Worker

2020-06-15 Thread Talat Uyarer
eam 2.19 is using gradle 5.2.1, is the installed version > compatible with that? > > Try > ./gradlew :runners:google-cloud-dataflow-java:worker:shadowJar > in a clean workspace. > > On Fri, Jun 12, 2020 at 4:30 PM Talat Uyarer > wrote: > >> Hi, >> >>

Building Dataflow Worker

2020-06-12 Thread Talat Uyarer
Hi, I want to build the Dataflow worker on Apache Beam 2.19. However, I faced a gRPC issue. I did not change anything; I just checked out the release-2.19.0 branch and ran the build command. Could you help me understand why it does not build? [1] Additional information: based on my limited knowledge, it looks like

Re: KafkaIO Read Latency

2020-06-10 Thread Talat Uyarer
ng and closing bundle*”? > > Sometimes, starting a KafkaReader can take time since it will seek for a > start offset for each assigned partition but it happens only once at > pipeline start-up and mostly depends on network conditions. > > On 9 Jun 2020, at 23:05, Talat Uyarer > wrote

KafkaIO Read Latency

2020-06-09 Thread Talat Uyarer
Hi, I added some metrics on a step right after KafkaIO. When I compare the read time difference between producer and KafkaIO it is 800ms for P99. However somehow that step's opening and closing bundle difference is 18 seconds for p99. The step itself does not do any specific thing. Do you have any

Re: Pipeline Processing Time

2020-06-09 Thread Talat Uyarer
https://beam.apache.org/releases/javadoc/2.21.0/org/apache/beam/sdk/transforms/DoFn.ProcessElement.html

Re: Pipeline Processing Time

2020-06-01 Thread Talat Uyarer
> > On Thu, May 28, 2020 at 1:12 PM Talat Uyarer > wrote: >> Yes I am trying to track how long

Re: Pipeline Processing Time

2020-05-28 Thread Talat Uyarer
essing takes? > Do you care about how much CPU time is being consumed in aggregate for all > the processing that your pipeline is doing? > > > On Thu, May 28, 2020 at 11:01 AM Talat Uyarer < > tuya...@paloaltonetworks.com> wrote: > >> I am using Dataflow Runner.

Re: Pipeline Processing Time

2020-05-28 Thread Talat Uyarer
I am using Dataflow Runner. The pipeline read from kafkaIO and send Http. I could not find any metadata field on the element to set first read time. On Thu, May 28, 2020 at 10:44 AM Kyle Weaver wrote: > Which runner are you using? > > On Thu, May 28, 2020 at 1:43 PM Talat Uyarer

Pipeline Processing Time

2020-05-28 Thread Talat Uyarer
Hi, I have a pipeline which has 5 steps. What is the best way to measure processing time for my pipeline? Thanks
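One common approach that fits the later discussion in this thread is a user metric that records element age at a given step; the sketch below assumes the element timestamp set by the source (e.g. the Kafka record time) is a reasonable "first read time":

```java
import org.apache.beam.sdk.metrics.Distribution;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;
import org.joda.time.Instant;

public class ElementAgeFn extends DoFn<String, String> {
  // Distribution of "now minus element timestamp"; for KafkaIO the element
  // timestamp is typically the record time, so this approximates latency
  // from ingestion up to this step. Shows up as a custom metric on the runner.
  private final Distribution ageMs =
      Metrics.distribution(ElementAgeFn.class, "element_age_ms");

  @ProcessElement
  public void process(
      @Element String element, @Timestamp Instant ts, OutputReceiver<String> out) {
    ageMs.update(Instant.now().getMillis() - ts.getMillis());
    out.output(element);
  }
}
```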

KafkaIO BackLog Elements Metrics

2020-05-09 Thread Talat Uyarer
Hi, I want to get Kafka's backlog metrics. In the Apache Beam code I saw that Beam collects those metrics here [1] as Source Metrics. However, I cannot see those metrics in Dataflow's Metrics Explorer. Do you know if there is any way to get those metrics? Also I saw there is a MetricsSink. But based on b

Re: Beam SQL Nested Rows are flatten by Calcite

2020-02-24 Thread Talat Uyarer
ed > Row structure (as they were flattened in LogicalPlan). > > Depends on how that patch works. Nested row might not immediately work > after you apply that patch. > > > -Rui > > On Fri, Feb 14, 2020 at 3:14 PM Talat Uyarer > wrote: > >> Do you mean they were f

Re: Beam SQL Nested Rows are flatten by Calcite

2020-02-14 Thread Talat Uyarer
<https://jira.apache.org/jira/browse/CALCITE-3138> >>> >>>

Re: Beam SQL Nested Rows are flatten by Calcite

2020-02-14 Thread Talat Uyarer
> > -Rui > > On Thu, Feb 13, 2020 at 9:05 PM Talat Uyarer > wrote: >> Hi, >> >> I am trying to Beam SQ

Beam SQL Nested Rows are flatten by Calcite

2020-02-13 Thread Talat Uyarer
Hi, I am trying out Beam SQL, but something is wrong. I have nested row records. I read them as a PCollection, apply a SELECT * query, and compare the result with the initial rows. It looks like nested rows are flattened by Calcite. Do you have any idea how I can avoid this? I added a sample testcase for my issue: S
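A minimal sketch of the setup being described, with illustrative field names (the actual test case from the thread is truncated above): a schema containing a nested Row field, run through SELECT *, where the output schema is expected to match the input schema but reportedly comes back flattened:

```java
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class NestedRowSelect {
  static final Schema INNER = Schema.builder().addStringField("string_field").build();
  static final Schema OUTER =
      Schema.builder().addInt32Field("id").addRowField("col", INNER).build();

  // Reported behavior: the output of SELECT * has col.string_field promoted to a
  // top-level column instead of keeping the nested Row field `col`.
  static PCollection<Row> selectAll(PCollection<Row> input) {
    return input.apply(SqlTransform.query("SELECT * FROM PCOLLECTION"));
  }
}
```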

Re: Limit log files count with Dataflow Runner Logging

2019-08-30 Thread Talat Uyarer
ers/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/options/DataflowWorkerLoggingOptions.java> > On W

Limit log files count with Dataflow Runner Logging

2019-08-28 Thread Talat Uyarer
Hi All, This is my first message to this mailing list. Please let me know if I am sending this message to the wrong mailing list. My stream processing jobs are running on the Google Cloud Dataflow engine. For logging I am using Stackdriver. I added slf4j-jdk14 and slf4j-api at runtime to enable Stackdriver. Howe
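For context, the options referenced in the reply above control worker log verbosity rather than a per-worker file count; as far as I know there is no direct file-count knob there. A minimal sketch of what those options do expose (package names in the override are placeholders):

```java
import org.apache.beam.runners.dataflow.options.DataflowWorkerLoggingOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class LoggingOptionsExample {
  public static void main(String[] args) {
    DataflowWorkerLoggingOptions options =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(DataflowWorkerLoggingOptions.class);
    // Raise the default worker log level to cut the volume written to worker
    // log files and Stackdriver; override noisier packages individually.
    options.setDefaultWorkerLogLevel(DataflowWorkerLoggingOptions.Level.WARN);
    options.setWorkerLogLevelOverrides(
        new DataflowWorkerLoggingOptions.WorkerLogLevelOverrides()
            .addOverrideForName("org.apache.kafka", DataflowWorkerLoggingOptions.Level.ERROR));
  }
}
```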