Hi,
I saw this a little bit late. I implemented a custom count distinct for our
streaming use case. If you are looking for something close enough but not
exact, you can use my UDF. It uses the HyperLogLogPlus algorithm, which is
an efficient and scalable way to estimate cardinality with a controlled
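A minimal sketch of such an approximate count-distinct, assuming the
stream-lib HyperLogLogPlus implementation (the exact UDF from this thread is
not shown; the class name, precision, and wiring here are illustrative):

  import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
  import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;
  import org.apache.beam.sdk.transforms.Combine;

  // approximate COUNT(DISTINCT ...) as a CombineFn; precision p=14 gives
  // roughly 0.8% relative error at a few KB of state per key/window
  public class ApproxCountDistinctFn
      extends Combine.CombineFn<String, HyperLogLogPlus, Long> {

    @Override
    public HyperLogLogPlus createAccumulator() {
      return new HyperLogLogPlus(14);
    }

    @Override
    public HyperLogLogPlus addInput(HyperLogLogPlus acc, String input) {
      acc.offer(input);
      return acc;
    }

    @Override
    public HyperLogLogPlus mergeAccumulators(Iterable<HyperLogLogPlus> accs) {
      HyperLogLogPlus merged = createAccumulator();
      try {
        for (HyperLogLogPlus acc : accs) {
          merged.addAll(acc);
        }
      } catch (CardinalityMergeException e) {
        throw new RuntimeException(e);
      }
      return merged;
    }

    @Override
    public Long extractOutput(HyperLogLogPlus acc) {
      return acc.cardinality();
    }
    // note: the accumulator needs a Beam coder; the sketch can be
    // round-tripped via getBytes() / HyperLogLogPlus.Builder.build(bytes)
  }

  // usage: pcoll.apply(Combine.globally(new ApproxCountDistinctFn()))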
Hi All,
I have a stream aggregation job which reads from Kafka and writes some
Sinks.
When I submit my job, the Flink checkpoint size keeps increasing if I use
unaligned checkpoint settings, and it does not emit any window results.
If I use an aligned checkpoint, the size is somewhat under control (still bi
Hi,
I have a stream aggregation job which is running on Flink 1.13. I generate
the DAG by using Beam SQL. My SQL query has a TUMBLE window. Basically my
pipeline reads from Kafka, aggregates (counts/sums) some values by streaming
aggregation, and writes to a sink.
Beam SQL uses GroupByKey for the aggregation p
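A minimal sketch of this kind of tumbling-window aggregation with
SqlTransform (column names are assumptions, not from the original query):

  import org.apache.beam.sdk.extensions.sql.SqlTransform;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.Row;

  // TUMBLE assigns each row to one fixed, non-overlapping window based on
  // its event timestamp column; `input` is a schema-aware PCollection<Row>
  PCollection<Row> aggregated =
      input.apply(
          SqlTransform.query(
              "SELECT customer_id, COUNT(*) AS event_count, SUM(bytes_total) AS bytes_total "
                  + "FROM PCOLLECTION "
                  + "GROUP BY customer_id, TUMBLE(event_time, INTERVAL '1' MINUTE)"));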
>
> I'm experimenting with ideas on how to work around this but a fix will
> likely require a Calcite upgrade, which is not something I'd have time
> to help with. (I'm not on the Google Beam team anymore.)
>
ovider);
> +    PCollection<Row> stream =
> +        BeamSqlRelUtils.toPCollection(
> +            pipeline,
> +            sqlEnv.parseQuery(
> +                "WITH tempTable AS (SELECT * FROM basicRowTestTable WHERE basicRowTestTable.col.string_field = 'innerStr') SELECT * FROM tempTable"));
> +Schema outputS
>>>
>>> One minor issue I encountered. It took me a while to get your test case
>>> running; there don't appear to be any Calcite Gradle rules to run
>>> CoreQuidemTest, and constructing the classpath manually was tedious. Did I
>>> miss something?
>
Hi,
Do you use the Flink operator or a manually deployed session cluster?
Thanks
On Fri, Feb 3, 2023, 4:32 AM P Singh wrote:
> Hi Team,
>
> I have set up a Flink cluster on GKE and am trying to submit a Beam
> pipeline with the options below. I was able to run this on a local machine but
> I don't unde
Hi,
I know we use the portability framework when the SDK language (Python) is
different from the runner's language (Java).
If my runner is Java-based and I want to use the portability framework for
the Java SDK, is there any optimization on the Beam side rather than running
two separate Docker images
suspect the problem lies there. I spent some time looking but wasn't able
> to find it by code inspection; it looks like this code path is doing the
> right thing with names. I'll spend some time tomorrow trying to reproduce
> this on pure Calcite.
>
> Andrew
>
>
> O
> Schema output_schema =
> Schema.builder().addInt32Field("fout_int").addStringField("fout_string").build();
> assertThat(result.getSchema(), equalTo(output_schema));
>
> Row output = Row.withSchema(output_schema).addValues(1,
> "strstr").build();
> PAssert.that(
generates Java bytecode to directly execute the DSL.
>>
>> Problem: it looks like the CalcRel has output columns with aliases "id"
>> and "v" where it should have output columns with aliases "id" and "value".
>>
>> Kenn
>>
>
Hi All,
I am using Beam 2.43 with Calcite SQL with Java.
I have a query with a WITH clause and some aliasing. It looks like the Beam
query optimizer drops the SELECT statement's aliases after optimizing my query.
Can you help me identify where the problem is?
This is my query
INFO: SQL:
WITH `tempT
Hi Ifat,
Did you enable the fasterCopy parameter?
Please look at this issue: https://issues.apache.org/jira/browse/BEAM-11146
Thanks
On Mon, May 2, 2022 at 12:57 AM Afek, Ifat (Nokia - IL/Kfar Sava) <
ifat.a...@nokia.com> wrote:
> Hi,
>
>
>
> I’m investigating a slowness of our beam pipelines, an
…sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamSortRel.java#L200>
>
Hi,
I am trying to calculate the top-k elements based on a streaming
aggregation per window. Do you know if there is any support for this
in Beam SQL, or how I can achieve this goal with Beam SQL on a stream?
Sample Query
SELECT customer_id, app, sum(bytes_total) as bytes_total FROM PCOLLECTION
GROUP BY c
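One way to get top-k per window without relying on SQL ORDER BY/LIMIT on a
stream: run the SQL aggregation first, then apply the core Top transform.
A sketch, assuming the field names from the sample query (k=10 and the single
constant key are illustrative):

  import java.io.Serializable;
  import java.util.Comparator;
  import java.util.List;
  import org.apache.beam.sdk.transforms.Top;
  import org.apache.beam.sdk.transforms.WithKeys;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.Row;

  class BytesTotalComparator implements Comparator<Row>, Serializable {
    @Override
    public int compare(Row a, Row b) {
      // assumes SUM(bytes_total) is a BIGINT field on the aggregated rows
      return Long.compare(a.getInt64("bytes_total"), b.getInt64("bytes_total"));
    }
  }

  // `aggregated` is the output of the SQL GROUP BY above; keying everything
  // to one constant key yields a global top-10 per window
  PCollection<KV<String, List<Row>>> topK =
      aggregated
          .apply(WithKeys.of("all"))
          .apply(Top.perKey(10, new BytesTotalComparator()));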
Hi,
Based on the Join documentation, if I have a Join with Unbounded and Bounded:
> For this type of JOIN bounded input is treated as a side-input by the
> implementation. This means that window/trigger is inherited from upstreams.
In my pipeline I don't have any triggering or window. I use a global
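A minimal sketch of the bounded-join pattern the quoted documentation
describes (table and column names are assumptions):

  import org.apache.beam.sdk.extensions.sql.SqlTransform;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.PCollectionTuple;
  import org.apache.beam.sdk.values.Row;
  import org.apache.beam.sdk.values.TupleTag;

  // the unbounded stream joined with a bounded lookup table; the bounded
  // side becomes a side input and windowing is inherited from the stream
  PCollection<Row> joined =
      PCollectionTuple.of(new TupleTag<>("orders"), unboundedOrders)
          .and(new TupleTag<>("products"), boundedProducts)
          .apply(
              SqlTransform.query(
                  "SELECT o.order_id, p.product_name "
                      + "FROM orders o JOIN products p ON o.product_id = p.id"));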
t update data with any mechanism.
Thanks
On Wed, Feb 10, 2021 at 12:41 PM Talat Uyarer
wrote:
> Does Beam create the UDF for every bundle or in the setup of the pipeline?
>
> I will keep internal state in memory. The async thread will update that in-
> memory state based on an interva
:37 PM Rui Wang wrote:
> The problem that I can think of is that maybe before the async call is
> completed, the UDF life cycle has reached the end.
>
>
> -Rui
>
> On Wed, Feb 10, 2021 at 12:34 PM Talat Uyarer <
> tuya...@paloaltonetworks.com> wrote:
>
>> Hi,
Hi,
We plan to use a UDF in our SQL. We want to achieve some kind of
filtering based on internal state. We want to update that internal state
with a separate async thread in the UDF. Before implementing this I want
to get your opinions. Is there any limitation for a UDF to have a multi-thread
implemen
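A hypothetical sketch of the pattern being asked about: a Beam SQL UDF whose
static state is refreshed by a daemon thread (class, method names, and the
refresh interval are assumptions; note the lifecycle caveat discussed in
this thread):

  import java.util.Collections;
  import java.util.Set;
  import java.util.concurrent.Executors;
  import java.util.concurrent.ScheduledExecutorService;
  import java.util.concurrent.TimeUnit;
  import org.apache.beam.sdk.extensions.sql.BeamSqlUdf;

  public class IsAllowedUdf implements BeamSqlUdf {
    private static volatile Set<String> allowed = Collections.emptySet();

    static {
      ScheduledExecutorService refresher =
          Executors.newSingleThreadScheduledExecutor(
              r -> {
                Thread t = new Thread(r, "udf-state-refresh");
                t.setDaemon(true); // do not keep the worker JVM alive
                return t;
              });
      refresher.scheduleAtFixedRate(IsAllowedUdf::reload, 0, 60, TimeUnit.SECONDS);
    }

    private static void reload() {
      // placeholder: fetch the latest allowed set from an external store;
      // must tolerate the UDF being torn down before a refresh completes
      allowed = loadFromExternalStore();
    }

    private static Set<String> loadFromExternalStore() {
      return Collections.emptySet(); // stub
    }

    public static Boolean eval(String key) {
      return allowed.contains(key);
    }
  }

  // registered with e.g.:
  // SqlTransform.query("SELECT * FROM PCOLLECTION WHERE IS_ALLOWED(customer_id)")
  //     .registerUdf("IS_ALLOWED", IsAllowedUdf.class)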
>>>
>>>>>> Reuven
>>>>>>
>>>>>> On Tue, Dec 8, 2020 at 9:33 AM Brian Hulette
>>>>>> wrote:
>>>>>>
>>>>>>> Reuven, could you clarify what you have in mind? I know multiple
>>>>>>
compatibility
>>>>>> support to SchemaCoder, including support for certain schema changes
>>>>>> (field
>>>>>> additions/deletions) - I think the most recent discussion was here [1].
>>>>>>
>>>>>> But it sounds
might be theoretically possible to do this (at least for the
> case where existing fields do not change). Whether anyone currently has
> available time to do this is a different question, but it's something that
> can be looked into.
>
> On Mon, Dec 7, 2020 at 9:29 PM Talat Uyar
ges, are these new fields being added to the
> schema? Or are you making changes to the existing fields?
>
> On Mon, Dec 7, 2020 at 9:02 PM Talat Uyarer
> wrote:
>
>> Hi,
>> For sure let me explain a little bit about my pipeline.
>> My Pipeline is actually simple
>
m unless you are
> also changing the SQL statement?
>
> Is this a case where you have a SELECT *, and just want to make sure those
> fields are included?
>
> Reuven
>
> On Mon, Dec 7, 2020 at 6:31 PM Talat Uyarer
> wrote:
>
>> Hi Andrew,
>>
>> I assume
QL to try to produce
> the same plan for an updated SQL query.
>
> Andrew
>
> On Mon, Dec 7, 2020 at 5:44 PM Talat Uyarer
> wrote:
>
>> Hi,
>>
>> We are using Beamsql on our pipeline. Our Data is written in Avro format.
>> We generate our rows bas
Hi,
We are using Beam SQL in our pipeline. Our data is written in Avro format.
We generate our rows based on our Avro schema. Over time the schema
changes. I believe Beam SQL generates Java code based on what we define as
the Beam schema while submitting the pipeline. Do you have any idea how we can
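A sketch of one common way this row generation is wired up, assuming
AvroUtils for the schema/row conversion (not necessarily the thread's exact
code; `records` and `avroSchemaJson` are placeholders):

  import org.apache.avro.generic.GenericRecord;
  import org.apache.beam.sdk.schemas.Schema;
  import org.apache.beam.sdk.schemas.utils.AvroUtils;
  import org.apache.beam.sdk.transforms.MapElements;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.Row;
  import org.apache.beam.sdk.values.TypeDescriptor;

  // the Beam schema is derived from the Avro schema at pipeline-construction
  // time, which is why later schema changes require a new submission
  org.apache.avro.Schema avroSchema =
      new org.apache.avro.Schema.Parser().parse(avroSchemaJson);
  Schema beamSchema = AvroUtils.toBeamSchema(avroSchema);
  PCollection<Row> rows =
      records // a PCollection<GenericRecord>
          .apply(
              MapElements.into(TypeDescriptor.of(Row.class))
                  .via((GenericRecord rec) -> AvroUtils.toBeamRowStrict(rec, beamSchema)))
          .setRowSchema(beamSchema);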
GC thrashing.
> Memory is used/total/max = 4112/5994/5994 MB, GC last/max = 97.36/97.36 %,
> #pushbacks=3, gc thrashing=true. Heap dump not written.
Also, my processing rate is 4kps per instance. I would like to hear your
suggestions if you have any.
Thanks
On Wed, Sep 2, 2020 at 6:22 PM Tal
I also tried Brian's suggestion to clear the StringBuilder by calling delete
with the StringBuilder's length. No luck. I am still getting the same error
message. Do you have any suggestions?
Thanks
On Wed, Sep 2, 2020 at 3:33 PM Talat Uyarer
wrote:
> If I'm understanding Talat's log
to-reuse-a-stringbuilder-in-a-loop
>> <https://stackoverflow.com/questions/242438/is-it-better-to-reuse-a-stringbuilder-in-a-loop>
>
> On Wed, Sep 2, 2020 at 3:02 PM Talat Uyarer
> wrote:
>
>> Sorry for the wrong import. You can see in the code I am using
>> Stri
>
> Could you try using StringBuilder instead since the usage is not
> appropriate for a StringWriter?
>
>
> On Wed, Sep 2, 2020 at 2:49 PM Talat Uyarer
> wrote:
>
>> Hi,
>>
>> I have an issue with String Co
Hi,
I have an issue with string concatenation. You can see my code below. [1] I
have a step in my DF job which concatenates strings. But somehow, when I
use that step, my job starts getting JVM restart errors:
Shutting down JVM after 8 consecutive periods of measured GC thrashing.
> Memory is u
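The reuse pattern suggested in this thread, as a minimal sketch: one
StringBuilder per DoFn instance, reset per element with setLength(0) instead
of allocating (or endlessly growing) one per element. Names are illustrative:

  import org.apache.beam.sdk.transforms.DoFn;

  class ConcatFn extends DoFn<String, String> {
    private transient StringBuilder sb;

    @Setup
    public void setup() {
      // a DoFn instance is never used by two threads at once, so a
      // per-instance buffer is safe
      sb = new StringBuilder(1024);
    }

    @ProcessElement
    public void process(ProcessContext c) {
      sb.setLength(0); // reset, keeping the already-allocated buffer
      sb.append(c.element()).append(',').append(c.timestamp());
      c.output(sb.toString());
    }
  }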
? (Even if MessageExtractor seems simple, it isn't free. You
> now have to write to two GCS locations instead of one for each work item
> that you process, so you're doing more network calls.)
>
>
> On Wed, Aug 19, 2020 at 8:36 PM Talat Uyarer
> wrote:
>
>> Filter
you can have a single Pipeline and do as
> many reads as you want, and they'll all get executed in the same job.
>
> On Wed, Aug 19, 2020 at 4:14 PM Talat Uyarer
> wrote:
> >
> > Hi Robert,
> >
> > I calculated processing speed based on worker count. When I have sep
> ).
>
> - Robert
>
> On Wed, Aug 19, 2020 at 3:37 PM Talat Uyarer
> wrote:
> >
> > Hi,
> >
> > I have a very simple DAG on my dataflow job.
> (KafkaIO->Filter->
Hi,
I have a very simple DAG in my Dataflow job (KafkaIO->Filter->WriteGCS).
When I submit this Dataflow job per topic, it has a processing speed of 4kps
per instance. However I want to consume two different topics in one DF
job. I used TupleTag. I created TupleTags per message type. Each topic has
dif
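A minimal sketch of the single-pipeline, multiple-reads alternative suggested
later in this thread (brokers, topics, paths, and windowing are placeholders):

  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.io.TextIO;
  import org.apache.beam.sdk.io.kafka.KafkaIO;
  import org.apache.beam.sdk.transforms.Values;
  import org.apache.beam.sdk.transforms.windowing.FixedWindows;
  import org.apache.beam.sdk.transforms.windowing.Window;
  import org.apache.kafka.common.serialization.StringDeserializer;
  import org.joda.time.Duration;

  // one job, one independent read + sink per topic; no TupleTag multiplexing
  Pipeline p = Pipeline.create(options);
  for (String topic : new String[] {"topic-a", "topic-b"}) {
    p.apply("Read-" + topic,
            KafkaIO.<String, String>read()
                .withBootstrapServers("broker:9092") // placeholder broker
                .withTopic(topic)
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata())
        .apply("Values-" + topic, Values.create())
        // windowed writes are required for an unbounded source
        .apply("Window-" + topic, Window.into(FixedWindows.of(Duration.standardMinutes(5))))
        .apply("Write-" + topic,
            TextIO.write().to("gs://bucket/" + topic + "/out")
                .withWindowedWrites().withNumShards(4));
  }
  p.run();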
> [3] https://issues.apache.org/jira/browse/BEAM-9977
Hi,
I am using Dataflow. When I update my DF job with a new topic or update
the partition count on the same topic, I cannot use DF's update function.
Could you help me understand why I cannot use the update function for
that?
I checked the Beam source code but I could not find the right place t
Beam 2.19 is using Gradle 5.2.1; is the installed version
> compatible with that?
>
> Try
> ./gradlew :runners:google-cloud-dataflow-java:worker:shadowJar
> in a clean workspace.
>
> On Fri, Jun 12, 2020 at 4:30 PM Talat Uyarer
> wrote:
>
>> Hi,
>>
>>
Hi,
I want to build the Dataflow worker on Apache Beam 2.19. However I faced a
gRPC issue. I did not change anything; I just checked out the release-2.19.0
branch and ran the build command. Could you help me understand why it does not
build. [1]
Additional information: based on my limited knowledge, it looks like
ng and closing bundle*”?
>
> Sometimes, starting a KafkaReader can take time since it will seek for a
> start offset for each assigned partition but it happens only once at
> pipeline start-up and mostly depends on network conditions.
>
> On 9 Jun 2020, at 23:05, Talat Uyarer
> wrote
Hi,
I added some metrics on a step right after KafkaIO. When I compare the read
time difference between the producer and KafkaIO, it is 800ms for P99. However,
somehow that step's opening and closing bundle difference is 18 seconds for
P99. The step itself does not do anything specific. Do you have any
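A sketch of how such a bundle-duration measurement can be taken, assuming a
pass-through DoFn with Beam Metrics (names are illustrative):

  import org.apache.beam.sdk.metrics.Distribution;
  import org.apache.beam.sdk.metrics.Metrics;
  import org.apache.beam.sdk.transforms.DoFn;

  class BundleTimingFn extends DoFn<String, String> {
    private final Distribution bundleMillis =
        Metrics.distribution(BundleTimingFn.class, "bundleMillis");
    private transient long bundleStart;

    @StartBundle
    public void startBundle() {
      bundleStart = System.currentTimeMillis();
    }

    @ProcessElement
    public void process(ProcessContext c) {
      c.output(c.element()); // pass-through
    }

    @FinishBundle
    public void finishBundle() {
      // records wall-clock time between bundle open and close
      bundleMillis.update(System.currentTimeMillis() - bundleStart);
    }
  }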
<https://beam.apache.org/releases/javadoc/2.21.0/org/apache/beam/sdk/transforms/DoFn.ProcessElement.html>
>
>
> On Thu, May 28, 2020 at 1:12 PM Talat Uyarer
> wrote:
>
>> Yes I am trying to track how long
essing takes?
> Do you care about how much CPU time is being consumed in aggregate for all
> the processing that your pipeline is doing?
>
>
> On Thu, May 28, 2020 at 11:01 AM Talat Uyarer <
> tuya...@paloaltonetworks.com> wrote:
>
>> I am using Dataflow Runner.
I am using the Dataflow runner. The pipeline reads from KafkaIO and sends HTTP
requests. I could not find any metadata field on the element to set the first
read time.
On Thu, May 28, 2020 at 10:44 AM Kyle Weaver wrote:
> Which runner are you using?
>
> On Thu, May 28, 2020 at 1:43 PM Talat Uyarer
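A sketch of using the element timestamp instead of a metadata field; KafkaIO
assigns one at read time by default (configurable to log-append/create time),
so a step right after the read can track read latency:

  import org.apache.beam.sdk.metrics.Distribution;
  import org.apache.beam.sdk.metrics.Metrics;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.values.KV;
  import org.joda.time.Instant;

  class ReadLatencyFn extends DoFn<KV<String, String>, KV<String, String>> {
    private final Distribution readLatencyMs =
        Metrics.distribution(ReadLatencyFn.class, "readLatencyMs");

    @ProcessElement
    public void process(ProcessContext c) {
      // c.timestamp() is the element timestamp assigned by KafkaIO
      readLatencyMs.update(Instant.now().getMillis() - c.timestamp().getMillis());
      c.output(c.element());
    }
  }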
Hi,
I have a pipeline which has 5 steps. What is the best way to measure
the processing time for my pipeline?
Thanks
Hi,
I want to get Kafka's backlog metrics. In the Apache Beam code I saw Beam is
collecting those metrics here [1] as source metrics. However I cannot see
those metrics in Dataflow's Metrics Explorer. Do you know if there is any
way to get those metrics?
Also I saw there is a MetricsSink. But based on b
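For completeness, a sketch of pulling metrics from the PipelineResult
directly; whether a given runner exports source metrics this way varies:

  import org.apache.beam.sdk.PipelineResult;
  import org.apache.beam.sdk.metrics.MetricQueryResults;
  import org.apache.beam.sdk.metrics.MetricsFilter;

  PipelineResult result = pipeline.run();
  // an empty filter queries everything; narrow it with MetricsFilter.builder()
  MetricQueryResults metrics =
      result.metrics().queryMetrics(MetricsFilter.builder().build());
  metrics.getGauges()
      .forEach(g -> System.out.println(g.getName() + " = " + g.getAttempted()));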
ed
> Row structure (as they were flattened in LogicalPlan).
>
> Depends on how that patch works. Nested row might not immediately work
> after you apply that patch.
>
>
> -Rui
>
> On Fri, Feb 14, 2020 at 3:14 PM Talat Uyarer
> wrote:
>
>> Do you mean they were f
<https://jira.apache.org/jira/browse/CALCITE-3138>
>>>
>>>
>>
>
>
> -Rui
>
> On Thu, Feb 13, 2020 at 9:05 PM Talat Uyarer
> wrote:
>
>> Hi,
>>
>> I am trying to Beam SQ
Hi,
I am trying out Beam SQL, but something is wrong. I have nested row records.
I read them as a PCollection, apply a SELECT * query, and compare with the
initial rows. It looks like nested rows are flattened by Calcite. Do you
have any idea how I can avoid this?
I added a test case for my issue:
S
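A reconstruction of the kind of test case described, based on the names in
the patch quoted earlier in this thread (assumed, not the author's exact
test; `pipeline` is e.g. a TestPipeline):

  import org.apache.beam.sdk.extensions.sql.SqlTransform;
  import org.apache.beam.sdk.schemas.Schema;
  import org.apache.beam.sdk.testing.PAssert;
  import org.apache.beam.sdk.transforms.Create;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.Row;

  Schema innerSchema = Schema.builder().addStringField("string_field").build();
  Schema rowSchema = Schema.builder().addRowField("col", innerSchema).build();

  Row row =
      Row.withSchema(rowSchema)
          .addValues(Row.withSchema(innerSchema).addValues("innerStr").build())
          .build();

  PCollection<Row> result =
      pipeline
          .apply(Create.of(row).withRowSchema(rowSchema))
          .apply(SqlTransform.query("SELECT * FROM PCOLLECTION"));

  // if Calcite flattens the nested row, result's schema has string_field at
  // the top level instead of the single ROW field "col", and this fails
  PAssert.that(result).containsInAnyOrder(row);
  pipeline.run();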
…/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/options/DataflowWorkerLoggingOptions.java>
>
> On W
Hi All,
This is my first message to this mailing list. Please let me know if I am
sending it to the wrong list.
My stream processing jobs are running on the Google Cloud Dataflow engine. For
logging I am using Stackdriver. I added slf4j-jdk14 and slf4j-api at runtime
to enable Stackdriver. Howe
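A sketch of the options-based route for worker log levels, using the
DataflowWorkerLoggingOptions class linked above (the settings shown are
examples, not a recommendation):

  import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
  import org.apache.beam.runners.dataflow.options.DataflowWorkerLoggingOptions;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;

  DataflowPipelineOptions options =
      PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
  DataflowWorkerLoggingOptions logging =
      options.as(DataflowWorkerLoggingOptions.class);
  // default level for all worker logs, plus a per-package override
  logging.setDefaultWorkerLogLevel(DataflowWorkerLoggingOptions.Level.INFO);
  logging.setWorkerLogLevelOverrides(
      new DataflowWorkerLoggingOptions.WorkerLogLevelOverrides()
          .addOverrideForName("org.apache.kafka", DataflowWorkerLoggingOptions.Level.WARN));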