Re: Twister2 Runner

2025-06-30 Thread Kenneth Knowles
+1 good idea. If the upstream project is not updating then most likely this can be archived/removed. We did this for Apex and Gearpump without waiting for a major version bump. Here is that old thread: https://lists.apache.org/thread/p63nbgb6hx0rl8287zjt6q96zwwrtqmr Kenn On Fri, Jun 27, 2025 at

Re: [Python] Proper way to type dofns with multiple output tags?

2025-05-14 Thread Kenneth Knowles
ork on my plate. It's definitely on the list of type hinting improvements > though. > > On Wed, May 14, 2025 at 2:14 PM Kenneth Knowles wrote: > >> Replying to break the silence - in Java the DoFn is done according to the >> main output type, then phantom types on th

Re: [Python] Proper way to type dofns with multiple output tags?

2025-05-14 Thread Kenneth Knowles
Replying to break the silence - in Java the DoFn is done according to the main output type, then phantom types on the output tag are used to make sure non-main outputs are type safe (I wouldn't expect this sort of technique in Python) Anyone who is more expert in Beam Python typing stuff? +Jack Mc

Re: [Question] How to maintain cross-window state

2025-05-06 Thread Kenneth Knowles
I/O overhead from reading and writing to > Bigtable has become significant. As a result, we're exploring the > possibility of using stateful processing to manage this state more > efficiently. > > Best, > > Shaochen > > On 5 May 2025, at 17:46, Kenneth Knowles wrote: >

Re: [Question] How to maintain cross-window state

2025-05-05 Thread Kenneth Knowles
Hello! This is not possible in a simple way, because of the main fact: windows are processed simultaneously. Many windows may have some state and incoming data at the same time, even if the time ranges of your windows do not overlap. So, sharing state across windows would need concurrency control

Re: [Discussion] Deprecate ZetaSQL

2025-03-31 Thread Kenneth Knowles
> further, not earlier than 1 full quarter before the next Release (2.65.0 + > 3 months would be 2.68.0), and only when it causes issues on maintenance. > > Thanks, > > Yi > > > On Wed, Mar 26, 2025 at 12:18 PM Kenneth Knowles wrote: > >> +1 to this deprecatio

Re: [Discussion] Deprecate ZetaSQL

2025-03-26 Thread Kenneth Knowles
+1 to this deprecation. Thanks for putting together a clear summary. FWIW it also has significantly worse performance than Calcite SQL dialect, since it calls out to a ZetaSQL subprocess for most calculations, and that is less optimized than Beam's Fn API. Kenn On Tue, Mar 25, 2025 at 4:18 PM Ro

Re: Non-time based windowing

2025-02-03 Thread Kenneth Knowles
This idea makes perfect sense conceptually. Conceptually, a "window" is just a key that has the ability to know if it is "done" so we can stop waiting for more input and emit the aggregation. To do an aggregation over unbounded input, you need to know when to stop waiting for more data. In Beam, t

Re: Problem with Apache Beam BigQuery Batch Load Job

2024-10-25 Thread Kenneth Knowles
Hi Pranjal, If there is a bug in Beam, this is a good list to contact. If there is a problem with a GCP service, then GCP support is better. I see the code you shared, but what error or difficulty are you encountering? Kenn On Mon, Oct 21, 2024 at 2:33 PM Pranjal Pandit wrote: > Hi Kenneth /

Re: Beam Jobs built using ver 2.50 and above are not running with FlinkRunners v 1.16.x

2024-10-24 Thread Kenneth Knowles
Hi Rajath, Thanks for raising this. Do you mean that you cloned https://github.com/apache/beam-starter-java and tried to run on the FlinkRunner with version 1.16? it looks like the operation that is failing has been in Beam since 2.50.0 ( https://github.com/apache/beam/pull/26193) so I would exp

Re: [Question] Regarding custom metrics in beam

2024-10-22 Thread Kenneth Knowles
Thanks for reporting this. I know there's problems with the plumbing of metrics in the FlinkRunner, but I don't know the full extent. I'm actually starting to work on that right now. I filed https://github.com/apache/beam/issues/32895 with your report and I'll add others as I gather them up. Kenn

Re: Elements are not processed in parallel by a splittable DoFn on the Flink Runner

2024-10-03 Thread Kenneth Knowles
This is due to fusion. Semantic description of the pipeline is: - The ParDo(CreateQueryFn) outputs each query as a separate element. - The ParDo(ProcessQueryFn) processes the elements and it is up to the FlinkRunner how it wants to provide them, and also the Python SDK harness. In practice wha

Re: [DISCUSS] Drop Euphoria extension

2023-10-19 Thread Kenneth Knowles
Makes sense to me. Let's deprecate for the 2.52.0 release unless there is some objection. You can also look at the maven central downloads (I believe all PMC and maybe all committers can view this) compared to other Beam jars. Kenn On Mon, Oct 16, 2023 at 9:28 AM Jan Lukavský wrote: > Sure, tha

[ANNOUNCE] Apache Beam 2.51.0 Released

2023-10-18 Thread Kenneth Knowles
The Apache Beam Team is pleased to announce the release of version 2.51.0. You can download the release here: https://beam.apache.org/get-started/downloads/ This release includes bug fixes, features, and improvements detailed on the Beam Blog: https://beam.apache.org/blog/beam-2.51.0/ and the Gi

Re: simplest way to do exponential moving average?

2023-10-02 Thread Kenneth Knowles
Just to be pedantic about it: Jan's approach is preferred because it would be much more _parallel_. Any actual computation that depends on everything being in order is by definition not parallel (nothing to do with Beam). Kenn On Mon, Oct 2, 2023 at 5:00 AM Jan Lukavský wrote: > Hi, > > this de

Re: Seeking Assistance to Resolve Issues/bug with Flink Runner on Kubernetes

2023-08-14 Thread Kenneth Knowles
There is a slack channel linked from https://beam.apache.org/community/contact-us/ it is #beam on the-asf.slack.com (you find this via beam.apache.org -> Community -> Contact Us) It sounds like an issue with running a multi-language pipeline on the portable flink runner. (something which I am not

Re: Beam SQL found limitations

2023-07-10 Thread Kenneth Knowles
apache.org/docs/stream.html#sliding-windows > > I think this is precisely what I was trying to do with Beam SQL and the > syntax is also very intuitive. > > Could this be added to SQL roadmap? How hard it is for implementation? > > Best > > Piotr > On 31.05.2023 20:29,

Re: Beam SQL found limitations

2023-05-31 Thread Kenneth Knowles
actually resolving if a message belongs to a window > is done later when evaluating `LEFT JOIN`? Target DataFlow. I am still > learning Beam so there might be some core thing that I miss to understand > how it is processed. > > 2. Any hints on implementing FirestoreIOTableProvider? just

Re: Beam SQL found limitations

2023-05-26 Thread Kenneth Knowles
Just want to clarify that Beam's concept of windowing is really an event-time based key, and they are all processed logically simultaneously. SQL's concept of windowing function is to sort rows and process them linearly. They are actually totally different. From your queries it seems you are intere

Re: OpenJDK8 / OpenJDK11 container deprecation

2023-02-14 Thread Kenneth Knowles
SGTM. I asked on the PR if this could impact users, but having read the docker release calendar I am not concerned. The last update to the old version was in 2019, and the introduction of compatible versions was 2020. On Tue, Feb 14, 2023 at 3:01 PM Byron Ellis via user wrote: > FWIW I am Team U

Re: Beam SQL Alias issue while using With Clause

2023-01-23 Thread Kenneth Knowles
don't like to guess that upstream libraries have the bug, but in this case I wonder if the alias is lost in the Calcite optimizer rule for merging the projects and filters into a Calc. Kenn On Mon, Jan 23, 2023 at 10:13 AM Kenneth Knowles wrote: > I am not sure I understand the question,

Re: Beam SQL Alias issue while using With Clause

2023-01-23 Thread Kenneth Knowles
, and then generates Java bytecode to directly execute the DSL. Problem: it looks like the CalcRel has output columns with aliases "id" and "v" where it should have output columns with aliases "id" and "value". Kenn On Thu, Jan 19, 2023 at 6:01 PM Ahmet

Re: Checkpointing on Google Cloud Dataflow Runner

2022-08-29 Thread Kenneth Knowles
Hi Will, David, I think you'll find the best source of answer for this sort of question on the user@beam list. I've put that in the To: line with a BCC: to the dev@beam list so everyone knows they can find the thread there. If I have misunderstood, and your question has to do with building Beam it

Re: Bazel based build

2022-06-09 Thread Kenneth Knowles
A big reason we chose Gradle over Bazel was that we had a hackathon to implement both, and only one person* chose to work on Bazel while the Gradle team attracted many contributors. Having a build system with more widespread knowledge and interest is an important consideration. Basically I suggest

Re: Potential Bug: Beam + Flink + AutoValue Builders

2021-11-17 Thread Kenneth Knowles
Noticing that you had another question on this thread. If I understand correctly, the answer to your question is that Beam converting objects to/from Row uses bytecode generation for performance, and automatically figures out how to map the columns of the Row to builder methods. But I may be misund

Re: Apache Beam Go SDK Quickstart bugs

2021-11-08 Thread Kenneth Knowles
Awesome! Just going to add a few colleagues (who are subscribed anyhow) to make sure this hits the top of their inbox. +Robert Burke +Chamikara Jayalath +Kyle Weaver Kenn On Thu, Nov 4, 2021 at 11:40 PM Jeff Rhyason wrote: > I'm interested to see the Go SDK work with the Spark runner. Based

Re: How to read data from bigtable

2021-07-12 Thread Kenneth Knowles
Hi Raja & all, I'm moving the thread to user@beam.apache.org since I think many users may be interested in the answer to the question, or have other experiences to share. Kenn On Mon, Jul 12, 2021 at 12:58 PM Chamikara Jayalath wrote: > There's also an external connector that us directly suppo

Re: Help: Apache Beam Session Window with limit on number of events and time elapsed from window start

2021-07-07 Thread Kenneth Knowles
Hi Chandan, I am moving this thread to user@beam.apache.org. I think that is the best place to discuss. Kenn On Wed, Jul 7, 2021 at 9:32 AM Chandan Bhattad wrote: > Hi Team, > > Hope you are doing well. > > I have a use case around session windowing with some customizations. > > We need to hav

Re: No data sinks have been created yet.

2021-06-10 Thread Kenneth Knowles
Beam doesn't use Flink's sink API. I recall from a very long time ago that we attached a noop sink to each PCollection to avoid this error. +Kyle Weaver might know something about how this applies to Python on Flink. Kenn On Wed, Jun 9, 2021 at 4:41 PM Trevor Kramer wrote: > Hello Beam communi

Re: Allyship workshops for open source contributors

2021-06-07 Thread Kenneth Knowles
Yes please! On Thu, Jun 3, 2021, 18:32 Ratnakar Malla wrote: > +1 > > > -- > *From:* Austin Bennett > *Sent:* Thursday, June 3, 2021 6:20:25 PM > *To:* user@beam.apache.org > *Cc:* dev > *Subject:* Re: Allyship workshops for open source contributors > > +1, assumin

Re: RenameFields behaves differently in DirectRunner

2021-06-03 Thread Kenneth Knowles
> I'm assuming a CoderTester would require manually generating inputs right? > These input Rows represent an illegal state that we wouldn't test with. > (That being said I like the idea of a CoderTester in general) > > Brian > > On Wed, Jun 2, 2021 at 12:11 PM Kennet

Re: RenameFields behaves differently in DirectRunner

2021-06-02 Thread Kenneth Knowles
that wouldn't catch it. We could >> insert CoderProperties.coderDecodeEncodeEqual in a subsequent ParDo, but if >> the Direct runner already does an encode/decode before that ParDo, then >> that would have fixed the problem before we could see it. >> >> On Wed, Jun 2, 2021 at 11:53 AM Kenne

Re: RenameFields behaves differently in DirectRunner

2021-06-02 Thread Kenneth Knowles
gt;>>- JIRA ticket - https://issues.apache.org/jira/browse/BEAM-12442 >>>>>> >>>>>> >>>>>> On Tue, Jun 1, 2021 at 5:51 PM Reuven Lax wrote: >>>>>> >>>>>>> This transform is the same across all runn

Re: RenameFields behaves differently in DirectRunner

2021-06-01 Thread Kenneth Knowles
On Tue, Jun 1, 2021 at 12:42 PM Brian Hulette wrote: > Hi Matthew, > > > The unit tests also seem to be disabled for this as well and so I don’t > know if the PTransform behaves as expected. > > The exclusion for NeedsRunner tests is just a quirk in our testing > framework. NeedsRunner indicates

Re: Windowing

2021-05-26 Thread Kenneth Knowles
hanks, > Serge > > > > > On 26 May 2021 at 18:01:14, Kenneth Knowles (k...@apache.org) wrote: > > You can use Window.configure() to only set the values you want to change. > Is that what you mean? > > Kenn > > On Wed, May 26, 2021 at 8:42 AM Sozonoff Serge wrote

Re: Windowing

2021-05-26 Thread Kenneth Knowles
You can use Window.configure() to only set the values you want to change. Is that what you mean? Kenn On Wed, May 26, 2021 at 8:42 AM Sozonoff Serge wrote: > Hi All, > > I find myself having to pepper Window transforms all over my pipeline, I > count about 9 in order to get my pipeline to run.

Re: Error with Beam/Flink Python pipeline with Kafka

2021-05-24 Thread Kenneth Knowles
Thanks for the details in the gist. I would guess the key to the error is this: 2021/05/24 13:45:34 Initializing java harness: /opt/apache/beam/boot --id=1-2 --provision_endpoint=localhost:33957 2021/05/24 13:45:44 Failed to retrieve staged files: failed to retrieve /tmp/staged in 3 attemp

Re: How to flush window when draining a Dataflow pipeline?

2021-05-21 Thread Kenneth Knowles
+dev +Reuven Lax Advancing the watermark to infinity does have an effect on the GlobalWindow. The GlobalWindow ends a little bit before infinity :-). That is why this works to cause the output even for unbounded aggregations. Kenn On Fri, May 21, 2021 at 5:10 AM Jeff Klukas wrote: > Beam use

Re: Beam/Dataflow pipeline backfill via JDBC

2021-05-12 Thread Kenneth Knowles
Can you share some more details, such as code? We may identify something that relies upon assumptions from batch execution style. Also notably the Java DirectRunner does not have separate batch/streaming mode. It always executes in a "streaming" sort of way. It is also simpler in some ways so if y

Re: DirectRunner, Fusion, and Triggers

2021-05-12 Thread Kenneth Knowles
On Sat, May 8, 2021 at 12:00 AM Bashir Sadjad wrote: > Hi Beam-users, > > *TL;DR;* I wonder if DirectRunner does any fusion optimization > > and whether this has any impact on triggers/panes? > > *Details* (the context for

Re: Question on late data handling in Beam streaming mode

2021-04-23 Thread Kenneth Knowles
ri, Apr 23, 2021 at 8:14 AM Tao Li wrote: > >> Thanks @Kenneth Knowles . I understand we need to >> specify a window for groupby so that the app knowns when processing is >> “done” to output result. >> >> >> >> Is it possible to specify a event arrival/process

Re: Question on late data handling in Beam streaming mode

2021-04-22 Thread Kenneth Knowles
Hello! In a streaming app, you have two choices: wait forever and never have any output OR use some method to decide that aggregation is "done". In Beam, the way you decide that aggregation is "done" is the watermark. When the watermark predicts no more data for an aggregation, then the aggregati

Re: Checkpointing Dataflow Pipeline

2021-04-07 Thread Kenneth Knowles
mantics in the event of shutting down a stream and > resuming it at a later point. > > In Kafka you don't need this since as long as we ensure our offsets are > committed in finalization of a bundle, the offsets for a particular group > id are stored on the server. > >

Re: Checkpointing Dataflow Pipeline

2021-04-06 Thread Kenneth Knowles
run "streaming-beam-sql-`date > +%Y%m%d-%H%M%S`" \ > --template-file-gcs-location "$TEMPLATE_PATH" \ > --parameters inputSubscription="$SUBSCRIPTION" \ > --parameters outputTable="$PROJECT:$DATASET.$TABLE" \ > --region "

Re: Checkpointing Dataflow Pipeline

2021-04-06 Thread Kenneth Knowles
I would assume the main issue is resuming reading from the Kinesis stream from the last read? In the case for Pubsub (just as another example of the idea) this is part of the internal state of a pre-created subscription. Kenn On Tue, Apr 6, 2021 at 1:26 PM Michael Luckey wrote: > Hi list, > > w

Re: Global window + stateful transformation

2021-03-31 Thread Kenneth Knowles
On Wed, Mar 31, 2021 at 10:20 AM Kenneth Knowles wrote: > > On Wed, Mar 31, 2021 at 10:19 AM Hemali Sutaria < > hsuta...@paloaltonetworks.com> wrote: > >> I have a global window with per-key-and-window stateful processing >> dataflow job. Do I need groupbykey in my

Re: Global window + stateful transformation

2021-03-31 Thread Kenneth Knowles
Great question! Moving this to user@beam.apache.org Kenn On Wed, Mar 31, 2021 at 10:19 AM Hemali Sutaria < hsuta...@paloaltonetworks.com> wrote: > Beam Developers, > > I have a global window with per-key-and-window stateful processing > dataflow job. Do I need groupbykey in my transform ? Thank

Re: Triggering partway through a window

2021-03-29 Thread Kenneth Knowles
That's a neat example! The trigger you have there will emit a ton of output. What is your accumulation mode? I assume it must be accumulatingFiredPanes() otherwise you would not actually have access to the prior 6 days of input. The only trigger that is based on completeness of data is the AfterW

Re: General guidance

2021-03-25 Thread Kenneth Knowles
This is a Beam issue indeed, though it is an issue with the FlinkRunner. So I think I will BCC the Flink list. You may be in one of the following situations: - These timers should not be viewed as distinct by the runner, but deduped, per https://issues.apache.org/jira/browse/BEAM-8212#comment-169

Re: Write to multiple IOs in linear fashion

2021-03-24 Thread Kenneth Knowles
4, 2021 at 5:30 PM Reuven Lax wrote: > >> I believe that the Wait transform turns this output into a side input, so >> outputting the input PCollection might be problematic. >> >> On Wed, Mar 24, 2021 at 4:49 PM Kenneth Knowles wrote: >> >>> Alex's id

Re: Write to multiple IOs in linear fashion

2021-03-24 Thread Kenneth Knowles
Alex's idea sounds good and like what Vincent maybe implemented. I am just reading really quickly so sorry if I missed something... Checking out the code for the WriteFn I see a big problem: @Setup public void setup() { writer = new Mutator<>(spec, Mapper::saveAsync, "writes");

Do you use TimestampCombiner.EARLIEST with SlidingWindows or other overlapping windows?

2021-03-11 Thread Kenneth Knowles
Hi users, ** Do you use TimestampCombiner.EARLIEST with SlidingWindows or other overlapping windows? If no, then you can stop reading now. ** We are considering a change to simplify how timestamps are computed for aggregations in this case. Warning: this is a bit complicated and a curious corner

Re: Iterative algorithm in BEAM

2021-01-21 Thread Kenneth Knowles
Your idea works: first read the directory metadata and then downstream an SDF that jumps to file offsets. This is very much what SDF is for. Splitting within a zip entry will not be useful. If you have indeterminate nesting depth, you do need iterative computation. Beam doesn't directly support th

Re: Is there an array explode function/transform?

2021-01-13 Thread Kenneth Knowles
Just the fields specified, IMO. When in doubt, copy SQL. (and I mean SQL generally, not just Beam SQL) Kenn On Wed, Jan 13, 2021 at 11:17 AM Reuven Lax wrote: > Definitely could be a top-level transform. Should it automatically unnest > all arrays, or just the fields specified? > > We do have t

Re: Is there an array explode function/transform?

2021-01-12 Thread Kenneth Knowles
Explode is called UNNEST in Beam SQL (and I believe this is the more standard name). FlatMap(arr -> arr) is a simple and efficient implementation for straight Beam. Kenn On Tue, Jan 12, 2021 at 4:58 PM Kyle Weaver wrote: > @Reuven Lax yes I am aware of that transform, but >> that’s different

Re: Side input in streaming

2021-01-12 Thread Kenneth Knowles
ghly appreciate it. > > On Thu, 7 Jan 2021 at 18:29, Kenneth Knowles wrote: > >> Actually, if you want to actually re-read the BQ table then you need >> something more, following the pattern here: >> https://beam.apache.org/documentation/patterns/side-inputs/. There are >>

Re: Side input in streaming

2021-01-07 Thread Kenneth Knowles
; side input on setup would be the best option. > > On Tue, 5 Jan 2021 at 17:53, Kenneth Knowles wrote: > >> You have it basically right. However, there are a couple minor >> clarifications: >> >> 1. A particular window on the side input is not "ready" un

Re: Side input in streaming

2021-01-05 Thread Kenneth Knowles
You have it basically right. However, there are a couple minor clarifications: 1. A particular window on the side input is not "ready" until there has been some element output to it (or it has expired, which will make it the default value). Main input elements will wait for the side input to be re

Re: Combine with multiple outputs case Sample and the rest

2021-01-05 Thread Kenneth Knowles
Perhaps something based on stateful DoFn so there is a simple decision point at which each element is either sampled or not so it can be output to one PCollection or the other. Without doing a little research, I don't recall if this is doable in the way you need. Kenn On Wed, Dec 23, 2020 at 3:12

Re: Help measuring upcoming performance increase in flink runner on production systems

2020-12-21 Thread Kenneth Knowles
I really think we should make a plan to make this the default. If you test with the DirectRunner it will do mutation checking and catch pipelines that depend on the runner cloning every element. (also the DirectRunner doesn't clone). Since the cloning is similar in cost to the mutation detection, c

Re: [ANNOUNCE] Beam 2.25.0 Released

2020-10-26 Thread Kenneth Knowles
Hooray! Thanks Robin! On Mon, Oct 26, 2020 at 11:51 AM Rui Wang wrote: > Thank you Robin! > > > -Rui > > On Mon, Oct 26, 2020 at 11:44 AM Pablo Estrada wrote: > >> Thanks Robin! >> >> On Mon, Oct 26, 2020 at 11:06 AM Robin Qiu wrote: >> >>> The Apache Beam team is pleased to announce the relea

Re: Which Solr versions should be supported by Beam

2020-10-23 Thread Kenneth Knowles
This might be a good question for u...@solr.apache.org and/or d...@solr.apache.org, too. Kenn On Fri, Oct 23, 2020 at 6:24 AM Piotr Szuberski wrote: > Beam has quite old Solr dependency (5.x.y) which has been deprecated for a > long time. > > Solr dependency has recently been updated to 8.6.y,

Re: Count based triggers and latency

2020-10-12 Thread Kenneth Knowles
Another thing to keep in mind - apologies if it was already clear: triggering governs aggregation (GBK / Combine). It does not have any effect on stateful DoFn. On Mon, Oct 12, 2020 at 9:24 AM Luke Cwik wrote: > The default trigger will only fire when the global window closes which > does happen

Re: Support of per-key state after windowing

2020-08-23 Thread Kenneth Knowles
Yes :-) On Sun, Aug 23, 2020 at 2:16 PM Reuven Lax wrote: > Kenn - shouldn't the Reify happen before the rewindow? > > On Sun, Aug 23, 2020 at 11:08 AM Kenneth Knowles wrote: > >> >> >> On Sun, Aug 23, 2020 at 1:12 PM Dongwon Kim >> wrote: >> &

Re: Support of per-key state after windowing

2020-08-23 Thread Kenneth Knowles
;>>> ) >>>> .apply(GroupByKey.create()) >>>> // (F) >>>> .apply(ParDo.of(new MyDoFn())) >>>> // (D) >>> >>> >>> I had to include (E1), (E2), (E3), and (F) so that MyDoFn (D) can >>> iterate

Re: Support of per-key state after windowing

2020-08-22 Thread Kenneth Knowles
Hi Dongwon, On Sat, Aug 22, 2020 at 2:46 PM Dongwon Kim wrote: > Hi all, > > I'm using Beam 2.23.0 with FlinkRunner and a part of my unbounded pipeline > looks like below: > >> p.apply(WithKeys.of(...).withKeyType(...)) // (A) >> .apply(Window.into(FixedWindows.of(...)))// (B) > > .a

Re: Exceptions: Attempt to deliver a timer to a DoFn, but timers are not supported in Dataflow.

2020-07-29 Thread Kenneth Knowles
Hi Mohil, It helps also to tell us what version of Beam you are using and some more details. This looks related to https://issues.apache.org/jira/browse/BEAM-6855 which claims to be resolved in 2.17.0 Kenn On Mon, Jul 27, 2020 at 11:47 PM Mohil Khare wrote: > Hello all, > > I think I found the

Re: DoFn Timer fire multiple times

2020-07-15 Thread Kenneth Knowles
Hello! What runner are you using? Does this reproduce on multiple runners? (it is very quick to try out your pipeline on DirectRunner and local versions of open source runners like Flink, Spark, etc) If you can produce a complete working reproduction it will be easier for someone to debug. I do n

Re: How to import v2.20.0 to IntelliJ IDEA 2020.1.1?

2020-05-20 Thread Kenneth Knowles
This changed for me recently. I re-run my IntelliJ import regularly just to make sure setting up does not depend on painful manual configuration. My most recent re-run also ended up having no dependencies. I did not track this down or file a bug because I did not have time to confirm it was not jus

Re: Pattern for sharing a resource across all worker threads in Beam Java SDK

2020-05-01 Thread Kenneth Knowles
On Fri, May 1, 2020 at 2:01 PM Luke Cwik wrote: > All worker threads are going to see the same instance. > In Dataflow there is only one JVM containing user code per VM instance. > > One of the reasons around the portability effort is having the worker > nodes/JVMs be consistent across runners so

Re: Notifying the closure of a Window Period

2020-05-01 Thread Kenneth Knowles
the trigger "closing" and dropping data. What version of SDK are you using? Kenn > > .withAllowedLateness(Utilities.resolveDuration(options.getWindowLateness())) > > .discardingFiredPanes()).apply(textWriter); > > > > > > > > *From:* K

Re: Non-trivial joins examples

2020-05-01 Thread Kenneth Knowles
+dev @beam and some people who I talk about joins with Interesting! It is a lot to take in and fully grok the code, so calling in reinforcements... Generally, I think there's agreement that for a lot of real use cases, you have to roll your own join using the lower level Beam primitives. So I thi

Re: Notifying the closure of a Window Period

2020-05-01 Thread Kenneth Knowles
; > createTriggerFile(/*tigger file name*/ ".trigger"); *//writes > to the same directory of the current window. This fires multiple time > depending on the number of panes that have isLast() is true/ or write > operators (not sure exactly).* > > r

Re: Notifying the closure of a Window Period

2020-05-01 Thread Kenneth Knowles
I am guessing you will be affected by the fact that windows processed independently for each key. Is that what you are referring to when you mention multiple isLast() windows? Kenn On Fri, May 1, 2020 at 3:36 AM Truebody, Kyle wrote: > Hi all, > > > > We are working on a streaming pipeline that

Re: Stateful & Timely Call

2020-04-22 Thread Kenneth Knowles
The definition of batch mode for Dataflow is this: completely compute the result of one stage of computation before starting the next stage. There is no way around this. It does not have to do with using state and timers. If you are working with state & timers & triggers, and you are hoping for ou

Re: Running NexMark Tests

2020-04-21 Thread Kenneth Knowles
We should always want to shut down sources on final watermark. All incoming data should be dropped anyhow. Kenn On Tue, Apr 21, 2020 at 1:34 PM Luke Cwik wrote: > +dev > > When would we not want --shutdownSourcesOnFinalWatermark=true ? > > On Tue, Apr 21, 2020 at 1:22 PM Ismaël Mejía wrote: >

Re: Recommended Reading for Apache Beam

2020-04-21 Thread Kenneth Knowles
I believe Streaming Systems is the most Beam-oriented book available. Kenn On Mon, Apr 20, 2020 at 3:07 PM Rion Williams wrote: > Hi all, > > I posed this question over on the Apache Slack Community however didn't > get much of a response so I thought I'd reach out here. I've been looking > for

Re: Distributed Tracing in Apache Beam

2020-04-21 Thread Kenneth Knowles
+dev I don't have a ton of time to dig in to this, but I wanted to say that this is very cool and just drop a couple pointers (which you may already know about) like Explaining Outputs in Modern Data Analytics [1] which was covered by The Morning Paper [2]. This just happens to be something I rea

Re: Global Window

2020-04-15 Thread Kenneth Knowles
In batch, with all bounded data, processing time timers are typically not processed. This is because the window is first fully processed and expired. Can you explain a bit more about why you want a processing time timer in your use case? Kenn On Wed, Apr 15, 2020 at 9:41 PM Aniruddh Sharma wrot

Re: Using Self signed root ca for https connection in eleasticsearchIO

2020-04-07 Thread Kenneth Knowles
Hi Mohil, Thanks for the detailed report. I think most people are reduced capacity right now. Filing a Jira would be helpful for tracking this. Since I am writing, I will add a quick guess, but we should move to Jira. It seems this has more to do with Dataflow than ElasticSearch. The default for

Re: A new reworked Elasticsearch 7+ IO module

2020-03-06 Thread Kenneth Knowles
Since the user provides backendVersion, here are some possible levels of things to add in expand() based on that (these are extra niceties beyond the agreed number of releases to remove) - WARN for backendVersion < n - reject for backendVersion < n with opt-in pipeline option to keep it working

Re: GCS numShards doubt

2020-03-02 Thread Kenneth Knowles
For bounded data, each bundle becomes a file: https://github.com/apache/beam/blob/da9e17288e8473925674a4691d9e86252e67d7d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L356 Kenn On Mon, Mar 2, 2020 at 6:18 PM Kyle Weaver wrote: > As Luke and Robert indicated, unsetting nu

Re: MapReduceRunner

2020-02-26 Thread Kenneth Knowles
There's a (very old) prototype on a branch at https://github.com/apache/beam/tree/mr-runner. No one has taken up the work of finishing it up to merge it, and there's no one to maintain it. We intend to remove this incomplete runners from the website. Kenn On Wed, Feb 26, 2020 at 3:07 PM Maulik S

Re: Unbounded input join Unbounded input then write to Bounded Sink

2020-02-24 Thread Kenneth Knowles
I think actually it depends on the pipeline. You cannot do it all in SQL, but if you mix Java and SQL I think you can do this. If you write this: pipeline.apply(KafkaIO.read() ... ) .apply(Window.into(FixedWindows.of(1 minute)) .apply(SqlTransform("SELECT ... FROM stream1 JOIN

Re: Implementing custom session with max event/element count

2020-02-12 Thread Kenneth Knowles
I notice that you use the name "IntervalWindow" but you are calling methods that IntervalWindow does not have. Do you have a custom implementation of this class? Do you have a custom coder for your version of IntervalWindow? Kenn On Wed, Feb 12, 2020 at 7:30 PM Jainik Vora wrote: > Hi Everyone,

Re: Stability of Timer.withOutputTimestamp

2020-02-05 Thread Kenneth Knowles
It is definitely too new to be stable in the sense of not even tiny changes to the API / runtime compatibility. However, in my opinion it is so fundamental (and overdue) it will certainly exist in some form. Feel free to use it if you are OK with the possibility of minor compile-time adjustments

Re: BigQuery TIMESTAMP and TimestampedValue()

2020-01-22 Thread Kenneth Knowles
Ah, that's too bad. I wonder why they chose to put " UTC" on the end instead of just a "Z". Other than that, the format is RFC3339 and the iso8601 module does have the extension to use a space instead of a T to separate the date and time. I tested and if you strip the " UTC" then parsing succeeds.

Re: Escalating Checkpoint durations in Flink with SQS Source

2020-01-17 Thread Kenneth Knowles
roups them, > and generates some metrics. The read pipeline will operate fine for. > > I've also included the flink-conf that I was using. > > Stephen > > On Fri, Jan 17, 2020 at 11:47 AM Kenneth Knowles wrote: > > > > Starting with just checking some basic th

Re: Escalating Checkpoint durations in Flink with SQS Source

2020-01-17 Thread Kenneth Knowles
Starting with just checking some basic things: What is the space of keys? Is it constantly growing? Are you explicitly clearing out state from stale keys? In the global window, you don't get any state GC for free. Can you share repro code? Kenn On Fri, Jan 17, 2020 at 8:53 AM Stephen Patel wrot

Re: Question about triggering

2020-01-13 Thread Kenneth Knowles
th the direct runner I will tried that. The size of the > collections is about 600 records. > > Thanks > Regards > > On Thu, Jan 9, 2020 at 11:56 PM Kenneth Knowles wrote: > >> Does it have the same behavior in the direct runner? What are the sizes >> of intermediate PC

Re: Question about triggering

2020-01-09 Thread Kenneth Knowles
Does it have the same behavior in the direct runner? What are the sizes of intermediate PCollections? Kenn On Wed, Jan 8, 2020 at 1:05 PM Andrés Garagiola wrote: > Hi all, > > I'm doing some tests with beam and apache flink. I'm running the code > below: > > public static void main(String[] a

Re: Scio 0.8.0 released

2020-01-09 Thread Kenneth Knowles
So fast! Excellent. On Wed, Jan 8, 2020 at 11:28 AM Robert Bradshaw wrote: > Nice! > > On Wed, Jan 8, 2020 at 10:03 AM Neville Li wrote: > >> Hi all, >> >> We just released Scio 0.8.0. This is based on the most recent Beam 2.17.0 >> release and includes a lot of new features & bug fixes over th

Re: Fanout and OOMs on Dataflow while streaming to BQ

2019-11-22 Thread Kenneth Knowles
+Lukasz Cwik +Chamikara Jayalath It sounds like your high-fanout transform that listens for new files on Pubsub would be a good fit for Splittable DoFn (SDF). It seems like a fairly common use case that could be a useful general contribution to Beam. Kenn On Fri, Nov 22, 2019 at 7:50 AM Jeff K

Re: slides?

2019-11-15 Thread Kenneth Knowles
We have a section for this: https://beam.apache.org/community/presentation-materials/. Right now "Presentation Materials" has the appearance of carefully curated stuff from a core team. That was probably true three years ago, but now it is simply out of date. A lot of the material is so old that i

Re: Multiple triggers contained w/in side input?

2019-11-06 Thread Kenneth Knowles
ke how View.asMap() will crash if you have multiple triggers on a per-key Combine/Sum/etc. > Thanks, > Rahul > > On Wed, Nov 6, 2019 at 11:24 AM Kenneth Knowles wrote: > >> >> >> On Tue, Nov 5, 2019 at 9:29 PM Aaron Dixon wrote: >> >>> From

Re: Multiple triggers contained w/in side input?

2019-11-05 Thread Kenneth Knowles
On Tue, Nov 5, 2019 at 9:29 PM Aaron Dixon wrote: > From https://beam.apache.org/documentation/programming-guide/#side-inputs > > > If the side input has multiple trigger firings, Beam uses the value > from the latest trigger firing. This is particularly useful if you use a > side input with a si

Re: Streaming data from Pubsub to Spanner with Beam dataflow pipeline

2019-10-30 Thread Kenneth Knowles
Moving to user@beam.apache.org, the best mailing list for questions like this. Yes, this kind of workload is a core use case for Beam. If you have a problem, please write to this user list with details. Kenn On Wed, Oct 30, 2019 at 4:07 AM Taher Koitawala wrote: > Hi All, > My curren

Re: Joining PCollections to aggregates of themselves

2019-10-11 Thread Kenneth Knowles
This seems a great example of use of stateful DoFn. It has essentially the same structure as the example on the Beam blog but is more meaningful. Kenn On Fri, Oct 11, 2019 at 12:38 PM Robert Bradshaw wrote: > OK, the only way to do this would be via a non-determanistic stateful > DoFn that buff

Re: Limited join with stop condition

2019-10-10 Thread Kenneth Knowles
Interesting! I agree with Luke that it seems not a great fit for Beam in the most rigorous sense. There are many considerations: 1. We assume ParDo has side effects by default. So the model actual *requires* eager evaluation, not lazy, in order to make all the side effects happen. But for your cas

Re: Beam discarding massive amount of events due to Window object or inner processing

2019-10-08 Thread Kenneth Knowles
This is an unfortunate usability problem with triggers where you can accidentally close the window and drop all data. I think instead, you probably want this trigger: Repeatedly.forever( AfterProcessingTime .pastFirstElementInPane() .plusDelayOf(Dura

  1   2   3   >