Re: [VOTE] Release 2.66.0, release candidate #1

2025-06-23 Thread Jan Lukavský
+1 (binding) Tested Java SDK with FlinkRunner.  Jan On 6/18/25 22:33, XQ Hu via dev wrote: +1 (non-binding). Tested this with a Dataflow ML pipeline: https://github.com/google/dataflow-ml-starter/actions/runs/15742135818/job/44369850231 On Wed, Jun 18, 2025 at 3:32 PM Vitaly Terentyev via de

Re: Introducing Catalogs to Beam SQL

2025-06-16 Thread Jan Lukavský
with "@Internal" before the release was cut. Is that what y'all had in mind? Best, Ahmed On Fri, Jun 13, 2025 at 8:59 AM Jan Lukavský wrote: Hi Ahmed, thanks for such a quick write-up. This is a pretty good start! I left some comments, but if we have time pres

Re: Introducing Catalogs to Beam SQL

2025-06-13 Thread Jan Lukavský
we should agree on the proposed SQL syntax. Jan, I am as a Beam user and a small contributor, I've also been waiting for this feature. And if you don't mind, can we get Ahmed's changes in this version? Thanks On 2025/06/11 18:42:40 Jan Lukavský wrote: >

Re: Introducing Catalogs to Beam SQL

2025-06-11 Thread Jan Lukavský
Hi Ahmed, this is a great effort which is by no doubt greatly needed by the Beam project as a whole. On the other hand I think we should try to establish a way to pull the community into the discussion process. Could you sum up the the PR (not small) into a design document where we can have a

Re: [ANNOUNCE] New Committer: Shunping Huang

2025-06-10 Thread Jan Lukavský
Congrats Shunping! On 6/10/25 13:49, Ahmed Abualsaud via dev wrote: Congrats Shunping!! Very well deserved, thank you for all the contributions! On Sun, Jun 8, 2025 at 4:57 PM Rakesh Kumar wrote: Congratulations Shunping  🎉👏!!! On Sat, Jun 7, 2025, 9:47 AM Danny McCormick via dev

Re: Proposal: Implementing automated stale issue management (173 days inactivity + 7 days warning)

2025-05-27 Thread Jan Lukavský
Hi XQ, +1 generally, this seems to be the approach many other projects follow, so it seems reasonable. One note - the 7 day deadline feels a little too strict. I'd propose to change this to 150 days + 30 days, the total would be the same, but people can have more time to react. Thanks for th

Re: [VOTE] Release 2.65.0, release candidate #2

2025-05-09 Thread Jan Lukavský
+1 (binding) Tested Java SDK with Flink Runner.  Jan On 5/8/25 16:25, Danny McCormick via dev wrote: +1 (binding) Tested with a few ML pipelines on local and Dataflow runners Thanks, Danny On Thu, May 8, 2025 at 9:45 AM Yi Hu via dev wrote: +1 (non-binding) Tested Dataflow Templa

Re: [DESIGN] Beam Element Extended Metadata and future work

2025-05-08 Thread Jan Lukavský
.java [2] https://github.com/O2-Czech-Republic/proxima-platform/blob/master/beam/core/src/main/java/cz/o2/proxima/beam/core/transforms/retract/RetractPCollection.java On 5/7/25 16:02, Kenneth Knowles wrote: Ah, thanks for pointing it out. Fixed so anyone should be able to read and comment. Kenn On

Re: [DESIGN] Beam Element Extended Metadata and future work

2025-05-06 Thread Jan Lukavský
Hi Radek, thanks for the design docs! The second one seems to require authentication, can you open it, please? Thanks,  Jan On 5/6/25 20:59, Radek Stankiewicz via dev wrote: hi all, We’ve multiple projects in ideation, design or prototypes that share the common problem - need to extend Wi

Re: Beam Infrastructure: Health Status Report for Mar 2025

2025-04-11 Thread Jan Lukavský
Great job, thanks for your work! On 4/11/25 16:01, XQ Hu via dev wrote: Great work, Vitaly and your team! Thanks a lot! On Fri, Apr 11, 2025 at 9:48 AM Vitaly Terentyev via dev wrote: Dear Community, March was a dynamic month for Beam Infrastructure & Health. We began and ended

Re: [ANNOUNCE] New Committer: Vitaly Terentev

2025-03-25 Thread Jan Lukavský
Congrats Vitaly! On 3/24/25 23:43, LDesire wrote: Congraturation Vitaly! 2025. 3. 25. 오전 7:42, Rakesh Kumar via dev 작성:  Congrats Vitaly!!! On Mon, Mar 24, 2025 at 2:37 PM Yi Hu via dev wrote: Congratulations Vitaly! On Mon, Mar 24, 2025 at 3:28 PM Robert Burke wrote:

Re: [DESIGN] Timers, Watermark Holds, Loops, Batch and Drain

2025-03-12 Thread Jan Lukavský
Hi Kenn, thanks for putting this down on paper. This is great initiative as it might touch some core parts of the model we are actually somewhat circling around. I left some comments and I'm looking forward to the broad discussion this definitely deserves.  Jan On 3/11/25 15:46, Kenneth Kno

Re: [VOTE] Release 2.63.0, release candidate #2

2025-02-13 Thread Jan Lukavský
+1 (binding) Tested Java SDK with Flink Runner.  Jan On 2/12/25 19:46, XQ Hu via dev wrote: +1 (non-binding). Tested the Python SDK with a simple Dataflow ML pipeline: https://github.com/google/dataflow-ml-starter/actions/runs/13291770412/job/37114110434 On Wed, Feb 12, 2025 at 12:05 PM Jac

Re: Using resource hints or annotations for transform expansion

2025-01-20 Thread Jan Lukavský
any other). We need a way to (optionally) change the behavior of a transform that is part of some outer composite. Therefore this should work for FileIO.write(...).addAnnotation(GroupByKey.HUGE) as well. Kenn On Wed, Jan 15, 2025 at 12:36 PM Robert Bradshaw via dev wrote: On Wed, Jan 15

Re: Using resource hints or annotations for transform expansion

2025-01-15 Thread Jan Lukavský
Right now the only way to populate annotations is to override the annotations() method when subclassing PTransform, but this is a limitation that would be very nice to remove (e.g. a with_annotations(key=value) in Python, or withAnnotation(key, value) in java. - Robert On Tue, Jan 14, 2025 at 8:

Re: Using resource hints or annotations for transform expansion

2025-01-14 Thread Jan Lukavský
we document which resource hints impact which runners. Thanks, Danny On Tue, Jan 14, 2025 at 6:02 AM Jan Lukavský wrote: Hi, as part of reviewing [1], I came across a question, which might be solved using resource hints. Seems the usage of th

Using resource hints or annotations for transform expansion

2025-01-14 Thread Jan Lukavský
Hi, as part of reviewing [1], I came across a question, which might be solved using resource hints. Seems the usage of these hints is currently limited, though. I'll explain the case in a few points:  a) a generic implementation of GBK on Spark assumes that all values fit into memory  b) t

Re: [ANNOUNCE] New PMC Member: Danny McCormick

2024-12-20 Thread Jan Lukavský
Congrats Danny! Well deserved! On 12/20/24 21:22, Danny McCormick via dev wrote: Thanks everyone! I'm excited and honored to join! On Fri, Dec 20, 2024 at 3:08 PM Ravi Magham wrote: Congrats Danny ! On Fri, Dec 20, 2024 at 12:03 PM Valentyn Tymofieiev via dev wrote: S

Automatic spotlessApply

2024-12-02 Thread Jan Lukavský
Hi, there is interesting thread in Maven dev list [1], which discusses automatic spotlessApply as part of build. I personally use exactly same approach on other projects - i.e. compileJava depends on spotlessApply. The benefit is that CI rarely fails due to spotless (the only possibility is a

Re: Looking for Spark and Samza Runner user helping with Java11 support

2024-11-26 Thread Jan Lukavský
Hi, issues like this arise from the fact that we have tight coupling between various parts of our ecosystem - the model, the core and runners. We should decouple this and enable runners to have their own release cycles, because anything other will not scale in the long run. We cannot have mor

Re: [VOTE] Release 2.61.0, release candidate #3

2024-11-25 Thread Jan Lukavský
+1 (binding) Validated java SDK with Flink Runner.  Jan On 11/22/24 22:44, Yi Hu via dev wrote: +1 (non-binding) Tested with GCP IO load tests - https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/main/it/google-cloud-platform Note that DataflowTemplate tests were already validat

Re: [VOTE] Release 2.61.0, release candidate #2

2024-11-20 Thread Jan Lukavský
+1 (binding) Tested Java SDK with Flink Runner.  Jan On 11/19/24 17:11, XQ Hu via dev wrote: +1 (non-binding). Tested this with a simple Dataflow ML job (https://github.com/google/dataflow-ml-starter/actions/runs/11915579645/job/33206160169). On Mon, Nov 18, 2024 at 9:01 PM Danny McCormick v

Re: [VOTE] Release 2.60.0, release candidate #2

2024-10-16 Thread Jan Lukavský
+1 (binding) Tested Flink Runner with java SDK.  Jan On 10/16/24 01:35, Valentyn Tymofieiev via dev wrote: +1 (binding), checked dataflow containers and ran a few Python pipelines. On Tue, Oct 15, 2024 at 4:22 PM Ahmet Altay via dev wrote: +1 (binding) On Tue, Oct 15, 2024 at 4:1

Re: Query Regarding Customizing Apache Beam for Sequence-Based Workload Processing

2024-10-01 Thread Jan Lukavský
Hi Kenn, unfortunately the support for this annotation is not as good as it could be. AFAIK it is currently supported only on Java Direct, Flink, Spark and DataFlow batch runners. DataFlow streaming does not support this. There was some discussion that the expansion could be implemented by a

Re: [DISCUSS] Beam 3.0: Paving the Path to the Next Generation Data Processing Framework

2024-08-20 Thread Jan Lukavský
Beam on the market - simplicity and correctness should be the key points, because practice shows people tend to misunderstand the streaming concepts (which is absolutely understandable!). Best,  Jan On 8/20/24 14:38, Jan Lukavský wrote: Hi XQ, thanks for starting this discussion! I agree we

Re: [DISCUSS] Beam 3.0: Paving the Path to the Next Generation Data Processing Framework

2024-08-20 Thread Jan Lukavský
Hi XQ, thanks for starting this discussion! I agree we are getting to a point when discussion a major update of Apache Beam might be good idea. Because such window of opportunity happens only once in (quite many) years, I think we should try to use our current experience with the Beam model i

Re: [VOTE] Release 2.58.0, release candidate #2

2024-08-01 Thread Jan Lukavský
+1 (binding) Tested FlinkRunner with Java SDK.  Jan On 7/30/24 21:12, Jack McCluskey via dev wrote: Hi everyone, Please review and vote on the release candidate #2 for the version 2.58.0, as follows: [ ] +1, Approve the release [ ] -1, Do not approve the release (please provide specific com

Re: [VOTE] Release 2.58.0, release candidate #1

2024-07-19 Thread Jan Lukavský
+1 (binding) Validated Flink Runner with Java SDK.  Jan On 7/18/24 16:24, XQ Hu via dev wrote: +1 (non-binding). Tested it with a simple Dataflow ML pipeline and looks good: https://github.com/google/dataflow-ml-starter/actions/runs/9991815564/job/27615395240 On Wed, Jul 17, 2024 at 5:35 PM

Re: [ANNOUNCE] New Committer: XQ Hu

2024-06-24 Thread Jan Lukavský
Congratulations XQ!  Jan On 6/25/24 06:43, Chamikara Jayalath via dev wrote: Congrats XQ! On Mon, Jun 24, 2024 at 9:28 PM Ritesh Ghorse via dev wrote: Congratulations XQ! Well deserved! On Tue, Jun 25, 2024 at 7:37 AM Damon Douglas wrote: Congratulations! 🎉🎉🎉

Re: [VOTE] Release 2.57.0, release candidate #1

2024-06-24 Thread Jan Lukavský
+1 (binding) Tested Java SDK with Flink Runner 1.18.  Jan On 6/22/24 06:43, Jean-Baptiste Onofré wrote: +1 (binding) Regards JB On Fri, Jun 21, 2024 at 10:17 PM Kenneth Knowles wrote: Hi everyone, Please review and vote on the release candidate #1 for the version 2.57.0, as follows: [ ]

Re: University of Innsbruck - Apache Beam options

2024-06-06 Thread Jan Lukavský
Hi Stefan, the settings in the runner works differently than how you might expect. According to the Beam model, a "bundle" is a unit of work, that either completes successfully as a whole or is restarted from scratch (i.e. it is perfectly fine for the bundle to be processed to the middle, then

Re: design docs that get deleted, etc

2024-05-30 Thread Jan Lukavský
medium (not even wikis) that facilitates conversation/collaborative editing to the extent that docs does, but I agree with the downside that ownership by random individuals can pose a problem. On Wed, May 29, 2024 at 7:07 AM Jan Lukavský wrote: Hi, regarding changing the way we document past (and more

Re: design docs that get deleted, etc

2024-05-29 Thread Jan Lukavský
Hi, regarding changing the way we document past (and more importantly future) changes, I've always been a big fan of the FLIP analogy [1]. I would love if we could make this work for Beam as well, while preserving the 'informal' part that I believe all of us want to keep. On the other hand, t

Flink 1.18 quickstart issues

2024-05-22 Thread Jan Lukavský
Hi, currently our documentation for running quickstart examples uses maven and `exec:java` mojo. Unfortunately, there is a (not yet confirmed, but likely, see [1], [2]) Flink bug which causes Flink runner 1.18 to fail. The bug causes MiniCluster (which is used in local testing) use wrong clas

Re: PCollection#applyWindowingStrategyInternal

2024-05-05 Thread Jan Lukavský
wo windows merge into a single S3. Retractions solve this by hooking into the window merging logic, retracting the outputs for S1 and S2 before outputting S3. I don't think this is possible today with a DSL. On Thu, Apr 25, 2024 at 5:46 AM Jan Lukavský wrote: > To implement retracti

Re: [VOTE] Release 2.56.0, release candidate #2

2024-04-29 Thread Jan Lukavský
+1 (binding). Tested Java SDK with Flink runner.  Jan On 4/28/24 15:32, XQ Hu via dev wrote: +1 (non-binding). Tested it using the dataflow ML pipeline: https://github.com/google/dataflow-ml-starter/actions/runs/8862170843/job/24334816481 On Sat, Apr 27, 2024 at 7:42 AM Danny McCormick via d

Re: [VOTE] Release 2.56.0, release candidate #1

2024-04-26 Thread Jan Lukavský
+1 (binding) Validated Java SDK with Flink runner.  Jan On 4/25/24 05:17, XQ Hu via dev wrote: +1 (non binding). Tested the simple Dataflow ML job: https://github.com/google/dataflow-ml-starter/actions/runs/8824985423/job/24228468173 On Wed, Apr 24, 2024 at 2:01 PM Danny McCormick via dev

Re: PCollection#applyWindowingStrategyInternal

2024-04-25 Thread Jan Lukavský
t of the "retraction combine"). On 4/23/24 18:08, Reuven Lax via dev wrote: On Tue, Apr 23, 2024 at 7:52 AM Jan Lukavský wrote: On 4/22/24 20:40, Kenneth Knowles wrote: I'll go ahead and advertise https://s.apache.org/beam-sink-triggers again for this thread.

Re: PCollection#applyWindowingStrategyInternal

2024-04-23 Thread Jan Lukavský
frequency themselves with timers. One could however still propagate the trigger upstream of the stateful ParDo, though I'm not sure if that's the best approach. On Mon, Apr 15, 2024 at 11:31 PM Jan Lukavský wrote: On 4/11/24 18:20, Reuven Lax via dev wrote: I&#x

Re: PCollection#applyWindowingStrategyInternal

2024-04-15 Thread Jan Lukavský
y when the upstream ParDo emits any data. Yes, one can argue that stateful ParDo is supposed to emit data at fast as possible, then this seems to work. On Thu, Apr 11, 2024 at 5:10 AM Jan Lukavský wrote: I've probably heard about it, but I never read the proposal. Sounds great, bu

Re: PCollection#applyWindowingStrategyInternal

2024-04-11 Thread Jan Lukavský
is in contrast to today where we attach triggering to the windowing information. This was a proposal some years back and there was some effort made to implement it, but the implementation never really got off the ground. On Wed, Apr 10, 2024 at 12:43 AM Jan Lukavský wrote: On 4/9/24 18:

Re: PCollection#applyWindowingStrategyInternal

2024-04-10 Thread Jan Lukavský
we sure our windowing and triggering semantics is 100% correct"? Probably the - wrong - expectations at the beginning of this thread were due to conflict in my mental model of how things 'could' work as opposed to how they actually work. :)  Jan Kenn On Tue, Apr 9, 2024 at 9:19 

Re: PCollection#applyWindowingStrategyInternal

2024-04-09 Thread Jan Lukavský
CollectionView class) and very poorly documented, so I doubt many users know about this! Reuven On Sat, Apr 6, 2024 at 7:09 AM Jan Lukavský wrote: Immediate self-correction, although setting the strategy directly via setWindowingStrategyInternal() *seemed* to be working during Pipeline

Re: PCollection#applyWindowingStrategyInternal

2024-04-06 Thread Jan Lukavský
there remains the other question if we can make flattening PCollections with incompatible windowFns more user-friendly. The current approach where we require the same windowFn for all input PCollections creates some unnecessary boilerplate code needed on user side.  Jan On 4/6/24 15:45, Jan

PCollection#applyWindowingStrategyInternal

2024-04-06 Thread Jan Lukavský
Hi, I came across a case where using PCollection#applyWindowingStrategyInternal seems legit in user core. The case is roughly as follows:  a) compute some streaming statistics  b) apply the same transform (say ComputeWindowedAggregation) with different parameters on these statistics yieldin

Re: Patch release proposal

2024-03-28 Thread Jan Lukavský
+1 to either doing full release or deferring to 2.56.0.  Jan On 3/28/24 16:52, Yi Hu via dev wrote: > Just releasing Python can break multi-lang by default (unless expansion service is overridden manually) since we match versions across languages when picking the default expansion service. Y

Re: [VOTE] Release 2.55.0, release candidate #3

2024-03-21 Thread Jan Lukavský
+1 (binding) Tested Java SDK with FlinkRunner.  Jan On 3/20/24 22:40, Chamikara Jayalath via dev wrote: +1 (binding) Tested multi-lang Java/Python pipelines and upgrading BQ/Kafka transforms from 2.53.0 to 2.55.0 using the Transform Service. Thanks, Cham On Tue, Mar 19, 2024 at 2:10 PM XQ

Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-29 Thread Jan Lukavský
not create the problem of "vector watermarks". The Throttle transform would then use the backling for feedback loop to slowdown the request rate. On 2/29/24 14:57, Jan Lukavský wrote: From my understanding Flink rate limits based on local information only. On the other hand - in cas

Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-29 Thread Jan Lukavský
e such a Cross Worker Communication Pair (the processor/server + DoFn client) could be built, but not purely be limited to Rate limiting/Throttling. Possibly mumble mumble StatePipe? But that feels like a harder problem for the time being. Robert Burke On 2024/02/28 08:25:35 Jan Lukavský wrote: On

Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-28 Thread Jan Lukavský
On 2/27/24 19:49, Robert Bradshaw via dev wrote: On Tue, Feb 27, 2024 at 10:39 AM Jan Lukavský wrote: On 2/27/24 19:22, Robert Bradshaw via dev wrote: On Mon, Feb 26, 2024 at 11:45 AM Kenneth Knowles wrote: Pulling out focus points: On Fri, Feb 23, 2024 at 7:21 PM Robert Bradshaw via dev

Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-27 Thread Jan Lukavský
the event-time progress of individual partitions, so that partitions that are too ahead of time do not blow up downstream state. These might be related concepts. We'd need a discussion of what an SDK must do if the runner doesn't support the central clock for completeness, and co

Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-27 Thread Jan Lukavský
d be correct to not wait at least that long (even in batch inputs, e.g. suppose I'm tailing logs and was eagerly started before they were fully written, or waiting for some kind of (non-data-dependent) quiessence or other operation to finish). On Fri, Feb 23, 2024 at 12:36 AM Jan Luka

Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-27 Thread Jan Lukavský
d need a discussion of what an SDK must do if the runner doesn't support the central clock for completeness, and consistency. On Tue, Feb 27, 2024, 6:58 AM Jan Lukavský wrote: On 2/27/24 14:51, Kenneth Knowles wrote: I very much like the idea of processing time clock as a pa

Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-27 Thread Jan Lukavský
yes, I was just trying to restate the streaming processing time semantics in the limited batch case. Kenn On Tue, Feb 27, 2024 at 2:40 AM Jan Lukavský wrote: I think that before we introduce a possibly somewhat duplicate new feature we should be certain that it is really semantically

Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-26 Thread Jan Lukavský
uld > > like to check again) at some time in the processing-time future. I > > can't think of a batch or streaming scenario where it would be correct > > to not wait at least that long (even in batch inputs,

Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-23 Thread Jan Lukavský
For me it always helps to seek analogy in our physical reality. Stream processing actually has quite a good analogy for both event-time and processing-time - the simplest model for this being relativity theory. Event-time is the time at which events occur _at distant locations_. Due to finite a

Re: Pipeline upgrade to 2.55.0-SNAPSHOT broken for FlinkRunner

2024-02-21 Thread Jan Lukavský
t Serializable) could be left in their original packages for backwards compatibility reasons? On Wed, Feb 21, 2024 at 7:32 AM Jan Lukavský wrote: > > Hi, > > while implementing FlinkRunner for Flink 1.17 I tried to verif

Re: Throttle PTransform

2024-02-21 Thread Jan Lukavský
tate when the threshold is exceeded, but severely limiting the state size. However I wouldn't start here - we would want to build the simpler implementation first and see how it performs. +1 On Wed, Feb 21, 2024 at 8:53 AM Robert Bradshaw via dev wrote: On Wed, Feb 21, 2024 at 12:4

Re: Throttle PTransform

2024-02-21 Thread Jan Lukavský
On 2/21/24 17:52, Robert Bradshaw via dev wrote: On Wed, Feb 21, 2024 at 12:48 AM Jan Lukavský wrote: Hi, I have left a note regarding the proposed splitting of batch and streaming expansion of this transform. In general, a need for such split triggers doubts in me. This signals that either

Pipeline upgrade to 2.55.0-SNAPSHOT broken for FlinkRunner

2024-02-21 Thread Jan Lukavský
Hi, while implementing FlinkRunner for Flink 1.17 I tried to verify that a running Pipeline is able to successfully upgrade from Flink 1.16 to Flink 1.17. There is some change regarding serialization needed for Flink 1.17, so this was a concern. Unfortunately recently we merged core-construct

Re: Throttle PTransform

2024-02-21 Thread Jan Lukavský
Hi, I have left a note regarding the proposed splitting of batch and streaming expansion of this transform. In general, a need for such split triggers doubts in me. This signals that either  a) the transform does something is should not, or  b) Beam model is not complete in terms of being "u

Re: [ANNOUNCE] New Committer: Svetak Sundhar

2024-02-15 Thread Jan Lukavský
Congrats Svetak! On 2/14/24 16:11, Yi Hu via dev wrote: Congrats, Svetak! On Wed, Feb 14, 2024 at 9:50 AM John Casey via dev wrote: Congrats Svetak! On Wed, Feb 14, 2024 at 9:00 AM Ahmed Abualsaud wrote: Congrats Svetak! On 2024/02/14 02:05:02 Priyans Desai v

Re: [VOTE] Release 2.54.0, release candidate #2

2024-02-07 Thread Jan Lukavský
+1 (binding) Validated Java SDK with Flink runner.  Jan On 2/7/24 06:23, Robert Burke via dev wrote: Hi everyone, Please review and vote on the release candidate #2 for the version 2.54.0, as follows: [ ] +1, Approve the release [ ] -1, Do not approve the release (please provide specific comme

Re: [DESIGN PROPOSAL] Reshuffle Allowing Duplicates

2024-01-31 Thread Jan Lukavský
Hi, if I understand this proposal correctly, the motivation is actually reducing latency by bypassing bundle atomic guarantees, bundles after "at least once" Reshuffle would be reconstructed independently of the pre-shuffle bundling. Provided this is correct, it seems that the behavior is sli

Re: @RequiresTimeSortedInput adoption by runners

2024-01-20 Thread Jan Lukavský
is still useful for my Prism work, especially if I see any similar situations once I start on the Java Validates Runner suite. Robert Burke Beam Go Busybody On Fri, Jan 19, 2024, 6:41 AM Jan Lukavský wrote: I was primarily focused on Java SDK (and core-contruction-java), but generally

Re: @RequiresTimeSortedInput adoption by runners

2024-01-19 Thread Jan Lukavský
6]https://github.com/apache/beam/blob/b4c23b32f2b80ce052c8a235e5064c69f37df992/website/www/site/content/en/blog/beam-2.20.0.md?plain=1#L46 On 2024/01/18 16:14:56 Jan Lukavský wrote: Hi, recently I came across the fact that most runners do not support @RequiresTimeSortedInput annotation for sorting

@RequiresTimeSortedInput adoption by runners

2024-01-18 Thread Jan Lukavský
Hi, recently I came across the fact that most runners do not support @RequiresTimeSortedInput annotation for sorting per-key data by event timestamp [1]. Actually, runners supporting it seem to be Direct java, Flink and Dataflow batch (as it is a noop there). The annotation has use-cases in t

Re: [VOTE] Release 2.53.0, release candidate #2

2023-12-28 Thread Jan Lukavský
+1 (binding) Tested Java SDK with Flink Runner.  Jan On 12/27/23 14:13, Danny McCormick via dev wrote: +1 (non-binding) Tested with some example ML notebooks. Thanks, Danny On Tue, Dec 26, 2023 at 6:41 PM XQ Hu via dev wrote: +1 (non-binding) Tested with the simple RunInference p

Re: [VOTE] Release 2.52.0, release candidate #5

2023-11-15 Thread Jan Lukavský
+1 (binding) Validated Java SDK with Flink runner on own use cases.  Jan On 11/15/23 11:35, Jean-Baptiste Onofré wrote: +1 (binding) Quickly tested Java SDK and checked the legal part (hash, signatures, headers). Regards JB On Tue, Nov 14, 2023 at 12:06 AM Danny McCormick via dev wrote: H

Re: [VOTE] Release 2.52.0, release candidate #4

2023-11-13 Thread Jan Lukavský
+1 (binding) Validated Java SDK with Flink runner on own use cases.  Jan On 11/12/23 00:44, Danny McCormick via dev wrote: Hi everyone, Please review and vote on the release candidate #3 for the version 2.52.0, as follows: [ ] +1, Approve the release [ ] -1, Do not approve the release (pleas

Re: [VOTE] Release 2.52.0, release candidate #3

2023-11-09 Thread Jan Lukavský
+1 (binding) Validated Java SDK with Flink runner on own use cases.  Jan On 11/9/23 03:31, Danny McCormick via dev wrote: Hi everyone, Please review and vote on the release candidate #3 for the version 2.52.0, as follows: [ ] +1, Approve the release [ ] -1, Do not approve the release (please

Re: [VOTE] Release 2.52.0, release candidate #2

2023-11-08 Thread Jan Lukavský
+1 (binding) Validated Java SDK with Flink runner on own use cases.  Jan On 11/8/23 00:24, Danny McCormick via dev wrote: Hi everyone, Please review and vote on the release candidate #2 for the version 2.52.0, as follows: [ ] +1, Approve the release [ ] -1, Do not approve the release (please

[LAZY CONSENSUS] Deprecate Euphoria extension

2023-11-02 Thread Jan Lukavský
Hi, according to discussion [1], because no objections were raised and the overall usage (artifact download stats) is negligible compared to other Beam artifacts, I'll proceed with deprecating the Euphoria extension, unless there are any objections within 72 hours (excluding weekend). Best,

Re: Processing time watermarks in KinesisIO

2023-11-01 Thread Jan Lukavský
ll together, but let users specify it manually, provided they know the consequences. Jan [1] https://issues.apache.org/jira/browse/BEAM-591 On 10/31/23 21:36, Robert Bradshaw via dev wrote: On Tue, Oct 31, 2023 at 10:28 AM Jan Lukavský wrote: On 10/31/23 17:44, Robert Bradshaw via dev wrote:

Re: Processing time watermarks in KinesisIO

2023-10-31 Thread Jan Lukavský
bit broken if I understand correctly. +1 On Tue, Oct 31, 2023 at 1:16 AM Jan Lukavský wrote: I think that instead of deprecating and creating new version, we could leverage the proposed update compatibility flag for this [1]. I still have some doubts if the processing-time watermarking (and

Re: Processing time watermarks in KinesisIO

2023-10-31 Thread Jan Lukavský
of deprecated /“org.apache.beam.sdk.io.kinesis.KinesisIO”/ one. — Alexey On 27 Oct 2023, at 17:42, Jan Lukavský wrote: No, I'm referring to this [1] policy which has unexpected (and hardly avoidable on the user-code side) data loss issues. The problem is that assigning timest

Re: Processing time watermarks in KinesisIO

2023-10-27 Thread Jan Lukavský
apache.org/releases/javadoc/current/org/apache/beam/sdk/io/kinesis/KinesisIO.Read.html#withProcessingTimeWatermarkPolicy-- On 10/27/23 16:51, Alexey Romanenko wrote: Why not just to create a custom watermark policy for that? Or you mean to make it as a default policy? — Alexey On 27 Oct 2023, at

Processing time watermarks in KinesisIO

2023-10-27 Thread Jan Lukavský
Hi, when discussing about [1] we found out, that the issue is actually caused by processing time watermarks in KinesisIO. Enabling this watermark outputs watermarks based on current processing time, _but event timestamps are derived from ingestion timestamp_. This can cause unbounded lateness

Re: Reshuffle PTransform Design Doc

2023-10-20 Thread Jan Lukavský
7;t targeting to have one Beam primitive for each thing that is probably a runner primitive. On Thu, Oct 19, 2023 at 2:25 PM Kenneth Knowles wrote: On Fri, Oct 13, 2023 at 12:51 PM Jan Lukavský wrote: Hi, I think there's been already said nearly everything in this thread, but ... it is time

Re: [YAML] Aggregations

2023-10-19 Thread Jan Lukavský
On 10/19/23 19:41, Robert Bradshaw via dev wrote: On Thu, Oct 19, 2023 at 10:25 AM Jan Lukavský wrote: On 10/19/23 18:28, Robert Bradshaw via dev wrote: On Thu, Oct 19, 2023 at 9:00 AM Byron Ellis wrote: Rill is definitely SQL-oriented but I think that's going to be the most c

Re: [YAML] Aggregations

2023-10-19 Thread Jan Lukavský
On 10/19/23 18:28, Robert Bradshaw via dev wrote: On Thu, Oct 19, 2023 at 9:00 AM Byron Ellis wrote: Rill is definitely SQL-oriented but I think that's going to be the most common. Dataframes are explicitly modeled on the relational approach so that's going to look a lot like SQL, I think pr

Re: KafkaIO does not make use of Kafka Consumer Groups [kafka] [java] [io]

2023-10-18 Thread Jan Lukavský
Hi, my two cents on this. While it would perfectly possible to use consumer group in KafkaIO, it has its own issues. The most visible would be, that using subscriptions might introduce unnecessary duplicates in downstream processing. The reason for this is that consumer in a consumer group mi

Re: [ANNOUNCE] New Committer: Sam Whittle

2023-10-16 Thread Jan Lukavský
Congrats Sam! On 10/16/23 22:34, Austin Bennett wrote: Thanks, Sam! On Mon, Oct 16, 2023 at 12:39 PM XQ Hu via dev wrote: Congratulations! On Mon, Oct 16, 2023 at 1:58 PM Ahmet Altay via dev wrote: Congratulations Sam! On Mon, Oct 16, 2023 at 10:42 AM Byron E

Re: [ANNOUNCE] New Committer: Byron Ellis

2023-10-16 Thread Jan Lukavský
Congrats Byron! On 10/16/23 22:33, Austin Bennett wrote: thanks, Byron! On Mon, Oct 16, 2023 at 12:38 PM XQ Hu via dev wrote: Congratulations! On Mon, Oct 16, 2023 at 1:58 PM Ahmet Altay via dev wrote: Congratulations Byron! On Mon, Oct 16, 2023 at 10:35 AM T

Re: [DISCUSS] Drop Euphoria extension

2023-10-16 Thread Jan Lukavský
Jan On 10/16/23 15:10, Alexey Romanenko wrote: Can we just deprecate it for a while and then remove completely? — Alexey On 13 Oct 2023, at 18:59, Jan Lukavský wrote: Hi, it has been some time since Euphoria extension [1] has been adopted by Beam as a possible "Java 8 API". Beam

[DISCUSS] Drop Euphoria extension

2023-10-13 Thread Jan Lukavský
Hi, it has been some time since Euphoria extension [1] has been adopted by Beam as a possible "Java 8 API". Beam has evolved from that time a lot, the current API seems actually more elegant than the original Euphoria's and last but not least, it has no maintainers and no known users. If ther

Re: Reshuffle PTransform Design Doc

2023-10-13 Thread Jan Lukavský
e: On Fri, Oct 6, 2023 at 3:07 PM Jan Lukavský wrote: On 10/6/23 15:11, Kenneth Knowles wrote: On Fri, Oct 6, 2023 at 3:20 AM Jan Lukavský wrote: Hi, there is also one other thing to mention with relation to Reshuffle/RequiresStableinpu

Re: Reshuffle PTransform Design Doc

2023-10-06 Thread Jan Lukavský
On 10/6/23 15:11, Kenneth Knowles wrote: On Fri, Oct 6, 2023 at 3:20 AM Jan Lukavský wrote: Hi, there is also one other thing to mention with relation to Reshuffle/RequiresStableinput and that is that our current implementation of RequiresStableInput can break without

Re: Reshuffle PTransform Design Doc

2023-10-06 Thread Jan Lukavský
Hi, there is also one other thing to mention with relation to Reshuffle/RequiresStableinput and that is that our current implementation of RequiresStableInput can break without Reshuffle in some corner cases on most portable runners, at least with Java GreedyPipelineFuser, see [1]. The only w

Re: [VOTE] Release 2.51.0, release candidate #1

2023-10-05 Thread Jan Lukavský
+1 (binding) Tested Java SDK with Flink Runner on own test-cases.  Jan On 10/4/23 21:10, Bruno Volpato via dev wrote: +1 (non-binding). Tested with https://github.com/GoogleCloudPlatform/DataflowTemplates (Java SDK 11, Dataflow Runner using both legacy and v2). Thanks Kenn! On Wed, Oct 4,

Re: [ANNOUNCE] New PMC Member: Robert Burke

2023-10-04 Thread Jan Lukavský
Congrats Robert! On 10/4/23 10:29, Alexey Romanenko wrote: Congrats Robert, very well deserved! — Alexey On 4 Oct 2023, at 00:39, Austin Bennett wrote: Thanks for all you do @Robert Burke  ! On Tue, Oct 3, 2023 at 12:53 PM Ahmed Abualsaud wrote: Congrats

Re: [ANNOUNCE] New PMC Member: Valentyn Tymofieiev

2023-10-04 Thread Jan Lukavský
Congrats Valentyn! On 10/4/23 10:26, Alexey Romanenko wrote: Congrats Valentyn, very well deserved! — Alexey On 4 Oct 2023, at 00:39, Austin Bennett wrote: Thanks for everything @Valentyn Tymofieiev  ! On Tue, Oct 3, 2023 at 12:53 PM Ahmed Abualsaud wrote:

Re: [ANNOUNCE] New PMC Member: Alex Van Boxel

2023-10-04 Thread Jan Lukavský
Congrats Alex! On 10/4/23 10:29, Alexey Romanenko wrote: Congrats Alex, very well deserved! — Alexey On 4 Oct 2023, at 00:38, Austin Bennett wrote: Thanks for all you do, @Alex Van Boxel  ! On Tue, Oct 3, 2023 at 12:50 PM Ahmed Abualsaud via dev wrote:

Re: Runner Bundling Strategies

2023-09-27 Thread Jan Lukavský
ost whenever bundles retry. On Wed, Sep 27, 2023 at 11:20 AM Jan Lukavský wrote: What is the reason to rely on StartBundle and not Setup in this case? If the life-cycle of bundle is not "closed" (i.e. start - finish), then it seems to be ill defined and Setup should do?

Re: Runner Bundling Strategies

2023-09-27 Thread Jan Lukavský
new restriction on the runner. Technically probably have to include @StartBundle in that consideration. Kenn On Tue, Sep 26, 2023 at 8:54 AM Kenneth Knowles wrote: On Mon, Sep 25, 2023 at 1:19 PM Jan Lukavský wrote:

Re: Runner Bundling Strategies

2023-09-25 Thread Jan Lukavský
eam/issues/28650 On 9/25/23 18:31, Reuven Lax via dev wrote: On Mon, Sep 25, 2023 at 6:19 AM Jan Lukavský wrote: On 9/23/23 18:16, Reuven Lax via dev wrote: Two separate things here: 1. Yes, a watermark can update in the middle of a bundle. 2. The records in the bundle thems

Re: Runner Bundling Strategies

2023-09-25 Thread Jan Lukavský
age will delay watermark propagation until a checkpoint (which is typically the order of seconds). This delay would add up after each stage. Reuven On Sat, Sep 23, 2023 at 12:03 AM Jan Lukavský wrote: > Watermarks shouldn't be (visibly) advanced until @FinishBundle is commit

Re: Runner Bundling Strategies

2023-09-23 Thread Jan Lukavský
ndling watermarks is runner-dependent (e.g. Flink does not store watermarks in checkpoints, they are always recomputed from scratch on restore). [1] https://lists.apache.org/thread/10db7l9bhnhmo484myps723sfxtjwwmv On 9/22/23 21:47, Robert Bradshaw via dev wrote: On Fri, Sep 22, 2023 at 10:

Re: Runner Bundling Strategies

2023-09-22 Thread Jan Lukavský
On 9/22/23 18:07, Robert Bradshaw via dev wrote: On Fri, Sep 22, 2023 at 7:23 AM Byron Ellis via dev wrote: I've actually wondered about this specifically for streaming... if you're writing a pipeline there it seems like you're often going to want to put high fixed cost things lik

Re: Runner Bundling Strategies

2023-09-22 Thread Jan Lukavský
basically) On Fri, Sep 22, 2023 at 5:09 AM Jan Lukavský wrote: Flink defines bundles in terms of number of elements and processing time, by default 1000 elements or 1000 milliseconds, whatever happens first. But bundles are not a "natural" concept in Flink, it uses

  1   2   3   4   5   6   >