Re: [Question] Redirect Write Failures to Dead Letter Queue

2025-04-01 Thread Robert Bradshaw via user
> On Tue, Apr 1, 2025 at 11:53 AM Robert Bradshaw via user > wrote: >> >> Good question. I think it depends on who else is modifying the SQL database. >> >> In the easy case (e.g. everything you want to write to your SQL >> database comes from the NoSQL source) you cou

Re: [Question] Redirect Write Failures to Dead Letter Queue

2025-04-01 Thread Robert Bradshaw via user
Good question. I think it depends on who else is modifying the SQL database. In the easy case (e.g. everything you want to write to your SQL database comes from the NoSQL source) you could group (e.g. via a GroupByKey) on your identifier, filter out duplicates with a subsequent DoFn, and then writ

Re: Specifying java dependencies for multi-language Beam pipeline (using java transforms from python)

2025-03-26 Thread Robert Bradshaw via user
Creating your own jar with updated dependencies is the right solution, to ensure it gets used you may need to set Python's --beam_services pipeline option, e.g. --beam_services='{"sdks:java:io:expansion-service:shadowJar": "/path/to/your.jar"}' See https://github.com/apache/beam/blob/release-2.4

Re: [Discussion] Deprecate ZetaSQL

2025-03-25 Thread Robert Bradshaw via user
I'm in favor of deprecating this and cleaning it up, but it depends on usage. I suspect it is low (or possibly non-existent, especially as there's little upside to moving away from the default). I cc'd user@ just in case anyone wants to chime in there. This may be a good thing to add to our release

Re: Non-time based windowing

2025-02-07 Thread Robert Bradshaw via user
ed. E.g. https://github.com/apache/beam/blob/release-2.51.0/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L1009 was made to handle this scenario. I'm curious how your case is different than https://github.com/apache/beam/blob/release-2.60.0/sdks/

Re: Non-time based windowing

2025-02-05 Thread Robert Bradshaw via user
Interestingly, the very first prototypes of windows were actually called buckets, and we thought of applications like geographical grouping and such in addition to time-based aggregations. For streaming, however, event time in special in the sense that it's required for aggregation and omnipresent,

Re: How windowing is implemented on Flink runner

2024-06-12 Thread Robert Bradshaw via user
Beam implements Windowing itself (via state and timers) rather than deferring to Flink's implementation. On Wed, Jun 12, 2024 at 11:55 AM Ruben Vargas wrote: > > Hello guys > > May be a silly question, > > But in the Flink runner, the window implementation uses the Flink > windowing? Does that me

Re: Paralalelism of a side input

2024-06-12 Thread Robert Bradshaw via user
uming resources in some way? I'm assuming may be is not > significant. That is correct, but the resources consumed by an idle operator should be negligible. > Thanks. > > El El vie, 7 de jun de 2024 a la(s) 3:56 p.m., Robert Bradshaw via user > escribió: >> >> Y

Re: Paralalelism of a side input

2024-06-07 Thread Robert Bradshaw via user
You can always limit the parallelism by assigning a single key to every element and then doing a grouping or reshuffle[1] on that key before processing the elements. Even if the operator parallelism for that step is technically, say, eight, your effective parallelism will be exactly one. [1] http

Re: Any recomendation for key for GroupIntoBatches

2024-04-15 Thread Robert Bradshaw via user
On Fri, Apr 12, 2024 at 1:39 PM Ruben Vargas wrote: > On Fri, Apr 12, 2024 at 2:17 PM Jaehyeon Kim wrote: > > > > Here is an example from a book that I'm reading now and it may be > applicable. > > > > JAVA - (id.hashCode() & Integer.MAX_VALUE) % 100 > > PYTHON - ord(id[0]) % 100 > or abs(hash(

Re: Hot update in dataflow without lossing messages

2024-04-15 Thread Robert Bradshaw via user
Are you draining[1] your pipeline or simply canceling it and starting a new one? Draining should close open windows and attempt to flush all in-flight data before shutting down. For PubSub you may also need to read from subscriptions rather than topics to ensure messages are processed by either one

Re: Fails to run two multi-language pipelines locally?

2024-03-08 Thread Robert Bradshaw via user
it begins >> to create an error. My speculation is the containers don't recognise each >> other and get killed by the Flink task manager. I see containers are kept >> created and killed. >> >> Does every multi-language pipeline runs in a separate container? >>

Re: [Question] Python Streaming Pipeline Support

2024-03-08 Thread Robert Bradshaw via user
The Python Local Runner has limited support for streaming pipelines. For the time being would recommend using Dataflow or Flink (the latter can be run locally) to try out streaming pipelines. On Fri, Mar 8, 2024 at 2:11 PM Puertos tavares, Jose J (Canada) via user wrote: > > Hello Hu: > > > > No

Re: Fails to run two multi-language pipelines locally?

2024-03-06 Thread Robert Bradshaw via user
t the flink runner. A flink cluster > is started locally. > > On Thu, 7 Mar 2024 at 12:13, Robert Bradshaw via user > wrote: >> >> Streaming portable pipelines are not yet supported on the Python local >> runner. >> >> On Wed, Mar 6, 2024 at 5:03 PM Jaeh

Re: Fails to run two multi-language pipelines locally?

2024-03-06 Thread Robert Bradshaw via user
Streaming portable pipelines are not yet supported on the Python local runner. On Wed, Mar 6, 2024 at 5:03 PM Jaehyeon Kim wrote: > > Hello, > > I use the python SDK and my pipeline reads messages from Kafka and transforms > via SQL. I see two containers are created but it seems that they don't

Re: Roadmap of Calcite support on Beam SQL?

2024-03-04 Thread Robert Bradshaw via user
There is no longer a huge amount of active development going on here, but implementing a missing function seems like an easy contribution (lots of examples to follow). Otherwise, definitely worth filing a feature request as a useful signal for prioritization. On Mon, Mar 4, 2024 at 4:33 PM Jaehyeo

Re: ParDo(DoFn) with multiple context.output vs FlatMapElements

2024-01-26 Thread Robert Bradshaw via user
There is no difference; FlatMapElements is implemented in terms of a DoFn that invokes context.output multiple times. And, yes, Dataflow will fuse consecutive operations automatically. So if you have something like ... -> DoFnA -> DoFnB -> GBK -> DoFnC -> ... Dataflow will fuse DoFnA and DoFnB to

Re: Downloading and executing addition jar file when using Python API

2024-01-24 Thread Robert Bradshaw via user
On Wed, Jan 24, 2024 at 10:48 AM Mark Striebeck wrote: > > If point beam to the local jar, will beam start and also stop the expansion > service? Yes it will. > Thanks > Mark > > On Wed, 24 Jan 2024 at 08:30, Robert Bradshaw via user > wrote: >> >&g

Re: Downloading and executing addition jar file when using Python API

2024-01-24 Thread Robert Bradshaw via user
You can also manually designate a replacement jar to be used rather than fetching the jar from maven, either as a pipeline option or (as of the next release) as an environment variable. The format is a json mapping from gradle targets (which is how we identify these jars) to local files (or urls).

Re: TypeError: '_ConcatSequence' object is not subscriptable

2024-01-22 Thread Robert Bradshaw via user
This is probably because you're trying to index into the result of the GroupByKey in your AnalyzeSession as if it were a list. All that is promised is that it is an iterable. If it is large enough to merit splitting over multiple fetches, it won't be a list. (If you need to index, explicitly conver

Re: Does withkeys transform enforce a reshuffle?

2024-01-19 Thread Robert Bradshaw via user
Reshuffle is perfectly fine to use if the goal is just to redistribute work. It's only deprecated as a "checkpointing" mechanism. On Fri, Jan 19, 2024 at 9:44 AM Danny McCormick via user wrote: > > For runners that support Reshuffle, it should be safe to use. Its been > "deprecated" for 7 years,

Re: How to debug ArtifactStagingService ?

2024-01-05 Thread Robert Bradshaw via user
Nothing problematic is standing out for me in those logs. A job service and artifact staging service is spun up to allow the job (and its artifacts) to be submitted, then they are shut down. What are the actual errors that you are seeing? On Wed, Jan 3, 2024 at 7:39 AM Lydian wrote: > > > Hi, > >

Re: Dataflow not able to find a module specified using extra_package

2023-12-19 Thread Robert Bradshaw via user
And should it be a list of strings, rather than a string? On Tue, Dec 19, 2023 at 10:10 AM Anand Inguva via user wrote: > Can you try passing `extra_packages` instead of `extra_package` when > passing pipeline options as a dict? > > On Tue, Dec 19, 2023 at 12:26 PM Sumit Desai via user < > user@

Re: Streaming management exception in the sink target.

2023-12-05 Thread Robert Bradshaw via user
Currently error handling is implemented on sinks in an ad-hoc basis (if at all) but John (cc'd) is looking at improving things here. On Mon, Dec 4, 2023 at 10:25 AM Juan Romero wrote: > > Hi guys. I want to ask you about how to deal with the scenario when the > target sink (eg: jdbc, kafka, bigq

Re: [QUESTION] Why no auto labels?

2023-10-20 Thread Robert Bradshaw via user
up a PR >>>>>>>> >>>>>>>> On Thu, Oct 5, 2023, 12:49 PM Joey Tran >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Is it really toggleable in Java? I imagine that if it's a tog

Re: Advanced Composite Transform Documentation

2023-10-19 Thread Robert Bradshaw via user
On Thu, Oct 19, 2023 at 2:00 PM Joey Tran wrote: > > For the python SDK, is there somewhere where we document more "advance" > composite transform operations? I'm not sure, but https://beam.apache.org/documentation/programming-guide/ is the canonical palace information like this should probaby b

Re: [QUESTION] Why no auto labels?

2023-10-13 Thread Robert Bradshaw via user
ut this probably ties >>>>>>> into re-thinking how pipeline update should work.) >>>>>>> >>>>>>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran >>>>>>> wrote: >>>>>>> >>>>>>>> Makes sense tha

Re: [QUESTION] Why no auto labels?

2023-10-13 Thread Robert Bradshaw via user
d examples don't >>>>>> have a way to see the outputs of a run example? I have a vague memory >>>>>> that >>>>>> there used to be a way to navigate to an output file after it's generated >>>>>> but not sure if I just dreamt

Re: [QUESTION] Why no auto labels?

2023-10-10 Thread Robert Bradshaw via user
example? I have a vague memory that >>>> there used to be a way to navigate to an output file after it's generated >>>> but not sure if I just dreamt that up. Playing with the examples, I wasn't >>>> positive if my runs were actually succeeding or not base

Re: [QUESTION] Why no auto labels?

2023-10-05 Thread Robert Bradshaw via user
] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW > > On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user < > user@beam.apache.org> wrote: > >> BeamJava and BeamPython have the exact same behavior: transform names >> within must be distinct [1]. This

Re: [QUESTION] Why no auto labels?

2023-10-04 Thread Robert Bradshaw via user
BeamJava and BeamPython have the exact same behavior: transform names within must be distinct [1]. This is because we do not necessarily know at pipeline construction time if the pipeline will be streaming or batch, or if it will be updated in the future, so the decision was made to impose this res

Re: UDF/UADF over complex structures

2023-09-28 Thread Robert Bradshaw via user
Yes, for sure. This is one of the areas Beam excels vs. more simple tools like SQL. You can write arbitrary code to iterate over arbitrary structures in the typical Java/Python/Go/Typescript/Scala/[pick your language] way. In the Beam nomenclature. UDFs correspond to DoFns and UDAFs correspond to C

Re: [Question] Side Input pattern

2023-09-15 Thread Robert Bradshaw via user
> > > Is that assumption correct? > > > > El El vie, 15 de septiembre de 2023 a la(s) 10:59, Robert Bradshaw via > user escribió: > >> Beam will block on side inputs until at least one value is available (or >> the watermark has advanced such that we can be sur

Re: [Question] Side Input pattern

2023-09-15 Thread Robert Bradshaw via user
Beam will block on side inputs until at least one value is available (or the watermark has advanced such that we can be sure one will never become available, which doesn't really apply to the global window case). After that, workers generally cache the side input value (for performance reasons) but

Re: "Decorator" pattern for PTramsforms

2023-09-15 Thread Robert Bradshaw via user
rDo, extract it's >> dofn, wrap it, and return a new ParDo >> >> On Fri, Sep 15, 2023, 11:53 AM Robert Bradshaw via user < >> user@beam.apache.org> wrote: >> >>> +1 to looking at composite transforms. You could even have a composite >>> trans

Re: "Decorator" pattern for PTramsforms

2023-09-15 Thread Robert Bradshaw via user
+1 to looking at composite transforms. You could even have a composite transform that takes another transform as one of its construction arguments and whose expand method does pre- and post-processing to the inputs/outputs before/after applying the transform in question. (You could even implement t

Re: Options for visualizing the pipeline DAG

2023-09-01 Thread Robert Bradshaw via user
(As an aside, I think all of these options would make for a great blog post if anyone is interested in authoring one of those...) On Fri, Sep 1, 2023 at 9:26 AM Robert Bradshaw wrote: > You can also use Python's RenderRunner, e.g. > > python -m apache_beam.examples.wordcount --output out.txt \

Re: Options for visualizing the pipeline DAG

2023-09-01 Thread Robert Bradshaw via user
You can also use Python's RenderRunner, e.g. python -m apache_beam.examples.wordcount --output out.txt \ --runner=apache_beam.runners.render.RenderRunner \ --render_output=pipeline.svg This also has an interactive mode, triggered by passing --port=N (where 0 can be used to pick an unuse

Re: [Request for Feedback] Swift SDK Prototype

2023-08-24 Thread Robert Bradshaw via user
On Thu, Aug 24, 2023 at 12:58 PM Chamikara Jayalath wrote: > > > On Thu, Aug 24, 2023 at 12:27 PM Robert Bradshaw > wrote: > >> I would like to figure out a way to get the stream-y interface to work, >> as I think it's more natural overall. >> >> One hypothesis is that if any elements are carrie

Re: [Request for Feedback] Swift SDK Prototype

2023-08-24 Thread Robert Bradshaw via user
I would like to figure out a way to get the stream-y interface to work, as I think it's more natural overall. One hypothesis is that if any elements are carried over loop iterations, there will likely be some that are carried over beyond the loop (after all the callee doesn't know when the loop is

Re: [Request for Feedback] Swift SDK Prototype

2023-08-24 Thread Robert Bradshaw via user
Ah, I see. Yeah, I've thought about using an iterable for the whole bundle rather than start/finish bundle callbacks, but one of the questions is how that would impact implicit passing of the timestamp (and other) metadata from input elements to output elements. (You can of course attach the metad

Re: [Request for Feedback] Swift SDK Prototype

2023-08-23 Thread Robert Bradshaw via user
Neat. Nothing like writing and SDK to actually understand how the FnAPI works :). I like the use of groupBy. I have to admit I'm a bit mystified by the syntax for parDo (I don't know swift at all which is probably tripping me up). The addition of external (cross-language) transforms could let you

Re: Getting Started With Implementing a Runner

2023-07-24 Thread Robert Bradshaw via user
ain for the help, >>>>> Joey >>>>> >>>>> On Fri, Jun 23, 2023 at 8:34 PM Chamikara Jayalath >>>>> wrote: >>>>>> >>>>>> Another advantage of a portable runner would be that it will be using >

Re: Growing checkpoint size with Python SDF for reading from Redis streams

2023-07-20 Thread Robert Bradshaw via user
Your SDF looks fine. I wonder if there is an issue with how Flink is implementing SDFs (e.g. not garbage collecting previous remainders). On Tue, Jul 18, 2023 at 5:43 PM Nimalan Mahendran wrote: > > Hello, > > I am running a pipeline built in the Python SDK that reads from a Redis > stream via a

Re: Getting Started With Implementing a Runner

2023-07-14 Thread Robert Bradshaw via user
ing >>>> well defined and backwards compatible Beam portable APIs to communicate >>>> with SDKs. I think this is specially important for runners that do not live >>>> in the Beam repo since otherwise future SDK releases could break your >>>> runner in subtle

Re: Pandas 2 Timeline Estimate

2023-07-12 Thread Robert Bradshaw via user
Contributions welcome! I don't think we're at the point we can stop supporting Pandas 1.x though, so we'd have to do it in such a way as to support both. On Wed, Jul 12, 2023 at 4:53 PM XQ Hu via user wrote: > https://github.com/apache/beam/issues/27221#issuecomment-1603626880 > > This tracks th

Re: Getting Started With Implementing a Runner

2023-07-10 Thread Robert Bradshaw via user
Also portability gives you more flexibility when it >> comes to choosing an SDK to define the pipeline and will allow you to >> execute transforms in any SDK via cross-language. >> >> Thanks, >> Cham >> >> On Fri, Jun 23, 2023 at 1:57 PM Robert Bradshaw via

Re: Getting Started With Implementing a Runner

2023-06-23 Thread Robert Bradshaw via user
ly >> implementing a distributed shuffle, starting and assigning work to multiple >> workers, etc. but presumably that's the kind of thing your execution >> environment already takes care of. >> >> As for some more concrete pointers, you could probably leverage a l

Re: Getting Started With Implementing a Runner

2023-06-23 Thread Robert Bradshaw via user
ur compute infrastructure. (If you're not doing streaming, this is much more straightforward than all the bundler scheduler stuff that currently exists in that code). > > > > > > On Fri, Jun 23, 2023 at 12:17 PM Alexey Romanenko < > aromanenko@gmail.com> wrote: > &

Re: Getting Started With Implementing a Runner

2023-06-23 Thread Robert Bradshaw via user
On Fri, Jun 23, 2023, 7:37 AM Alexey Romanenko wrote: > If Beam Runner Authoring Guide is rather high-level for you, then, at > fist, I’d suggest to answer two questions for yourself: > - Am I going to implement a portable runner or native one? > The answer to this should be portable, as non-por

Re: [Dataflow][Stateful] Bypass Dataflow Overrides?

2023-05-25 Thread Robert Bradshaw via user
> [1] https://gist.github.com/egalpin/162a04b896dc7be1d0899acf17e676b3 > > On Thu, May 25, 2023 at 2:25 PM Robert Bradshaw via user < > user@beam.apache.org> wrote: > >> The GbkBeforeStatefulParDo is an implementation detail used to send all >> elements with the same key to the same

Re: [Dataflow][Stateful] Bypass Dataflow Overrides?

2023-05-25 Thread Robert Bradshaw via user
The GbkBeforeStatefulParDo is an implementation detail used to send all elements with the same key to the same worker (so that they can share state, which is itself partitioned by worker). This does cause a global barrier in batch pipelines. On Thu, May 25, 2023 at 2:15 PM Evan Galpin wrote: > H

Re: [EXTERNAL] Re: Vulnerabilities in Transitive dependencies

2023-05-02 Thread Robert Bradshaw via user
Generally these types of vulnerabilities are only exploitable when processing untrusted data and/or exposing a public service to the internet. This is not the typical use of Beam (especially the latter), but that's not to say Beam can't be used in this way. That being said, it's preferable to simpl

Re: How Beam Pipeline Handle late events

2023-04-24 Thread Robert Bradshaw via user
On Fri, Apr 21, 2023 at 3:37 AM Pavel Solomin wrote: > > Thank you for the information. > > I'm assuming you had a unique ID in records, and you observed some IDs > missing in Beam output comparing with Spark, and not just some duplicates > produced by Spark. > > If so, I would suggest to create

Re: [Question] - Time series - cumulative sum in right order with python api in a batch process

2023-04-24 Thread Robert Bradshaw via user
You are correct in that the data may arrive in an unordered way. However, once a window finishes, you are guaranteed to have seen all the data up to that point (modulo late data) and can then confidently compute your ordered cumulative sum. You could do something like this: def cumulative_sums(ke

Re: Avoid using docker when I use a external transformation

2023-04-18 Thread Robert Bradshaw via user
Docker is not necessary to expand the transform (indeed, by default it should just pull the Jar and invokes that directly to start the expansion service), but it is used as the environment in which to execute the expanded transform. It would be in theory possible to run the worker without docker a

Re: Is there any way to set the parallelism of operators like group by, join?

2023-04-18 Thread Robert Bradshaw via user
;>>> On Mon, Apr 17, 2023 at 8:08 AM Reuven Lax wrote: >>>>> >>>>>> Are you running on the Dataflow runner? If so, Dataflow - unlike >>>>>> Spark and Flink - dynamically modifies the parallelism as the operator >>>>>> runs, so there

Re: Is there any way to set the parallelism of operators like group by, join?

2023-04-15 Thread Robert Bradshaw via user
What are you trying to achieve by setting the parallelism? On Sat, Apr 15, 2023 at 5:13 PM Jeff Zhang wrote: > Thanks Reuven, what I mean is to set the parallelism in operator level. > And the input size of the operator is unknown at compiling stage if it is > not a source > operator, > > Here'

Re: Message guarantees

2023-04-14 Thread Robert Bradshaw via user
That is correct. On Tue, Apr 11, 2023 at 5:44 AM Hans Hartmann wrote: > > Hello, > > i'm wondering if Apache Beam is using the message guarantees of the > execution engines, that the pipeline is running on. > > So if i use the SparkRunner the consistency guarantees are exactly-once? > > Have a g

Re: Why is FlatMap different from composing Flatten and Map?

2023-03-15 Thread Robert Bradshaw via user
On Mon, Mar 13, 2023 at 11:33 AM Godefroy Clair wrote: > Hi, > I am wondering about the way `Flatten()` and `FlatMap()` are implemented > in Apache Beam Python. > In most functional languages, FlatMap() is the same as composing > `Flatten()` and `Map()` as indicated by the name, so Flatten() and

Re: Deduplicate usage

2023-03-02 Thread Robert Bradshaw via user
Whenever state is used, the runner will arrange such that the same keys will all go to the same worker, which often involves injecting a shuffle-like operation if the keys are spread out among many workers in the input. (An alternative implementation could involve storing the state in a distributed

Re: OpenJDK8 / OpenJDK11 container deprecation

2023-02-07 Thread Robert Bradshaw via user
Seams reasonable to me. On Tue, Feb 7, 2023 at 4:19 PM Luke Cwik via user wrote: > > As per [1], the JDK8 and JDK11 containers that Apache Beam uses have stopped > being built and supported since July 2022. I have filed [2] to track the > resolution of this issue. > > Based upon [1], almost eve

Re: How to submit beam python pipeline to GKE flink cluster

2023-02-03 Thread Robert Bradshaw via user
You should be able to omit the environment_type and environment_config variables and they will be populated automatically. For running locally, the flink_master parameter is not needed either (one will be started up automatically). On Fri, Feb 3, 2023 at 12:51 PM Talat Uyarer via user wrote: > >

Re: Dataflow and mounting large data sets

2023-01-30 Thread Robert Bradshaw via user
I'm also not sure it's part of the contract that the containerization technology we use will always have these capabilities. On Mon, Jan 30, 2023 at 10:53 AM Chad Dombrova wrote: > > Hi Valentyn, > >> >> Beam SDK docker containers on Dataflow VMs are currently launched in >> privileged mode. > >

Re: Dataflow and mounting large data sets

2023-01-30 Thread Robert Bradshaw via user
Different idea: is it possible to serve this data via another protocol (e.g. sftp) rather than requiring a mount? On Mon, Jan 30, 2023 at 9:26 AM Chad Dombrova wrote: > > Hi Robert, > I know very little about the FileSystem classes, but I don’t think it’s > possible for a process running in dock

Re: Dataflow and mounting large data sets

2023-01-30 Thread Robert Bradshaw via user
If it's your input/output data, presumably you could implement a https://beam.apache.org/releases/javadoc/2.3.0/org/apache/beam/sdk/io/FileSystem.html for nfs. (I don't know what all that would entail...) On Mon, Jan 30, 2023 at 9:04 AM Chad Dombrova wrote: > > Hi Israel, > Thanks for responding.