> On Tue, Apr 1, 2025 at 11:53 AM Robert Bradshaw via user
> wrote:
Good question. I think it depends on who else is modifying the SQL database.
In the easy case (e.g. everything you want to write to your SQL
database comes from the NoSQL source) you could group (e.g. via a
GroupByKey) on your identifier, filter out duplicates with a
subsequent DoFn, and then write the deduplicated results.
Creating your own jar with updated dependencies is the right solution.
To ensure it gets used, you may need to set Python's --beam_services
pipeline option, e.g.
--beam_services='{"sdks:java:io:expansion-service:shadowJar":
"/path/to/your.jar"}'
See
https://github.com/apache/beam/blob/release-2.4
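The option's value is a JSON object mapping Gradle targets to replacement jars. A small sketch of building the flag programmatically; the jar path is a placeholder:

```python
import json

# The flag's value is a JSON map from the Gradle target that identifies
# the expansion-service jar to a local replacement; the path is a placeholder.
replacement = {
    "sdks:java:io:expansion-service:shadowJar": "/path/to/your.jar",
}
beam_services_flag = "--beam_services=" + json.dumps(replacement)
print(beam_services_flag)
```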
I'm in favor of deprecating this and cleaning it up, but it depends on
usage. I suspect it is low (or possibly non-existent, especially as there's
little upside to moving away from the default). I cc'd user@ just in case
anyone wants to chime in there. This may be a good thing to add to our
release
ed. E.g.
https://github.com/apache/beam/blob/release-2.51.0/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L1009
was
made to handle this scenario.
I'm curious how your case is different than
https://github.com/apache/beam/blob/release-2.60.0/sdks/
Interestingly, the very first prototypes of windows were actually called
buckets, and we thought of applications like geographical grouping and such
in addition to time-based aggregations. For streaming, however, event time
is special in the sense that it's required for aggregation and omnipresent,
Beam implements Windowing itself (via state and timers) rather than
deferring to Flink's implementation.
On Wed, Jun 12, 2024 at 11:55 AM Ruben Vargas wrote:
>
> Hello guys
>
> May be a silly question,
>
> But in the Flink runner, the window implementation uses the Flink
> windowing? Does that me
uming resources in some way? I'm assuming it may not be
> significant.
That is correct, but the resources consumed by an idle operator should
be negligible.
> Thanks.
>
> El El vie, 7 de jun de 2024 a la(s) 3:56 p.m., Robert Bradshaw via user
> escribió:
You can always limit the parallelism by assigning a single key to
every element and then doing a grouping or reshuffle[1] on that key
before processing the elements. Even if the operator parallelism for
that step is technically, say, eight, your effective parallelism will
be exactly one.
[1]
http
On Fri, Apr 12, 2024 at 1:39 PM Ruben Vargas
wrote:
> On Fri, Apr 12, 2024 at 2:17 PM Jaehyeon Kim wrote:
> >
> > Here is an example from a book that I'm reading now and it may be
> applicable.
> >
> > JAVA - (id.hashCode() & Integer.MAX_VALUE) % 100
> > PYTHON - ord(id[0]) % 100
>
or abs(hash(
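Those one-liners, sketched as plain Python functions. Note that Python's built-in `hash()` of strings is salted per process (see PYTHONHASHSEED), so hash-based buckets are stable within a run but not across runs:

```python
# Plain-Python versions of the bucketing functions quoted above.
def bucket_by_first_char(record_id, n=100):
    # Mirrors the PYTHON one-liner: ord(id[0]) % 100
    return ord(record_id[0]) % n

def bucket_by_hash(record_id, n=100):
    # abs(hash(...)) plays the role of Java's (hashCode() & Integer.MAX_VALUE):
    # both force a non-negative value before taking the modulus.
    return abs(hash(record_id)) % n

assert 0 <= bucket_by_first_char("user-42") < 100
assert 0 <= bucket_by_hash("user-42") < 100
```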
Are you draining[1] your pipeline or simply canceling it and starting a new
one? Draining should close open windows and attempt to flush all in-flight
data before shutting down. For PubSub you may also need to read from
subscriptions rather than topics to ensure messages are processed by either
one
it begins
>> to create an error. My speculation is the containers don't recognise each
>> other and get killed by the Flink task manager. I see containers are kept
>> created and killed.
>>
>> Does every multi-language pipeline run in a separate container?
>>
The Python Local Runner has limited support for streaming pipelines.
For the time being I would recommend using Dataflow or Flink (the latter
can be run locally) to try out streaming pipelines.
On Fri, Mar 8, 2024 at 2:11 PM Puertos tavares, Jose J (Canada) via
user wrote:
>
> Hello Hu:
>
>
>
> No
t the flink runner. A flink cluster
> is started locally.
>
> On Thu, 7 Mar 2024 at 12:13, Robert Bradshaw via user
> wrote:
Streaming portable pipelines are not yet supported on the Python local runner.
On Wed, Mar 6, 2024 at 5:03 PM Jaehyeon Kim wrote:
>
> Hello,
>
> I use the python SDK and my pipeline reads messages from Kafka and transforms
> via SQL. I see two containers are created but it seems that they don't
There is no longer a huge amount of active development going on here,
but implementing a missing function seems like an easy contribution
(lots of examples to follow). Otherwise, definitely worth filing a
feature request as a useful signal for prioritization.
On Mon, Mar 4, 2024 at 4:33 PM Jaehyeo
There is no difference; FlatMapElements is implemented in terms of a
DoFn that invokes context.output multiple times. And, yes, Dataflow
will fuse consecutive operations automatically. So if you have
something like
... -> DoFnA -> DoFnB -> GBK -> DoFnC -> ...
Dataflow will fuse DoFnA and DoFnB to
On Wed, Jan 24, 2024 at 10:48 AM Mark Striebeck
wrote:
>
> If I point Beam to the local jar, will Beam start and also stop the expansion
> service?
Yes it will.
> Thanks
> Mark
>
> On Wed, 24 Jan 2024 at 08:30, Robert Bradshaw via user
> wrote:
You can also manually designate a replacement jar to be used rather
than fetching the jar from maven, either as a pipeline option or (as
of the next release) as an environment variable. The format is a json
mapping from gradle targets (which is how we identify these jars) to
local files (or urls).
This is probably because you're trying to index into the result of the
GroupByKey in your AnalyzeSession as if it were a list. All that is
promised is that it is an iterable. If it is large enough to merit
splitting over multiple fetches, it won't be a list. (If you need to
index, explicitly convert it to a list first.)
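The distinction can be illustrated with an ordinary generator, which behaves like the grouped-values iterable here:

```python
# A generator behaves like the iterable GroupByKey hands you: you can
# loop over it, but you cannot index into it.
def grouped_values():
    yield from ("a", "b", "c")

try:
    grouped_values()[0]          # what indexing the GBK result attempts
except TypeError:
    pass                         # iterables need not be subscriptable

values = list(grouped_values())  # explicit conversion when indexing is needed
assert values[0] == "a"
```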
Reshuffle is perfectly fine to use if the goal is just to redistribute
work. It's only deprecated as a "checkpointing" mechanism.
On Fri, Jan 19, 2024 at 9:44 AM Danny McCormick via user
wrote:
>
> For runners that support Reshuffle, it should be safe to use. It's been
> "deprecated" for 7 years,
Nothing problematic is standing out for me in those logs. A job
service and artifact staging service is spun up to allow the job (and
its artifacts) to be submitted, then they are shut down. What are the
actual errors that you are seeing?
On Wed, Jan 3, 2024 at 7:39 AM Lydian wrote:
>
>
> Hi,
>
>
And should it be a list of strings, rather than a string?
On Tue, Dec 19, 2023 at 10:10 AM Anand Inguva via user
wrote:
> Can you try passing `extra_packages` instead of `extra_package` when
> passing pipeline options as a dict?
>
> On Tue, Dec 19, 2023 at 12:26 PM Sumit Desai via user <
> user@
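A sketch of the corrected options dict (the package path and runner are placeholders); something like this would then be handed to PipelineOptions.from_dictionary:

```python
# 'extra_packages' (plural) expects a list of strings; the tarball path
# and runner below are placeholders, not values from the original thread.
options_dict = {
    "runner": "DataflowRunner",
    "extra_packages": ["/path/to/my_package-1.0.tar.gz"],  # a list, not a string
}
assert isinstance(options_dict["extra_packages"], list)
```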
Currently error handling is implemented on sinks in an ad-hoc basis
(if at all) but John (cc'd) is looking at improving things here.
On Mon, Dec 4, 2023 at 10:25 AM Juan Romero wrote:
>
> Hi guys. I want to ask you about how to deal with the scenario when the
> target sink (eg: jdbc, kafka, bigq
up a PR
>>>>>>>>
>>>>>>>> On Thu, Oct 5, 2023, 12:49 PM Joey Tran
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Is it really toggleable in Java? I imagine that if it's a tog
On Thu, Oct 19, 2023 at 2:00 PM Joey Tran wrote:
>
> For the python SDK, is there somewhere where we document more "advance"
> composite transform operations?
I'm not sure, but
https://beam.apache.org/documentation/programming-guide/ is the
canonical place where information like this should probably be
ut this probably ties
>>>>>>> into re-thinking how pipeline update should work.)
>>>>>>>
>>>>>>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Makes sense tha
d examples don't
>>>>>> have a way to see the outputs of a run example? I have a vague memory
>>>>>> that
>>>>>> there used to be a way to navigate to an output file after it's generated
>>>>>> but not sure if I just dreamt
example? I have a vague memory that
>>>> there used to be a way to navigate to an output file after it's generated
>>>> but not sure if I just dreamt that up. Playing with the examples, I wasn't
>>>> positive if my runs were actually succeeding or not base
] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>
> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
> user@beam.apache.org> wrote:
>
BeamJava and BeamPython have the exact same behavior: transform names
within must be distinct [1]. This is because we do not necessarily know at
pipeline construction time if the pipeline will be streaming or batch, or
if it will be updated in the future, so the decision was made to impose
this restriction.
Yes, for sure. This is one of the areas Beam excels vs. simpler tools
like SQL. You can write arbitrary code to iterate over arbitrary structures
in the typical Java/Python/Go/Typescript/Scala/[pick your language] way. In
Beam nomenclature, UDFs correspond to DoFns and UDAFs correspond to
CombineFns.
>
>
> Is that assumption correct?
>
>
>
> El El vie, 15 de septiembre de 2023 a la(s) 10:59, Robert Bradshaw via
> user escribió:
>
Beam will block on side inputs until at least one value is available (or
the watermark has advanced such that we can be sure one will never become
available, which doesn't really apply to the global window case).
After that, workers generally cache the side input value (for performance
reasons) but
rDo, extract it's
>> dofn, wrap it, and return a new ParDo
>>
>> On Fri, Sep 15, 2023, 11:53 AM Robert Bradshaw via user <
>> user@beam.apache.org> wrote:
>>
+1 to looking at composite transforms. You could even have a composite
transform that takes another transform as one of its construction arguments
and whose expand method does pre- and post-processing to the inputs/outputs
before/after applying the transform in question. (You could even implement
t
(As an aside, I think all of these options would make for a great blog post
if anyone is interested in authoring one of those...)
On Fri, Sep 1, 2023 at 9:26 AM Robert Bradshaw wrote:
You can also use Python's RenderRunner, e.g.
python -m apache_beam.examples.wordcount --output out.txt \
--runner=apache_beam.runners.render.RenderRunner \
--render_output=pipeline.svg
This also has an interactive mode, triggered by passing --port=N (where 0
can be used to pick an unused port).
On Thu, Aug 24, 2023 at 12:58 PM Chamikara Jayalath
wrote:
>
>
> On Thu, Aug 24, 2023 at 12:27 PM Robert Bradshaw
> wrote:
I would like to figure out a way to get the stream-y interface to work, as
I think it's more natural overall.
One hypothesis is that if any elements are carried over loop iterations,
there will likely be some that are carried over beyond the loop (after all
the callee doesn't know when the loop is
Ah, I see.
Yeah, I've thought about using an iterable for the whole bundle rather than
start/finish bundle callbacks, but one of the questions is how that would
impact implicit passing of the timestamp (and other) metadata from
input elements to output elements. (You can of course attach the metad
Neat.
Nothing like writing an SDK to actually understand how the FnAPI works :).
I like the use of groupBy. I have to admit I'm a bit mystified by the
syntax for parDo (I don't know Swift at all, which is probably tripping me
up). The addition of external (cross-language) transforms could let you
ain for the help,
>>>>> Joey
>>>>>
>>>>> On Fri, Jun 23, 2023 at 8:34 PM Chamikara Jayalath
>>>>> wrote:
>>>>>>
>>>>>> Another advantage of a portable runner would be that it will be using
>
Your SDF looks fine. I wonder if there is an issue with how Flink is
implementing SDFs (e.g. not garbage collecting previous remainders).
On Tue, Jul 18, 2023 at 5:43 PM Nimalan Mahendran
wrote:
>
> Hello,
>
> I am running a pipeline built in the Python SDK that reads from a Redis
> stream via a
ing
>>>> well defined and backwards compatible Beam portable APIs to communicate
>>>> with SDKs. I think this is specially important for runners that do not live
>>>> in the Beam repo since otherwise future SDK releases could break your
>>>> runner in subtle
Contributions welcome! I don't think we're at the point we can stop
supporting Pandas 1.x though, so we'd have to do it in such a way as to
support both.
On Wed, Jul 12, 2023 at 4:53 PM XQ Hu via user wrote:
> https://github.com/apache/beam/issues/27221#issuecomment-1603626880
>
> This tracks th
Also portability gives you more flexibility when it
>> comes to choosing an SDK to define the pipeline and will allow you to
>> execute transforms in any SDK via cross-language.
>>
>> Thanks,
>> Cham
>>
>> On Fri, Jun 23, 2023 at 1:57 PM Robert Bradshaw via
ly
>> implementing a distributed shuffle, starting and assigning work to multiple
>> workers, etc. but presumably that's the kind of thing your execution
>> environment already takes care of.
>>
>> As for some more concrete pointers, you could probably leverage a l
ur compute infrastructure. (If
you're not doing streaming, this is much more straightforward than all the
bundler scheduler stuff that currently exists in that code).
>
>
>
>
>
> On Fri, Jun 23, 2023 at 12:17 PM Alexey Romanenko <
> aromanenko@gmail.com> wrote:
>
On Fri, Jun 23, 2023, 7:37 AM Alexey Romanenko
wrote:
> If Beam Runner Authoring Guide is rather high-level for you, then, at
> first, I’d suggest to answer two questions for yourself:
> - Am I going to implement a portable runner or native one?
>
The answer to this should be portable, as non-por
> [1] https://gist.github.com/egalpin/162a04b896dc7be1d0899acf17e676b3
>
> On Thu, May 25, 2023 at 2:25 PM Robert Bradshaw via user <
> user@beam.apache.org> wrote:
The GbkBeforeStatefulParDo is an implementation detail used to send all
elements with the same key to the same worker (so that they can share
state, which is itself partitioned by worker). This does cause a global
barrier in batch pipelines.
On Thu, May 25, 2023 at 2:15 PM Evan Galpin wrote:
> H
Generally these types of vulnerabilities are only exploitable when
processing untrusted data and/or exposing a public service to the
internet. This is not the typical use of Beam (especially the latter),
but that's not to say Beam can't be used in this way. That being said,
it's preferable to simpl
On Fri, Apr 21, 2023 at 3:37 AM Pavel Solomin wrote:
>
> Thank you for the information.
>
> I'm assuming you had a unique ID in records, and you observed some IDs
> missing in Beam output comparing with Spark, and not just some duplicates
> produced by Spark.
>
> If so, I would suggest to create
You are correct in that the data may arrive in an unordered way.
However, once a window finishes, you are guaranteed to have seen all
the data up to that point (modulo late data) and can then confidently
compute your ordered cumulative sum.
You could do something like this:
def cumulative_sums(ke
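The reply's code is cut off above; here is a pure-Python sketch of what a cumulative_sums helper could look like (the function name is from the snippet, but the body is an assumption, not the original):

```python
# Once a window closes you hold all (timestamp, value) pairs for a key,
# so you can order them and emit running totals with confidence.
def cumulative_sums(key, timestamped_values):
    total = 0
    for ts, value in sorted(timestamped_values):
        total += value
        yield key, ts, total

sums = list(cumulative_sums("k", [(2, 10), (1, 5), (3, 1)]))
assert sums == [("k", 1, 5), ("k", 2, 15), ("k", 3, 16)]
```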
Docker is not necessary to expand the transform (indeed, by default it
should just pull the jar and invoke it directly to start the expansion
service), but it is used as the environment in which to execute the
expanded transform.
It would be in theory possible to run the worker without docker a
;>>> On Mon, Apr 17, 2023 at 8:08 AM Reuven Lax wrote:
>>>>>
>>>>>> Are you running on the Dataflow runner? If so, Dataflow - unlike
>>>>>> Spark and Flink - dynamically modifies the parallelism as the operator
>>>>>> runs, so there
What are you trying to achieve by setting the parallelism?
On Sat, Apr 15, 2023 at 5:13 PM Jeff Zhang wrote:
> Thanks Reuven, what I mean is to set the parallelism in operator level.
> And the input size of the operator is unknown at compiling stage if it is
> not a source
> operator,
>
> Here'
That is correct.
On Tue, Apr 11, 2023 at 5:44 AM Hans Hartmann wrote:
>
> Hello,
>
> i'm wondering if Apache Beam is using the message guarantees of the
> execution engines, that the pipeline is running on.
>
> So if i use the SparkRunner the consistency guarantees are exactly-once?
>
> Have a g
On Mon, Mar 13, 2023 at 11:33 AM Godefroy Clair
wrote:
> Hi,
> I am wondering about the way `Flatten()` and `FlatMap()` are implemented
> in Apache Beam Python.
> In most functional languages, FlatMap() is the same as composing
> `Flatten()` and `Map()` as indicated by the name, so Flatten() and
Whenever state is used, the runner will arrange such that the same
keys will all go to the same worker, which often involves injecting a
shuffle-like operation if the keys are spread out among many workers
in the input. (An alternative implementation could involve storing the
state in a distributed
Seems reasonable to me.
On Tue, Feb 7, 2023 at 4:19 PM Luke Cwik via user wrote:
>
> As per [1], the JDK8 and JDK11 containers that Apache Beam uses have stopped
> being built and supported since July 2022. I have filed [2] to track the
> resolution of this issue.
>
> Based upon [1], almost eve
You should be able to omit the environment_type and environment_config
variables and they will be populated automatically. For running
locally, the flink_master parameter is not needed either (one will be
started up automatically).
On Fri, Feb 3, 2023 at 12:51 PM Talat Uyarer via user
wrote:
>
>
I'm also not sure it's part of the contract that the containerization
technology we use will always have these capabilities.
On Mon, Jan 30, 2023 at 10:53 AM Chad Dombrova wrote:
>
> Hi Valentyn,
>
>>
>> Beam SDK docker containers on Dataflow VMs are currently launched in
>> privileged mode.
>
>
Different idea: is it possible to serve this data via another protocol
(e.g. sftp) rather than requiring a mount?
On Mon, Jan 30, 2023 at 9:26 AM Chad Dombrova wrote:
>
> Hi Robert,
> I know very little about the FileSystem classes, but I don’t think it’s
> possible for a process running in dock
If it's your input/output data, presumably you could implement a
https://beam.apache.org/releases/javadoc/2.3.0/org/apache/beam/sdk/io/FileSystem.html
for nfs. (I don't know what all that would entail...)
On Mon, Jan 30, 2023 at 9:04 AM Chad Dombrova wrote:
>
> Hi Israel,
> Thanks for responding.