[Bug?] Combiner components don't inherit annotations of source CombineByKey

2023-09-15 Thread Joey Tran
While implementing a runner, we tried annotating a CombineByKey transform. I noticed that the annotations for the CBK are then lost in the fusion optimization stage when the CBK is broken into components. Is this intentional? I can put up a PR if this seems worth fixing (it is for us at least), ju

Re: [Bug?] Combiner components don't inherit annotations of source CombineByKey

2023-09-15 Thread Joey Tran
Ended up just filing a PR [1] [1] https://github.com/apache/beam/pull/28489 On Fri, Sep 15, 2023 at 12:51 PM Joey Tran wrote: > While implementing a runner, we tried annotating a CombineByKey transform. > I noticed that the annotations for the CBK are then lost in the fusion > opt

Runner Bundling Strategies

2023-09-21 Thread Joey Tran
Writing a runner and the first strategy for determining bundling size was to just start with a bundle size of one and double it until we reach a size that we expect to take some targets per-bundle runtime (e.g. maybe 10 minutes). I realize that this isn't the greatest strategy for high sized cost t

Re: Runner Bundling Strategies

2023-09-22 Thread Joey Tran
.apache.org/releases/pydoc/current/apache_beam.transforms.util.html#apache_beam.transforms.util.BatchElements On Thu, Sep 21, 2023 at 7:23 PM Joey Tran wrote: > Writing a runner and the first strategy for determining bundling size was > to just start with a bundle size of one and doubl

Re: Runner Bundling Strategies

2023-09-22 Thread Joey Tran
then it is > the runner's job to put as many calls to @ProcessElement as possible to > amortize. > > Kenn > > On Fri, Sep 22, 2023 at 9:39 AM Joey Tran > wrote: > >> Whoops, I typoed my last email. I meant to write "this isn't the >> greatest st

Re: [QUESTION] Why no auto labels?

2023-10-04 Thread Joey Tran
> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran > wrote: > >> Not sure what that suggests >> >> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user wrote: >> >>> Looks like this is the current behaviour. If you have `t = >>> beam.Filter(identity_filter)`, `t.l

Re: [QUESTION] Why no auto labels?

2023-10-05 Thread Joey Tran
; > [1] Note that this applies to the fully qualified transform name, so the > naming only has to be distinct within a composite transform (or at the top > level--the pipeline itself is isomorphic to a single composite transform). > > On Wed, Oct 4, 2023 at 3:43 AM Joey Tran > wro

Re: [QUESTION] Why no auto labels?

2023-10-05 Thread Joey Tran
7;s togglable > with an option now. We should probably add the option to toggle Python too. > (Unclear what the default should be, but this probably ties into > re-thinking how pipeline update should work.) > > On Thu, Oct 5, 2023 at 4:58 AM Joey Tran > wrote: > >> Makes

Re: [QUESTION] Why no auto labels?

2023-10-10 Thread Joey Tran
Bump on this. Sorry to pester - I'm trying to get a few teams to adopt Apache Beam at my company and I'm trying to foresee parts of the API they might find inconvenient. If there's a conclusion to make the behavior similar to java, I'm happy to put up a PR On Thu, Oct 5, 2023

Re: [QUESTION] Why no auto labels?

2023-10-13 Thread Joey Tran
behavior similar to java, I'm happy >> to put up a PR >> >> On Thu, Oct 5, 2023, 12:49 PM Joey Tran >> wrote: >> >>> Is it really toggleable in Java? I imagine that if it's a toggle it'd be >>> a very sticky toggle since i

Re: [QUESTION] Why no auto labels?

2023-10-13 Thread Joey Tran
uld then attribute old B_2's state > to the new B_2 (and also possibly mis-direct any inflight messages). At > least with the old, intersecting names we can detect this problem > rather than silently give corrupt data. > > > On Fri, Oct 13, 2023 at 7:15 AM Joey Tran >

Re: [QUESTION] Why no auto labels?

2023-10-13 Thread Joey Tran
On Fri, Oct 13, 2023 at 1:18 PM Robert Bradshaw wrote: > On Fri, Oct 13, 2023 at 10:08 AM Joey Tran > wrote: > Are there places on the SDK side that expect unique labels? Or in >> non-updateable runners? >> > > That's a good question. The label eventually en

[PYTHON] partitioner utilities?

2023-10-19 Thread Joey Tran
Hey all, While writing a few pipelines, I was surprised by how few partitioners there were in the python SDK. I wrote a couple that are pretty generic and possibly generally useful. Just wanted to do a quick poll to see if they seem useful enough to be in the sdk's library of transforms. If so, I

Re: [PYTHON] partitioner utilities?

2023-10-19 Thread Joey Tran
r PR. > > Thanks, > Danny > > On Thu, Oct 19, 2023 at 10:06 AM Joey Tran > wrote: > >> Hey all, >> >> While writing a few pipelines, I was surprised by how few partitioners >> there were in the python SDK. I wrote a couple that are pretty generic and >&

Re: [PYTHON] partitioner utilities?

2023-10-19 Thread Joey Tran
/github.com/apache/beam/blob/68e9c997a9085b0cb045238ae406d534011e7c21/sdks/python/apache_beam/transforms/combiners.py#L191 > > On Thu, Oct 19, 2023 at 3:21 PM Joey Tran > wrote: > >> Yes, both need to be small enough to fit into state. >> >> Yeah a percentage sam

Re: [QUESTION] Why no auto labels?

2023-10-20 Thread Joey Tran
ed [1] https://github.com/apache/beam/blob/e7a6405800a83dd16437b8b1b372e020e010a042/sdks/java/core/src/main/java/org/apache/beam/sdk/Pipeline.java#L630 On Fri, Oct 13, 2023 at 1:32 PM Joey Tran wrote: > > > On Fri, Oct 13, 2023 at 1:18 PM Robert Bradshaw > wrote: > >> On Fri,

Re: [PYTHON] partitioner utilities?

2023-10-23 Thread Joey Tran
PR for top: https://github.com/apache/beam/pull/29106 On Mon, Oct 23, 2023 at 10:11 AM XQ Hu via dev wrote: > +1 on this idea. Thanks! > > On Thu, Oct 19, 2023 at 3:40 PM Joey Tran > wrote: > >> Yeah, I already implemented these partitioners for my use case (I just >

Hiding logging for beam playground examples

2023-11-14 Thread Joey Tran
Hi all, I just had a workshop to demo beam for people at my company and there was a bit of confusion about whether the beam python playground examples were even working and it turned out they just got confused by all the runner logging that is output. Is this worth keeping? It seems like it'd be

Re: Hiding logging for beam playground examples

2023-11-15 Thread Joey Tran
> wrote: > >> +1 to at least setting the log level to higher than info. Some runner >> logging (e.g. job started/done) may be useful. >> >> On Tue, Nov 14, 2023 at 9:37 AM Joey Tran >> wrote: >> > >> > Hi all, >> > >> >

Re: Hiding logging for beam playground examples

2023-11-16 Thread Joey Tran
could try to make it > crash and maybe find a stacktrace? Setting logging could like like so: > https://github.com/apache/beam/blob/729c4de416b8252ec99f0a1253ac7af3023733df/sdks/python/apache_beam/examples/wordcount.py#L110 > > On Wed, Nov 15, 2023 at 12:06 PM Joey Tran > wrote: >

How do side inputs relate to stage fusion?

2023-12-14 Thread Joey Tran
Hey all, We have a pretty big pipeline and while I was inspecting the stages, I noticed there is less fusion than I expected. I suspect it has to do with the heavy use of side inputs in our workflow. In the python sdk, I see that side inputs are considered when determining whether two stages are f

Re: How do side inputs relate to stage fusion?

2023-12-14 Thread Joey Tran
but they principally boil down to > shapes that look like this. > > Though this does not introduce a global barrier in streaming, there is > still the analogous per window/watermark barrier that prevents fusion for > the same reasons. > > > > > On Thu, Dec 14, 2023 a

Re: How do side inputs relate to stage fusion?

2023-12-15 Thread Joey Tran
also uses Stateful transforms, or SplittableDoFns, those > are usually relegated to the root of a fused stage, and avoids fusions with > each other. That can also cause additional stages. > > If Beam adopted a rigorous notion of Key Preserving for transforms, > multiple stateful tr

Re: How do side inputs relate to stage fusion?

2023-12-15 Thread Joey Tran
/manage hints as they like. > Transform annotations might be an alternative, but how those are managed > would be more SDK specific. > > On Fri, Dec 15, 2023, 5:21 AM Joey Tran wrote: > >> I figured out my issue. I thought side inputs were breaking up my >> pipeline but after ex

Re: How do side inputs relate to stage fusion?

2023-12-15 Thread Joey Tran
ow them over time. > > On Fri, Dec 15, 2023 at 5:57 AM Joey Tran > wrote: > >> Yeah I can confirm for the python runners (based on my reading of the >> translations.py [1]) that only identical environments are merged together. >> >> The funny thing is that we _origi

Constant for beam:runner:executable_stage:v1 ?

2023-12-20 Thread Joey Tran
Hey all, Is there a particular reason we hard code "beam:runner:executable_stage:v1" everywhere in the python SDK instead of putting it in common_urns?

Re: Constant for beam:runner:executable_stage:v1 ?

2023-12-20 Thread Joey Tran
t of the model so much as an implementation > detail, but it likely does make sense to put somewhere common. > > On Wed, Dec 20, 2023 at 12:55 PM Joey Tran > wrote: > >> Hey all, >> >> Is there a particular reason we hard code >> "beam:runner:executab

(python SDK) "Any" coder bypasses registry coders

2024-01-05 Thread Joey Tran
I've been working with a few data types that are in practice unpicklable and I've run into a couple issues stemming from the `Any` type hint, which when used, will result in the PickleCoder getting used even if there's a coder in the coder registry that matches the data element. This was pretty un

Re: (python SDK) "Any" coder bypasses registry coders

2024-01-05 Thread Joey Tran
non-obvious downstream issues.* On Fri, Jan 5, 2024 at 12:05 PM Robert Bradshaw via dev wrote: > On Fri, Jan 5, 2024 at 7:38 AM Joey Tran > wrote: > >> I've been working with a few data types that are in practice >> unpicklable and I've run into a couple issu

Re: (python SDK) "Any" coder bypasses registry coders

2024-01-05 Thread Joey Tran
Oh actually, overriding the fallback coder doesn't actually do anything because the issue is not with the fallback coders in the registry but the fastprimitivescoder's fallback coder On Fri, Jan 5, 2024 at 12:42 PM Joey Tran wrote: > > I think my original message made it s

[python] Why CombinePerKey(lambda vs: None)?

2024-01-26 Thread Joey Tran
Hey all, I was poking around and looking at `Distinct` and was confused about why it was implemented the way it was. Reproduced here: @ptransform_fn @typehints.with_input_types(T) @typehints.with_output_types(T) def Distinct(pcoll): # pylint: disable=invalid-name """Produces a PCollection cont

Re: [python] Why CombinePerKey(lambda vs: None)?

2024-01-26 Thread Joey Tran
hine traffic, with this form the first > worker will only emit [A, B] and the second [B, C] and only the B > needs to be deduplicated post-shuffle. > > Wouldn't hurt to have a comment to that effect there. > > https://beam.apache.org/documentation/programming-guide/#combine > >

Re: [python] Why CombinePerKey(lambda vs: None)?

2024-01-26 Thread Joey Tran
hem back to the full groupbykey. Thanks for the clarification! On Fri, Jan 26, 2024 at 12:03 PM Robert Bradshaw wrote: > On Fri, Jan 26, 2024 at 8:43 AM Joey Tran > wrote: > > > > Hmm, I think I might still be missing something. CombinePerKey is made > up of "GBK() | Co

Playground: File Explorer?

2024-02-07 Thread Joey Tran
Hey all, I've been really trying to use Playground for educating new Beam users but it feels like there's something missing. A lot of examples (e.g. Multiple ParDo Outputs) for at least the python API don't seem to do anything observable. For example, the Multiple ParDo Outputs example writes to a

Re: Playground: File Explorer?

2024-02-08 Thread Joey Tran
g to? I checked a few > examples and usually we use beam.Map(print) to display some output values. > > On Wed, Feb 7, 2024 at 8:55 PM Joey Tran > wrote: > >> Hey all, >> >> I've been really trying to use Playground for educating new Beam users >> but it

Re: Playground: File Explorer?

2024-02-08 Thread Joey Tran
4.0. > > On Thu, Feb 8, 2024, 7:18 AM Joey Tran wrote: > >> Here's two: >> >> https://play.beam.apache.org/?path=SDK_PYTHON_MultipleOutputPardo&sdk=python >> https://play.beam.apache.org/?path=SDK_PYTHON_WordCount&sdk=python >> >> Also, how

Issue building python SDK with M2 Mac

2024-03-07 Thread Joey Tran
Hey all, I'm trying to get a beam python SDK dev environment going but I'm a bit stuck. I'm just settings things up with a virtual env as specified in the docs[1], but `pip install -e .[gcp,test]` ends with a clang error: ``` clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DND

Re: Hiding logging for beam playground examples

2024-03-07 Thread Joey Tran
quot;"""""""""""""""""""""""""""""""""""""""""""""""""

Re: Hiding logging for beam playground examples

2024-03-08 Thread Joey Tran
andrey.devyat...@akvelon.com> wrote: > Hi Joey, > > Thanks for reaching out! I see that your changes haven't been deploed yet, > so I've triggered the corresponding job and Playground will be updated soon. > > Thanks, > Andrey > > > > *From: *Joey Tran

Python API: FlatMap default -> lambda x:x?

2024-03-21 Thread Joey Tran
Hey all, Using an identity function for FlatMap comes up more often than using FlatMap without an identity function. Would it make sense to use the identity function as a default?

Re: Python API: FlatMap default -> lambda x:x?

2024-03-21 Thread Joey Tran
> wrote: > Hi, you can use beam.Flatten() instead. > > On Thu, Mar 21, 2024 at 10:55 AM Joey Tran > wrote: > >> Hey all, >> >> Using an identity function for FlatMap comes up more often than using >> FlatMap without an identity function. Would it make sense to use the >> identity function as a default? >> >> >> >>

Re: Python API: FlatMap default -> lambda x:x?

2024-03-21 Thread Joey Tran
latMap a default arg of lambda x: x is an interesting idea. The >> only downside I see is a less clear error if one forgets to provide this >> (now mandatory) parameter, but maybe that's low enough to be worth the >> convenience? >> >> On Thu, Mar 21, 2024 at 12:02 

Re: Python API: FlatMap default -> lambda x:x?

2024-03-21 Thread Joey Tran
bmission time in Python if you pass a single > PCollection to Flatten. The scenario you describe concerns a one-element > list. > > On Thu, Mar 21, 2024, 13:43 Joey Tran wrote: > >> I think it'd be quite surprising if beam.Flatten would become equivalent >> to FlatMap

Re: Python API: FlatMap default -> lambda x:x?

2024-03-21 Thread Joey Tran
; > seems a bit error prone. > > > On Thu, Mar 21, 2024 at 2:23 PM Joey Tran > wrote: > >> Ah, I misunderstood your original suggestion then. That makes sense then. >> I have already seen someone get a little confused about the names and >> surprised that Flatten does

container dev environment: go get issue

2024-03-22 Thread Joey Tran
Hi, I've been banging my head trying to get a dev environment working. I gave up trying to get a local python environment working after I got some weird clang errors and proto generation issues so I've been trying to just use the docker container by running `bash start-build-env.sh` but I'm runni

Re: container dev environment: go get issue

2024-03-22 Thread Joey Tran
t all. I > would remove that 'go get' line. > > There's a different issue at play here too since it was written for > pre-module Go in mind. I'm unfamiliar with that script though. > > I'll take a proper look in a few hours. > > On Fri, Mar 22, 2024, 5

[Python SDK] Feedback for deferred side inputs + combiners

2024-03-29 Thread Joey Tran
I posted a PoC PR [1] for fixing deferred side inputs with combiners in the python SDK. Would someone be willing to take a look at it? I have it working but could use some feedback on where to take it next. It looks like bundle processor combiner operations don't currently support side inputs [2]

tox issues in dev container

2024-04-05 Thread Joey Tran
I think I might be doing something silly with my environment. I'm trying to lint using tox in a dev container, but running tox ends with this error: ``` (env) jtran@[Beam Build Env.]:~/beam {flatmapdefault} ] $ tox File "/usr/lib/python3/dist-packages/tox/reporter.py", line 32, in __init__

Re: tox issues in dev container

2024-04-05 Thread Joey Tran
Yeah that was the tox command I was running On Fri, Apr 5, 2024, 4:37 PM XQ Hu via dev wrote: > > https://cwiki.apache.org/confluence/display/BEAM/Python+Tips#PythonTips-LintandFormattingChecks > > This generally works well. Have you checked this? > > On Fri, Apr 5, 2024 at

Re: [Python SDK] Feedback for deferred side inputs + combiners

2024-04-11 Thread Joey Tran
are relevant. >- proposed approach and considered alternatives >- any runner-specific considerations. > > Thanks, > Valentyn > > On Fri, Mar 29, 2024 at 5:06 AM Joey Tran > wrote: > >> I posted a PoC PR [1] for fixing deferred side inputs with combiners in >&

[Python SDK] Allowing int values for floats when checking types

2024-06-28 Thread Joey Tran
Hey all, We're running into a fairly common situation where we get a TypeCheckError when we supply an int value where a float parameter is expected. It's easy enough to work around this on a case-by-case basis by just casting ints to floats, but this is an easy thing to forget to do, which means

Beam Example Bugs

2024-07-03 Thread Joey Tran
There are a handful of bugs with the interactive examples on the Beam docs page https://beam.apache.org/documentation/transforms/python/elementwise/flatmap/#example-3-flatmap-without-a-function https://beam.apache.org/documentation/transforms/python/elementwise/map/ (example 8) I wonder if it mig

Heartbeat for worker?

2024-10-04 Thread Joey Tran
Hey all, I have a python runner I've written and I'm debugging the case when a SDK worker crashes. Currently my driver/ runner starts the SDK worker and then pushes data / instructions through the data and control connections, but this just silently waits forever if the worker has actually crashed

Re: Globally, PerKey, and... PerGrouped?

2024-09-29 Thread Joey Tran
Ah I realize now that `standard_optimize_phases` is not actually currently used by the fn runner so these changes don't effectively do anything On Sun, Sep 29, 2024 at 3:19 PM Joey Tran wrote: > I took a crack at trying to replace GBK+CombineValues with CBK so then it > doesn't

Re: Globally, PerKey, and... PerGrouped?

2024-09-29 Thread Joey Tran
I took a crack at trying to replace GBK+CombineValues with CBK so then it doesn't matter what the user chooses to do. https://github.com/apache/beam/pull/32592 On Fri, Sep 27, 2024 at 4:32 PM Joey Tran wrote: > Hmm, it makes sense from a runner optimization perspective, but I think

search for python SDK API Docs appears to be broken

2024-10-18 Thread Joey Tran
https://beam.apache.org/releases/pydoc/current/search.html?q=with_exception_handling&check_keywords=yes&area=default This page for example doesn't seem to load or for a couple people I asked to try it. Just wanted to report.

Re: Globally, PerKey, and... PerGrouped?

2024-10-04 Thread Joey Tran
Would anyone be up to take a look at the approach here? I think the standard translation phases are currently used by the portable runner so the portable runner tests might indicate it works? On Sun, Sep 29, 2024, 3:19 PM Joey Tran wrote: > I took a crack at trying to replace

Re: Heartbeat for worker?

2024-10-04 Thread Joey Tran
om the status message, so I don't think there's value in another > RPC. > > Grpc possibly has something built in as well, if only for connection > monitoring. > > On Fri, Oct 4, 2024, 10:46 AM Joey Tran wrote: > >> Hey all, >> >> I have a python runner

Re: Heartbeat for worker?

2024-10-04 Thread Joey Tran
> tells you it is alive and doing work as needed. I had do that via the > control channel. > > Also, I am very curious about this new runner.. mind sharing? > > On Fri, Oct 4, 2024 at 1:46 PM Joey Tran > wrote: > >> Hey all, >> >> I have a python runn

Cancelling a processbundlerequest

2024-10-30 Thread Joey Tran
Hey all, Is there an RPC for cancelling a processbundlerequest? e.g. if a bundle is taking a while and you want it to timeout and start a different bundle?

Re: Cancelling a processbundlerequest

2024-10-30 Thread Joey Tran
If the bundle is a single unsplittable element though a split request won't do anything, right? On Wed, Oct 30, 2024, 6:45 PM Robert Bradshaw via dev wrote: > No, but you can ask it to split ASAP. > > On Wed, Oct 30, 2024 at 2:32 PM Joey Tran > wrote: > > > > Hey

Re: search for python SDK API Docs appears to be broken

2024-11-01 Thread Joey Tran
Looks like this is working with Beam 2.59.0 >> https://beam.apache.org/releases/pydoc/2.59.0/search.html?q=with_exception_handling&check_keywords=yes&area=default >> >> But it is broken on current and 2.60.0. >> >> On Fri, Oct 18, 2024 at 3:39 PM Joey Tran >> wr

Re: Heartbeat for worker?

2024-10-31 Thread Joey Tran
The gRPC worker handlers in the Python SDK don't define any helpers for using the status channel right? If I write these for my runner (in the same vein as how the other gRPC servers are implemented), is this contributing upstream? On Fri, Oct 4, 2024 at 2:33 PM Joey Tran wrote: >

Broken Beam Example

2024-09-23 Thread Joey Tran
Hey all, The ToList python example seems broken. Not sure if there's an issue for this already. Hopefully it's an easy fix (thanks @Efrem Braun for the report) https://beam.apache.org/documentation/transforms/python/aggregation/tolist/ Best, Joey -- Joey Tran | Staff

Re: [QUESTION] Add Company logo

2024-09-24 Thread Joey Tran
I added it just yesterday :) On Tue, Sep 24, 2024, 6:45 PM XQ Hu via dev wrote: > Please check this: > https://github.com/apache/beam/blob/master/website/ADD_LOGO.md > > On Tue, Sep 24, 2024 at 6:05 PM Ahmet Altay via dev > wrote: > >> Hi Hope, >> >> Just checking: Were you able to add your log

Re: [QUESTION] Add Company logo

2024-09-24 Thread Joey Tran
create a pull request when we decide. > > We expect to have a decision in October. > > I also want to see the company logo on the Beam website. > > Thank you :) > > 2024년 9월 25일 (수) 오전 10:10, Joey Tran 님이 작성: > >> I added it just yesterday :) >> >> On Tu

Re: Globally, PerKey, and... PerGrouped?

2024-09-27 Thread Joey Tran
Thoughts @dev for a `GroupedValues` version of combiners? Named `PerGroup` or `PerGroupedValues`? On Fri, Sep 27, 2024 at 2:57 PM Valentyn Tymofieiev wrote: > > On Fri, Sep 27, 2024 at 11:13 AM Joey Tran > wrote: > >> Ah! That is exactly the kind of primitive I was looki

Re: Globally, PerKey, and... PerGrouped?

2024-09-27 Thread Joey Tran
s/portability/fn_api_runner/translations.py#L953 > > https://github.com/apache/beam/blob/release-2.48.0/sdks/python/apache_beam/runners/portability/fn_api_runner/translations.py#L1161 > > > I don't remember what the status is of enabling this by default (IIRC, > they're condit

Re: Building dev docker image

2024-09-19 Thread Joey Tran
ough, if you get it > working for yourself. (Tag lostluck in GitHub). > > I think if you search the dev list, you might find previous discussions on > the topic of the dev container. > > Robert Burke > > > On Thu, Sep 19, 2024, 2:17 PM Joey Tran wrote: > >> Hi

Building dev docker image

2024-09-19 Thread Joey Tran
-o pipefail -c go get github.com/linkedin/goavro/v2" did not complete successfully: exit code: 1 Any tips? Cheers, -- Joey Tran [image: Schrödinger, Inc.] <https://schrodinger.com/>

Re: Building dev docker image

2024-09-19 Thread Joey Tran
cython isn't installed in the virtualenv. Running `pip uninstall cython` resulted in: `WARNING: Skipping cython as it is not installed.` On Thu, Sep 19, 2024 at 7:05 PM Robert Bradshaw via dev wrote: > > > On Thu, Sep 19, 2024 at 3:43 PM Joey Tran > wrote: > >> Ah

Re: Building dev docker image

2024-09-20 Thread Joey Tran
g-out the cython dep in pyproject.toml might > remove cythonization too. > > On Thu, Sep 19, 2024 at 4:16 PM Joey Tran > wrote: > >> cython isn't installed in the virtualenv. Running `pip uninstall cython` >> resulted in: >> `WARNING: Skipping cython as it is

Re: FlumeJava - what happened to PObjects

2025-01-31 Thread Joey Tran
here because I was wondering if the same line of thought might've lead to the removal of PObject (just out of curiosity) On Fri, Jan 31, 2025 at 11:57 AM Kenneth Knowles wrote: > > > On Fri, Jan 31, 2025 at 11:20 AM Joey Tran > wrote: > >> Is there an equivalent

Re: FlumeJava - what happened to PObjects

2025-01-31 Thread Joey Tran
nt of PObject. Given that the Beam API >>> needed to work with the windowing model, we needed something like a PObject >>> that could be windowed. This is what PCollectionView provides. >>> >> >> +1, I'd forgotten about the windowing complexities as well. >> >

[python] CombineGlobally.without_defaults for streaming

2025-02-10 Thread Joey Tran
I'm trying to work with windows and I was hoping my current batch workflows could work transparently with and without windows (I know "without windows" really means with GlobalWindows), but any batch workflow that uses a `CombineGlobally` doesn't work without using `without_defaults` otherwise it g

FlumeJava - what happened to PObjects

2025-01-30 Thread Joey Tran
I read the FlumeJava paper and I was just curious what happened to PObjects. They seem like a useful construct. Do they exist in the java SDK in some version still? Or were they done away with because they made pipeline optimization more difficult?

Sdk worker parallellism

2024-12-05 Thread Joey Tran
Is there any kind of multiprocess parallelism built into the Python sdk worker? In other words, is there a way for my runner to start a worker and have it has multiple cores instead of having one worker per core? I thought I saw this capability somewhere but now that I look I can only see the pipel

[python] flatten unzipping

2025-01-27 Thread Joey Tran
I heard mention that there is a flatten unzipping optimization implemented by some runners. I didn't see that in the python optimizations in translations.py[1]. Just curious what this optimization is? I think I get the general gist in that you dont necessarily need to combine the input pcollection

Re: [python] flatten unzipping

2025-01-27 Thread Joey Tran
25 at 4:09 PM Robert Bradshaw via dev wrote: > On Mon, Jan 27, 2025 at 1:00 PM Joey Tran > wrote: > > > > I heard mention that there is a flatten unzipping optimization > implemented by some runners. I didn't see that in the python optimizations > in translatio

Re: [python] flatten unzipping

2025-01-27 Thread Joey Tran
ecause I had reached my limit > in dealing with the Graph at the time. > > > > The Python SDK approach to the optimizations is very set theory based, > which isn't as natural (to me at least) for handling the flatten unzipping. > > > > Not impossible,

Re: [python] flatten unzipping

2025-01-27 Thread Joey Tran
nd doesn't take into account the size of the data being > materialized). whoa! If the optimization takes into account the size of the data being materialized, does that mean it happens at runtime? On Mon, Jan 27, 2025 at 5:47 PM Robert Bradshaw wrote: > On Mon, Jan 27, 2025 at 2:29 

Re: Testing Beam YAML piplines

2025-03-17 Thread Joey Tran
Got an access denied on the doc. Is intended to be shared publicly? On Mon, Mar 17, 2025 at 6:17 PM Robert Bradshaw via dev wrote: > I've been thinking a bit about productionizing yaml pipelines, and a > large part of that involves being able to write and run tests. > > I've put my thoughts up a

Re: [python] is merge_accumulators called with stream of accumulators?

2025-03-17 Thread Joey Tran
erate over it twice. I'm considering implementing CombineFn execution with my runner such that only two accumulators are ever loaded into memory at a time. Not sure if CombineFn contract supports this though. On Fri, Mar 14, 2025 at 8:20 AM Joey Tran wrote: > My intuition says no says al

Re: Make cloudpickle the default library in Beam 2.65.0

2025-04-28 Thread Joey Tran
Naive question, but why is beam upgrading to cloudpickle? I saw this doc: https://docs.google.com/document/d/1G5Q0ckX5sKQRQD1yEkLCPQL7N6B-AL9Cb1p0zlOOfQU/edit?tab=t.0 Is the main reason because cloudpickle is more actively maintained? On Mon, Apr 28, 2025 at 6:51 PM Claudius van der Merwe wrot

Re: Make cloudpickle the default library in Beam 2.65.0

2025-04-30 Thread Joey Tran
 AM Robert Bradshaw wrote: > On Tue, Apr 29, 2025 at 7:51 PM Joey Tran > wrote: > > > > Does cloudpickle make --save_main_session unnecessary? As in, will more > transforms defined in __main__ "just work"? > > Yes. Or at least it "just works" much more oft

Re: Make cloudpickle the default library in Beam 2.65.0

2025-04-29 Thread Joey Tran
sion and runtime. > - previously, some bugs and feature requests Beam requested in dill took > a long time to be implemented and released. > > [1] https://lists.apache.org/thread/dvxvclhok0fx48955x6szvw4kotxh87n > [2] https://github.com/apache/beam/issues/22893#issuecomment-150235

Re: Website documentation rendering

2025-04-27 Thread Joey Tran
Looking forward to this example. I never knew about this transform :-) On Sun, Apr 27, 2025, 1:52 PM XQ Hu via dev wrote: > If you create a PR, the precommit website stage GCS workflow should be > triggered. You should be able to view this with your PR. Example: > https://github.com/apache/beam/

Re: [YAML/Python] SchemadParDo

2025-05-10 Thread Joey Tran
Not currently On Sat, May 10, 2025, 12:48 AM Reuven Lax wrote: > Does this work with nested fields? Can you specify Input_field="a.b.c"? > > On Fri, May 9, 2025 at 7:18 PM Joey Tran > wrote: > >> Sure! >> >> Given a DoFn that has... >>

[YAML/Python] SchemadParDo

2025-05-09 Thread Joey Tran
I've written a `SchemadParDo(input_field: str, output_field, dofn:DoFn)` transform for more easily writing a Schemad transform given a DoFn. Is this something worth upstreaming into the Beam Python SDK? I wrote it to make it easier to convert our current set of dofn's into schemad dofns for use wi

Re: [YAML/Python] SchemadParDo

2025-05-09 Thread Joey Tran
t;, output_field="word")) And it'd produce Row(word="hello", id="id") and Row(word=""world", id="id") On Fri, May 9, 2025, 9:57 PM Reuven Lax via dev wrote: > Can you explain a bit how SchemadParDo works? > > On Fri, May 9, 2

Re: [YAML/Python] SchemadParDo

2025-05-12 Thread Joey Tran
dofn's that implement buffers. On Mon, May 12, 2025 at 11:30 AM Robert Bradshaw wrote: > On Mon, May 12, 2025 at 8:25 AM Joey Tran > wrote: > >> and preserve rather than elide the input field (at least as an option). >>> >> >> Oh yes, that is act

Re: [YAML/Python] SchemadParDo

2025-05-13 Thread Joey Tran
ds. That also doesn't feel like the greatest solution. Are there any other ideas floating around? On Mon, May 12, 2025 at 3:08 PM Joey Tran wrote: > Does this handle the full DoFn spec (e.g. bundle start/finish, WIndowFn >>> params, state and timers, etc.)? >> >> It

Re: [YAML/Python] SchemadParDo

2025-05-12 Thread Joey Tran
e to adapt a > DoFn to apply to a single field of a Row? This certainly seems to > have value. I might suggest a syntax like > > schema_pcoll | DoToField(SomeDoFn(), input_field="element", > output_field="word") > > and preserve rather than elide the input

Re: MultimapState missing from Python SDK

2025-06-17 Thread Joey Tran
On Tue, Jun 17, 2025 at 7:52 PM Robert Bradshaw wrote: > On Tue, Jun 17, 2025 at 3:57 PM Robert Burke wrote: > > > > +1 > > > > You should be able to use the Prism runner to implement this locally. > > > > Prism passes the full suite of java MultimapState tests, and will ensure > the implementat

Re: [python] subprocess call of "pip freeze" per pipeline

2025-06-05 Thread Joey Tran
ts. > > I'd vote we disable this for Prism and can take that on as part of > enabling prism as the default runner [1]. > > [1] WIP - https://github.com/apache/beam/pull/34612 > > On Wed, Jun 4, 2025 at 10:48 AM Joey Tran > wrote: > >> Hey all, >> >&g

[python] subprocess call of "pip freeze" per pipeline

2025-06-04 Thread Joey Tran
Hey all, We recently upgraded to Beam 2.63 from 2.50. After the upgrade, our unit tests testing our runner saw a 4x-5x performance hit. It turned out it was because for every pipeline run, the default `PipelineRunner.run_pipeline invokes `PIpelineRunner.default_environment`[1] which eventually res

Re: Environments and compatibility Was: [Proposal] Beam ML containers

2025-07-16 Thread Joey Tran
On Wed, Jul 16, 2025 at 2:24 AM Chamikara Jayalath via dev < dev@beam.apache.org> wrote: > > > On Tue, Jul 15, 2025 at 5:32 PM Joey Tran > wrote: > >> "Environment" has become almost-but-not-quite assumed to be a docker >>> container image specifica

[Python] using strictest type hints

2025-07-15 Thread Joey Tran
Hey all, @Jack McCluskey 's great talk on python type hinting at the beam summit taught me that the "trivially" inferred types are only used if the beam Map/FlatMap/Filter/ParDo functions aren't already type hinted. This seems like it could be a waste of information. For example, the following pip

Re: [Python] using strictest type hints

2025-07-16 Thread Joey Tran
`` def row_identity(inp_row: beam.Row) -> beam.Row: return inp_row (p | Create([Row(x=1, y=2)]) | Map(row_identity) ``` Would be able to infer the row schema automatically rather than resorting to using the user-typed and general `beam.Row`. That said, I do agree that the user confusi

Re: [Proposal] Beam ML containers

2025-07-03 Thread Joey Tran
On Tue, Jul 1, 2025 at 2:37 PM Danny McCormick via dev wrote: > I think it is probably reasonable to automate this when a GPU resource > hint is used. I think we still need to expose this as a config option for > the ML containers (and it is the same with distroless) since it is pretty > difficul

  1   2   >