Re: Easy Multi-language via a SchemaTransform-aware Expansion Service

2022-08-08 Thread Robert Bradshaw via dev
This is a great idea. I would like to approach this from the perspective of making it easy to provide a catalog of well-defined transforms for use in expansion services from typical SDKs and also elsewhere (e.g. for documentation purposes, GUIs, etc.) Ideally everything about what a transform is (i

Re: [VOTE] Release 2.41.0, release candidate #1

2022-08-15 Thread Robert Bradshaw via dev
+1 (binding). I verified the release artifacts and signatures, and ran a couple of simple Python pipelines. On Mon, Aug 15, 2022 at 12:40 PM Chamikara Jayalath via dev wrote: > > > > On Mon, Aug 15, 2022 at 11:37 AM Kiley Sok wrote: >> >> Thanks everyone! >> >> @Chamikara Jayalath The Spark iss

Re: Design Doc for Controlling Batching in RunInference

2022-08-15 Thread Robert Bradshaw via dev
Thanks. I added some comments to the doc. I agree with Brian that it makes sense to figure out how this interacts with batched DoFns, as we'll want to migrate to that. (Perhaps they're already ready to migrate to as a first step?) On Fri, Aug 12, 2022 at 1:03 PM Brian Hulette via dev wrote: > >

Re: Forward StackOverflow questions with the apache-beam tag to a new mailing list

2022-08-17 Thread Robert Bradshaw via dev
On Wed, Aug 17, 2022 at 5:15 AM Alexey Romanenko wrote: > > Good point about unanswered SO questions. +1 that we need to improve a > situation there. > > Yes, we may try to stream them to a new dedicated list but it will require > people here to subscribe to and check regularly one more list whi

Re: Cartesian product of PCollections

2022-09-19 Thread Robert Bradshaw via dev
If one of your inputs fits into memory, using side inputs is definitely the way to go. If neither side fits into memory, the cross product may be prohibitively large to compute even on a distributed computing platform (a billion times a billion is big, though I suppose one may hit memory limits wit

Re: Cartesian product of PCollections

2022-09-19 Thread Robert Bradshaw via dev
On Mon, Sep 19, 2022 at 1:53 PM Stephan Hoyer wrote: >> >> > > My team has an internal implementation of a CartesianProduct transform, >> > > based on using hashing to split a pcollection into a finite number of >> > > groups and CoGroupByKey. >> > >> > Could this be contributed to Beam? > > > I
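The hashing-plus-CoGroupByKey idea mentioned in this thread can be illustrated outside of Beam. What follows is a minimal plain-Python sketch, not the internal implementation referred to above (which is not shown in the thread). The function names `shard_of` and `sharded_cartesian_product`, the shard counts, and the use of CRC32 as the hash are all assumptions made for illustration.

```python
import itertools
import zlib
from collections import defaultdict


def shard_of(value, num_shards):
    """Map a value to a shard index using a hash that is stable across runs."""
    return zlib.crc32(repr(value).encode("utf-8")) % num_shards


def sharded_cartesian_product(left, right, num_left_shards=3, num_right_shards=2):
    """Yield every (a, b) pair by joining on a grid of (row, column) cells.

    Hypothetical stand-in for a distributed transform: each cell plays the
    role of one key in a CoGroupByKey.
    """
    cells = defaultdict(lambda: ([], []))
    # A left element hashes to one row and is replicated to every column.
    for a in left:
        i = shard_of(a, num_left_shards)
        for j in range(num_right_shards):
            cells[(i, j)][0].append(a)
    # A right element hashes to one column and is replicated to every row.
    for b in right:
        j = shard_of(b, num_right_shards)
        for i in range(num_left_shards):
            cells[(i, j)][1].append(b)
    # Any given (a, b) meets in exactly one cell: (row of a, column of b).
    for lefts, rights in cells.values():
        yield from itertools.product(lefts, rights)
```

Because each pair meets in exactly one cell, no pair is emitted twice, and no single worker needs to hold either input in full. The cost is that every element is replicated once per shard on the opposite side.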

Re: [VOTE] Release 2.42.0, release candidate #2

2022-10-13 Thread Robert Bradshaw via dev
+1 (binding) Validated release artifacts and signatures. Tested a Python pipeline on a clean install. On Thu, Oct 13, 2022 at 1:22 PM Ritesh Ghorse via dev wrote: > > +1 (non-binding) > Validated Go SDK Quickstart on Direct and Dataflow runner. > > Thanks, > Ritesh Ghorse > > On Thu, Oct 13, 202

Re: Pipeline portable proto visualization

2022-11-07 Thread Robert Bradshaw via dev
I've got one I use in Python too (including drilling down into composites). It's a portable runner. I should clean it up and make it generally available. On Mon, Nov 7, 2022 at 9:25 AM Robert Burke wrote: > > The Go SDK has a "dot" runner to visualize pipeline protos as a dot graph but > it's it

Re: Pipeline portable proto visualization

2022-11-08 Thread Robert Bradshaw via dev
OK, I put this together as a PR: https://github.com/apache/beam/pull/24037 (I think I've written this beam proto -> dot code half a dozen times by now...it'll be good to have it checked in and reusable.) On Mon, Nov 7, 2022 at 11:06 AM Matt Casters wrote: > > Apache Hop pipelines are actually jus

Re: Configuration Driven File Writes

2022-12-02 Thread Robert Bradshaw via dev
Thanks for looking into this and the careful writeup. I've read the design doc and it looks great, but have a couple of questions. (1) Why did you decide on having a single top-level FileWrite transform whose config is ([common_parameters], [xml-params], [csv-params], ...) rather than separate sch

Re: [Proposal] | Move FileIO and TextIO from :sdks:java:core to :sdks:java:io:file

2022-12-12 Thread Robert Bradshaw via dev
Saving up all the breaking changes until a major release definitely has its downsides (look at Python 3). The migration path is often as important (if not more so) than the final destination. As for this particular change, I would question how the benefit (it's unclear what the exact benefit is--b

A Declarative API for Apache Beam

2022-12-14 Thread Robert Bradshaw via dev
While Beam provides powerful APIs for authoring sophisticated data processing pipelines, it often still has too high a barrier for getting started and authoring simple pipelines. Even setting up the environment, installing the dependencies, and setting up the project can be an overwhelming amount o

Re: A Declarative API for Apache Beam

2022-12-15 Thread Robert Bradshaw via dev
On Thu, Dec 15, 2022 at 3:37 AM Steven van Rossum wrote: > > This is great! I developed a similar template a year or two ago as a > reference for a customer to speed up their development process and > unsurprisingly it did speed up their development. > Here's an example of the config layout I ca

Re: Testing Multilanguage Pipelines?

2022-12-28 Thread Robert Bradshaw via dev
On Wed, Dec 28, 2022 at 4:56 AM Danny McCormick via dev wrote: > > > Given the increasing importance of multi language pipelines, it does seem > > that we should expand the capabilities of the DirectRunner or just go all > > in on FlinkRunner for testing and local / small scale development > > +

Re: Testing Multilanguage Pipelines?

2022-12-28 Thread Robert Bradshaw via dev
On Wed, Dec 28, 2022 at 10:09 AM Byron Ellis wrote: > > On Wed, Dec 28, 2022 at 9:49 AM Robert Bradshaw wrote: >> >> On Wed, Dec 28, 2022 at 4:56 AM Danny McCormick via dev >> wrote: >> > >> > > Given the increasing importance of multi language pipelines, it does >> > > seem that we should expa

Re: BigTable reader for Python?

2022-12-29 Thread Robert Bradshaw via dev
On Thu, Jul 28, 2022 at 5:37 PM Chamikara Jayalath via dev wrote: > > On Thu, Jul 28, 2022 at 4:51 PM Lina Mårtensson wrote: >> >> Thanks for the detailed answers! >> >> I totally get the points about development & maintenance cost, and, >> from a user perspective, about getting the performance r

Hacking a Rust SDK

2023-01-06 Thread Robert Bradshaw via dev
Hi all, Welcome to 2023! As last year, the Google Dataflow team is kicking the first week off with a hackathon, and one of the projects proposed this year was to throw together a Rust SDK. If you're interested, you can follow the progress at https://github.com/kennknowles/beam/tree/rust/sdks/rust

Re: Hacking a Rust SDK

2023-01-06 Thread Robert Bradshaw via dev
rk? > > On Fri, Jan 6, 2023 at 11:09 AM Robert Bradshaw via dev > wrote: >> >> Hi all, >> >> Welcome to 2023! As last year, the Google Dataflow team is kicking the >> first week off with a hackathon, and one of the projects proposed this >> year was

Re: BigTable reader for Python?

2023-01-12 Thread Robert Bradshaw via dev
Were you ever able to get the local runner to work? If not, some more context on the errors would be useful. On Tue, Jan 10, 2023 at 10:00 AM Lina Mårtensson wrote: > > Thanks! Moving my DoFn into a new module worked, and that solved the slowness > as well. > I tried importing it in setup() as w

Re: Request for Jira issue creation permission.

2023-01-13 Thread Robert Bradshaw via dev
On Fri, Jan 13, 2023 at 3:54 PM Becket Qin wrote: > > Hi Beam devs, > > A few of us are currently working on Flink runner to migrate it away from the > semi-deprecated DataSet API. Thanks! > Can someone help grant the following Jira IDs the permission to work on Jira > issues? We have migrate

Re: [DISCUSS] Migrate Flink runner to run batch jobs in DataStream API

2023-02-01 Thread Robert Bradshaw via dev
This sounds reasonable to me. One question I have is why a user would prefer to stick with the DataSet API if the DataStream API is available. Would there be any user-visible difference? On Wed, Feb 1, 2023 at 1:11 AM Becket Qin wrote: > > Hi Beam devs, > > I'd like to start a discussion about mi

Re: [DISCUSS] Migrate Flink runner to run batch jobs in DataStream API

2023-02-01 Thread Robert Bradshaw via dev
, > > Jiangjie (Becket) Qin > > > > On Thu, Feb 2, 2023 at 1:23 AM Robert Bradshaw via dev > wrote: >> >> This sounds reasonable to me. One question I have is why a user would >> prefer to stick with the DataSet API if the DataStream API is >> a

Re: OpenJDK8 / OpenJDK11 container deprecation

2023-02-07 Thread Robert Bradshaw via dev
Seems reasonable to me. On Tue, Feb 7, 2023 at 4:19 PM Luke Cwik via user wrote: > > As per [1], the JDK8 and JDK11 containers that Apache Beam uses have stopped > being built and supported since July 2022. I have filed [2] to track the > resolution of this issue. > > Based upon [1], almost eve

Re: A user-deployable Beam Transform Service

2023-02-10 Thread Robert Bradshaw via dev
Thanks. I added some comments to the doc. On Mon, Feb 6, 2023 at 1:33 PM Chamikara Jayalath via dev wrote: > > Hi All, > > Beam PTransforms are currently primarily identified as operations in a > pipeline that perform specific tasks. PTransform implementations were > traditionally linked to spe

Re: [VOTE] Release 2.45.0, Release Candidate #1

2023-02-13 Thread Robert Bradshaw via dev
+1 (binding) Validated release artifacts and signatures and tested a couple of Python pipelines. On Mon, Feb 13, 2023 at 8:57 AM Alexey Romanenko wrote: > > +1 (binding) > > Tested with https://github.com/Talend/beam-samples/ > (Java SDK v8/v11/v17, Spark 3.x runner). > > --- > Alexey > > On 13

Re: new contributor messaging: behaviorbot/welcome

2023-02-21 Thread Robert Bradshaw via dev
On Tue, Feb 21, 2023 at 10:59 AM Kenneth Knowles wrote: > > Agree with Robert here. The human connection is important. Can we have a > behaviorbot that reminds the reviewer to be extra welcoming up front, and > then thankful afterwards, instead? :-) +1 > That said, a bot comment would at least

Re: new contributor messaging: behaviorbot/welcome

2023-02-21 Thread Robert Bradshaw via dev
FWIW, I'm generally in favor of such a bot. I think it really boils down to a concrete proposal of what the content (and triggers) would be. On Tue, Feb 21, 2023 at 1:36 PM Austin Bennett wrote: > > It is fantastic if generally able to address welcoming newcomers manually [ > @Robert Burke ! ] .

Re: [NOTICE] Deprecation Avro classes in "core" and use "extensions/avro" instead for Java SDK

2023-02-22 Thread Robert Bradshaw via dev
Thanks for pushing this through! On Wed, Feb 22, 2023 at 10:38 AM Alexey Romanenko wrote: > > Hi all, > > As a part of migration the Avro-related classes from Java SDK “core” module > to a dedicated extension [1] (as it was discussed here [2] and here [3]), two > important PRs has been merged [

Re: [VOTE] Release 2.46.0, release candidate #1

2023-03-03 Thread Robert Bradshaw via dev
It appears your public key is not published in https://dist.apache.org/repos/dist/release/beam/KEYS . On Fri, Mar 3, 2023 at 8:33 AM Anand Inguva via dev wrote: > > +1 (non-binding) > Tested python wordcount quick start > https://beam.apache.org/get-started/quickstart-py/ on Direct Runner and >

Re: [VOTE] Release 2.46.0, release candidate #1

2023-03-03 Thread Robert Bradshaw via dev
The released artifacts seem to be missing the last commit at https://github.com/apache/beam/commit/c528eab18b32342daed53b750fe330d30c7e5224 . Is this essential to the release, or just useful for validating it? On Fri, Mar 3, 2023 at 11:02 AM Danny McCormick wrote: > > Thanks for calling that out,

Re: [VOTE] Release 2.46.0, release candidate #1

2023-03-03 Thread Robert Bradshaw via dev
+1 (binding). I verified that the artifacts and signatures all look good, all the containers are pushed, and tested some pipelines with a fresh install from one of the Python wheels. On Fri, Mar 3, 2023 at 11:13 AM Danny McCormick wrote: > > > The released artifacts seem to be missing the last c

Re: Requesting review for my pull request

2023-03-09 Thread Robert Bradshaw via dev
Looks like this was taken care of. Thanks, Ritesh! On Wed, Mar 8, 2023 at 8:48 PM Saifuddin Adenwala wrote: > > Greetings ! > Recently I have contributed to the open source apache beam organization and I > wanted to ask for the review of my pull request . > Attaching the link of my pull request

Re: direct runner OOM issue

2023-03-13 Thread Robert Bradshaw via dev
The FnApiRunner is primarily for tiny jobs (development and testing) and holds all the data in memory. You'll likely have to run with a "real" runner to operate over datasets of this size. If you want to run locally, you can pass --runner=FlinkRunner and (assuming you have Java installed) it will r

Re: direct runner OOM issue

2023-03-13 Thread Robert Bradshaw via dev
> > Wilson(Xiaoshuang) Wang > Sr. Software Engineer > > > On Mon, Mar 13, 2023 at 12:13 PM Robert Bradshaw via dev > wrote: >> >> The FnApiRunner is primarily for tiny jobs (development and testing) >> and holds all the data in memory.

Re: Beam Release DockerHub Group

2023-04-17 Thread Robert Bradshaw via dev
I think it'd be good if the intersection between this list and the PMC had cardinality greater than 1. Ahmet might be a good person to keep there. On Mon, Apr 17, 2023 at 9:25 AM Danny McCormick via dev wrote: > Yeah, that is part of the proposal. To be clear, our end state would be a > single g

Re: Beam Release DockerHub Group

2023-04-17 Thread Robert Bradshaw via dev
Well, I don't know that PMC should take precedence over release managers if it comes to that. On Mon, Apr 17, 2023 at 10:11 AM Danny McCormick wrote: > I can ask if we can keep 6 seats instead of 5 (and keep Ahmet in that > seat). If not, my vote would be to stick with the 5 that I suggested, bu

Re: [VOTE] Vendored Dependencies Release

2023-04-17 Thread Robert Bradshaw via dev
+1 On Mon, Apr 17, 2023 at 11:20 AM Chamikara Jayalath via dev < dev@beam.apache.org> wrote: > +1 > > Thanks, > Cham > > On Mon, Apr 17, 2023 at 11:04 AM Kenneth Knowles wrote: > >> +1 >> >> On Fri, Apr 14, 2023 at 1:30 PM Yi Hu via dev >> wrote: >> >>> Please review the release of the followin

Re: [Python SDK] Use pre-released dependencies for Beam python unit testing

2023-04-20 Thread Robert Bradshaw via dev
The main concern here seems to be whether using pre-release candidates would be too disruptive to our workflow. I think this is an easy hypothesis to test out--we can give using prerelease candidates a try, and if that indeed turns out to be a problem we can then do the work of trying to put togeth

Re: Is there any way to set the parallelism of operators like group by, join?

2023-04-21 Thread Robert Bradshaw via dev
+1 to not requiring details like this in the Beam model. There is, however, the question of how to pass such implementation-detail specific hints to a runner that requires them. Generally that's done via ResourceHints or annotations, and while the former seems a good fit it's primarily focused on s

Re: [VOTE] Release 2.47.0, release candidate #1

2023-04-27 Thread Robert Bradshaw via dev
The artifacts and signatures all look good, and I validated a couple of Python pipelines in a fresh install. Assuming all the tests (including the Dataflow ones) pass (modulo the two mentioned above; seems a fair justification to not block on those) I'm +1 (binding) on this release. On Wed, Apr 2

Re: [Proposal] Automate Release Signing

2023-05-03 Thread Robert Bradshaw via dev
If this is acceptable per the release policy, huge +1 to automating this step (and not limited to java artifacts alone). On Wed, May 3, 2023 at 1:14 PM Kenneth Knowles wrote: > > To Robert: Good point. I didn't click through. There's always the possibility > that the two branches of the foundati

Re: [PROPOSAL] Preparing for 2.48.0 Release

2023-05-04 Thread Robert Bradshaw via dev
IIRC, we are supporting Python versions until they are out of support. This would suggest keeping 3.7 in 2.48. (Not that it matters much.) Is there a significant gain in dropping 3.7 support before the cut? On Thu, May 4, 2023 at 8:33 AM Jack McCluskey via dev wrote: > > I'd suggest shooting for

Re: [PROPOSAL] Preparing for 2.48.0 Release

2023-05-05 Thread Robert Bradshaw via dev
On Fri, May 5, 2023 at 6:27 AM Anand Inguva via dev wrote: > > >> Is there a significant gain in dropping 3.7 support before the cut? > > No, I think it is just a matter of how soon we want to do it. Absent a compelling reason otherwise, my view would be to just stick with the statement of droppi

Re: [VOTE] Release 2.47.0, release candidate #3

2023-05-09 Thread Robert Bradshaw via dev
Thanks for catching this. This does seem severe enough that we need to fix it before the release. On Sat, May 6, 2023 at 10:15 PM Chamikara Jayalath via dev < dev@beam.apache.org> wrote: > Seems like Python SDK harness containers built for the current RC are > broken. Please see https://github.co

Re: [VOTE] Release 2.47.0, release candidate #3

2023-05-10 Thread Robert Bradshaw via dev
+1 (binding) Artifacts and signatures look good to me, Python wheel installs cleanly and runs simple pipelines. On Wed, May 10, 2023 at 8:23 AM Bruno Volpato via dev wrote: > A little bit late to this thread since I've seen the voting closed email, > but +1 (non-binding) from me as well. > > Te

Re: [DISCUSS] Idempotent initialization of file systems

2023-05-15 Thread Robert Bradshaw via dev
On Mon, May 15, 2023 at 8:38 AM Moritz Mack wrote: > > Hi all, > > I was just looking into an old issue again, SerializablePipelineOptions > calling FileSystems.setDefaultPipelineOptions on deserialization [1]. This > applies to various runners including Flink and Spark, but not Dataflow as far

Re: [PMC Request] Add gpg key to release keys file

2023-05-23 Thread Robert Bradshaw via dev
Done. On Tue, May 23, 2023 at 7:36 AM Danny McCormick via dev wrote: > Hey everyone, as part of automating our release process (see thread here - > https://lists.apache.org/thread/mw9dbbdjtkqlvs0mmrh452z3jsf68sct), could > a PMC member please add the infra supplied gpg public key to our release

Re: Hierarchical fanout with Beam combiners?

2023-05-26 Thread Robert Bradshaw via dev
Yes, with_hot_key_fanout only performs a single level of fanout. I don't think fanning out more than this has been explored, but I would imagine that for most cases the increased IO would negate most if not all of the benefits. In particular, note that we already do "combiner lifting" to do as muc
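To make "a single level of fanout" concrete, here is a small plain-Python simulation of the two stages. It is only a sketch under stated assumptions: it does not call Beam, the function name `combine_per_key_with_fanout` is hypothetical, the combiner is fixed to a sum, and elements are assigned to sub-keys round-robin, which may differ from what the SDK actually does.

```python
from collections import defaultdict


def combine_per_key_with_fanout(pairs, fanout):
    """Sum values per key in two stages, mimicking one level of fanout."""
    # Stage 1: spread each key over `fanout` sub-keys and pre-combine, so
    # that no single sub-key receives all of a hot key's elements.
    partial = defaultdict(int)
    for index, (key, value) in enumerate(pairs):
        partial[(key, index % fanout)] += value
    # Stage 2: drop the sub-key and combine the (at most `fanout`) partial
    # results for each original key.
    total = defaultdict(int)
    for (key, _subkey), subtotal in partial.items():
        total[key] += subtotal
    return dict(total)
```

In this sketch the final stage for a hot key sees at most `fanout` partial results rather than every element. A second level would repeat stage 1 on those partials, which adds another shuffle for a comparatively small reduction in per-key work.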

Re: Hierarchical fanout with Beam combiners?

2023-05-30 Thread Robert Bradshaw via dev
On Tue, May 30, 2023 at 10:37 AM Kenneth Knowles wrote: > > On Sat, May 27, 2023 at 4:20 PM Stephan Hoyer via dev > wrote: > >> On Fri, May 26, 2023 at 2:59 PM Robert Bradshaw >> wrote: >> >>> Yes, with_hot_key_fanout only performs a single level of fanout. I don't >>> think fanning out more th

Re: Proposal to reduce the steps to make a Java transform portable

2023-05-30 Thread Robert Bradshaw via dev
+1 to this simplification, it's a historical artifact that provides no value. I would love it if we also looked into ways to auto-generate the SchemaTransformProvider (e.g. via introspection if a transform takes a small number of arguments, or uses the standard builder pattern...), ideally with so

Re: Proposal to reduce the steps to make a Java transform portable

2023-05-30 Thread Robert Bradshaw via dev
you have a Schema we can auto-generate > the associated builder on the PTransform? Either way, more DRY. > > On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev < > dev@beam.apache.org> wrote: > >> +1 to this simplification, it's a historical artifact that provides

Re: [VOTE] Release 2.48.0 release candidate #2

2023-05-30 Thread Robert Bradshaw via dev
On Tue, May 30, 2023 at 2:01 PM Ritesh Ghorse via dev wrote: > Thanks Danny and Jack! Dataflow containers are up! > > Only PMC votes count but feel free to test your use cases and vote on this > thread! > While we need at least 3 affirmative PMC votes to formally do a release, it is definitely t

Re: [VOTE] Release 2.48.0 release candidate #2

2023-05-30 Thread Robert Bradshaw via dev
+1 (binding) On Tue, May 30, 2023 at 5:42 PM Robert Bradshaw wrote: > On Tue, May 30, 2023 at 2:01 PM Ritesh Ghorse via dev > wrote: > >> Thanks Danny and Jack! Dataflow containers are up! >> >> Only PMC votes count but feel free to test your use cases and vote on >> this thread! >> > > While w

Re: 2.48.0 Release PMC Finalization

2023-06-02 Thread Robert Bradshaw via dev
I went ahead and took care of this today. Should be good to go. On Fri, Jun 2, 2023 at 10:29 AM Ritesh Ghorse via dev wrote: > The preceding steps before this (blog post, releasing on github, website > updates are done). > Waiting on this to be completed before sending out the official > announc

Re: Ensuring a task does not get executed concurrently

2023-06-12 Thread Robert Bradshaw via dev
If you absolutely cannot tolerate concurrency an external locking mechanism is required. While a distributed system often waits for a work item to fail before trying it, this is not always the case (e.g. backup workers may be scheduled and whoever finishes first is determined to be the successful a

Re: Beam YAML Pipelines Proposal

2023-06-22 Thread Robert Bradshaw via dev
We also presented YAML in the lightning talk session at the Beam summit where it was well received. (No slides, mostly showed examples and talked about the concept.) Another useful doc is a list of proposed projects/improvements: https://s.apache.org/beam-yaml-pipelines-improvements Also, feel fr

Re: Proposal to reduce the steps to make a Java transform portable

2023-06-22 Thread Robert Bradshaw via dev
ava and cross-language. >>>>>> >>>>> >>>>> +1 but I think the hard part today is to convert existing PTransforms >>>>> to be schema-aware transform compatible (for example, change input/output >>>>> types and mak

Re: [DISCUSS] Enable Github Discussions?

2023-07-06 Thread Robert Bradshaw via dev
I'm also -1 on introducing another forum, and concur with Alexey that mailing lists are a (required) deep part of the culture for apache projects. If there's something qualitatively and significantly different about discussions that makes it a better fit, I would consider it. (E.g. IMHO the struct

Re: [beam-starter-typescript]: Missing place to create issue

2023-07-06 Thread Robert Bradshaw via dev
On Wed, Jun 14, 2023 at 2:05 PM Austin Bennett wrote: > A few additional thoughts: > > * @Anyone --> Should each starter repo allow issues? Or, better to file > issues in https://github.com/apache/beam/issues ? > I'm on the fence. It would make sense to allow issues here, but I'm also concerne

Re: [VOTE] Release 2.49.0, release candidate #2

2023-07-13 Thread Robert Bradshaw via dev
+1 (binding) Verified all the artifacts and signatures, spot-checked docker images, and tried a simple pipeline against one of the Python wheels in a fresh install. Thanks for putting this together. On Thu, Jul 13, 2023 at 2:03 AM Jan Lukavský wrote: > +1 (binding) > > Tested Java SDK with Flin

Re: [Feature Proposal] Add ARM Support to Beam SDK Container Images

2023-07-19 Thread Robert Bradshaw via dev
Thanks. Left a few comments on the doc. Looking forward to ARM support. On Tue, Jul 18, 2023 at 3:59 PM Valentyn Tymofieiev via dev < dev@beam.apache.org> wrote: > Hi Celeste, > > Thanks for the proposal and researching the options. Using multi-arch > images seems like a good way to reduce the co

Re: [VOTE] Vendored Dependency guava 32.1.2-jre Release

2023-08-07 Thread Robert Bradshaw via dev
+1 (binding) Verified the signatures and checksums of the artifacts and (somewhat vacuous) source tarball. Also verified the artifact doesn't leak classes outside apache/beam/vendor . It'd be great if we could move this artifact creation and signing to CI, see https://lists.apache.org/thread/y5b3

Re: [Discuss] Get rid of OWNERS files

2023-08-10 Thread Robert Bradshaw via dev
On Tue, Aug 8, 2023 at 9:50 AM Robert Burke wrote: > > Either we keep OWNERS and have the review bot use them, or we remove them and > use the reviews bot config as the single source of truth. +1. And I don't see any reason we're going to be any better at keeping them up to date than we have in

Re: Automatic signing of releases

2023-08-24 Thread Robert Bradshaw via dev
Yes, this is a great step forward (both the automation, and the clarified guidance). Hopefully we can automate virtually everything but the voting away. On Thu, Aug 24, 2023 at 8:56 AM Ismaël Mejía wrote: > Ah excellent, I was not aware it was the case, great to know we are in > advance ! > > On T

Re: [Request for Feedback] Swift SDK Prototype

2023-08-24 Thread Robert Bradshaw via dev
Ah, I see. Yeah, I've thought about using an iterable for the whole bundle rather than start/finish bundle callbacks, but one of the questions is how that would impact implicit passing of the timestamp (and other) metadata from input elements to output elements. (You can of course attach the metad

Re: [Request for Feedback] Swift SDK Prototype

2023-08-24 Thread Robert Bradshaw via dev
I would like to figure out a way to get the stream-y interface to work, as I think it's more natural overall. One hypothesis is that if any elements are carried over loop iterations, there will likely be some that are carried over beyond the loop (after all the callee doesn't know when the loop is

Re: [PROPOSAL] Design Doc template for PTransforms

2023-08-24 Thread Robert Bradshaw via dev
+1 I'd love for this information to be accessible programmatically as well (both directions: extracting parameters from a transform and constructing a transform from parameters). Making this pattern easy could encourage compliance. On Thu, Aug 24, 2023 at 8:54 AM Svetak Sundhar via dev wrote: >

Re: [Request for Feedback] Swift SDK Prototype

2023-08-24 Thread Robert Bradshaw via dev
On Thu, Aug 24, 2023 at 12:58 PM Chamikara Jayalath wrote: > > > On Thu, Aug 24, 2023 at 12:27 PM Robert Bradshaw > wrote: > >> I would like to figure out a way to get the stream-y interface to work, >> as I think it's more natural overall. >> >> One hypothesis is that if any elements are carrie

Re: [VOTE] Release 2.50.0, release candidate #2

2023-08-29 Thread Robert Bradshaw via dev
+1 (binding) Verified the artifacts and signatures, they all look good. Tried a simple Python pipeline in a fresh install that worked fine. Thanks for putting this together. On Thu, Aug 24, 2023 at 4:09 PM Robert Burke wrote: > Hi everyone, > Please review and vote on the release candidate #2

Re: Contribution of Asgarde: Error Handling for Beam?

2023-09-05 Thread Robert Bradshaw via dev
I think this is a great library. I'm on the fence about whether it makes sense to include with Beam proper vs. be a library that builds on top of Beam. (Would there be benefits of tighter integration? There is the maintenance/loss of governance issue.) I am definitely not on the side that the entire B

Re: [Bug?] Combiner components don't inherit annotations of source CombineByKey

2023-09-15 Thread Robert Bradshaw via dev
Could you clarify what you mean by annotating the transform? On Fri, Sep 15, 2023 at 9:57 AM Joey Tran wrote: > While implementing a runner, we tried annotating a CombineByKey transform. > I noticed that the annotations for the CBK are then lost in the fusion > optimization stage when the CBK is

Re: [Bug?] Combiner components don't inherit annotations of source CombineByKey

2023-09-15 Thread Robert Bradshaw via dev
Very clear now :). Thanks for the fix; it looks good. On Fri, Sep 15, 2023 at 5:07 PM Joey Tran wrote: > Ended up just filing a PR [1] > > [1] https://github.com/apache/beam/pull/28489 > > > On Fri, Sep 15, 2023 at 12:51 PM Joey Tran > wrote: > >> While implementing a runner, we tried annotatin

Re: User-facing website vs. contributor-facing website

2023-09-21 Thread Robert Bradshaw via dev
TBH, I'm not a huge fan of the wikis either. My ideal flow would be something like g3doc, and markdown files in github do a reasonable enough job emulating that. (I don't think the overhead of having to do a PR for small edits like typos is onerous, as those are super easy reviews to do as well...)

Re: Runner Bundling Strategies

2023-09-21 Thread Robert Bradshaw via dev
Dataflow uses a work-stealing protocol. The FnAPI has a protocol to ask the worker to stop at a certain element that has already been sent. On Thu, Sep 21, 2023 at 4:24 PM Joey Tran wrote: > Writing a runner and the first strategy for determining bundling size was > to just start with a bundle s

Re: Runner Bundling Strategies

2023-09-22 Thread Robert Bradshaw via dev
On Fri, Sep 22, 2023 at 7:23 AM Byron Ellis via dev wrote: > I've actually wondered about this specifically for streaming... if you're > writing a pipeline there it seems like you're often going to want to put > high fixed cost things like database connections even outside of the bundle > setup.

Re: User-facing website vs. contributor-facing website

2023-09-22 Thread Robert Bradshaw via dev
Related, I stumbled >> across this the other day: https://github.com/apache/beam-site which >> appears to be unused which could probably even have different review and >> committer sets if we wanted? >> >> On Thu, Sep 21, 2023 at 3:19 PM Robert Bradshaw via dev < &

Re: Runner Bundling Strategies

2023-09-22 Thread Robert Bradshaw via dev
On Fri, Sep 22, 2023 at 10:58 AM Jan Lukavský wrote: > On 9/22/23 18:07, Robert Bradshaw via dev wrote: > > On Fri, Sep 22, 2023 at 7:23 AM Byron Ellis via dev > wrote: > >> I've actually wondered about this specifically for streaming... if you're >> w

[YAML] Declarative beam (aka YAML) coordination

2023-09-25 Thread Robert Bradshaw via dev
Given the interest in the YAML work by multiple parties, we put together https://s.apache.org/beam-yaml-contribute to more easily coordinate on this effort. Nothing that surprising--we're going to continue using the standard lists, github, etc.--but it should help for folks who want to get started.

Re: Runner Bundling Strategies

2023-09-26 Thread Robert Bradshaw via dev
nction >>> <https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/streaming/api/functions/sink/TwoPhaseCommitSinkFunction.html> >>> which >>> waits for a checkpoint. In Beam, this is the reason we introduced >>> RequiresSt

Re: Runner Bundling Strategies

2023-09-27 Thread Robert Bradshaw via dev
>> elements are not considered "processed" until FinishBundle. >>>>> >>>>> You are right about Flink. In many cases this is fine - if Flink rolls >>>>> back to the last checkpoint, the watermark will also roll back, and >>>

Re: [QUESTION] Why no auto labels?

2023-10-04 Thread Robert Bradshaw via dev
BeamJava and BeamPython have the exact same behavior: transform names within must be distinct [1]. This is because we do not necessarily know at pipeline construction time if the pipeline will be streaming or batch, or if it will be updated in the future, so the decision was made to impose this res

Re: [VOTE] Release 2.51.0, release candidate #1

2023-10-04 Thread Robert Bradshaw via dev
+1 (binding) Verified artifacts and signatures and tested a simple Python pipeline in a fresh environment with a wheel. On Wed, Oct 4, 2023 at 8:05 AM Ritesh Ghorse via dev wrote: > +1 (non-binding) validated Go SDK quickstart and Python Streaming > quickstart on Dataflow runner. > > Thanks! >

Re: [QUESTION] Why no auto labels?

2023-10-05 Thread Robert Bradshaw via dev
Huh. This used to be a hard error in Java, but I guess it's togglable with an option now. We should probably add the option to toggle Python too. (Unclear what the default should be, but this probably ties into re-thinking how pipeline update should work.) On Thu, Oct 5, 2023 at 4:58 AM Joey Tran

[YAML] Fileio sink parameterization (streaming, sharding, and naming)

2023-10-09 Thread Robert Bradshaw via dev
Currently the various file writing configurations take a single parameter, path, which indicates where the (sharded) output should be placed. In other words, one can write something like pipeline: ... sink: type: WriteToParquet config: path: /beam/filesytem/dest and

Re: [YAML] Fileio sink parameterization (streaming, sharding, and naming)

2023-10-09 Thread Robert Bradshaw via dev
.On Mon, Oct 9, 2023 at 1:11 PM Robert Burke wrote: > I'll note that the file "Writes" in the Go SDK are currently an unscalable > antipattern, because of this exact question. > > Aside from carefully examining other SDKs it's not clear how one authors > a reliable, automatically shardable, wind

Re: [YAML] Fileio sink parameterization (streaming, sharding, and naming)

2023-10-09 Thread Robert Bradshaw via dev
ing pattern, one could have a full filepattern that includes format parameters for dynamically computed bits as well as the shard number, windowing info, etc. (There are pros and cons to this.) > On Mon, Oct 9, 2023 at 12:37 PM Robert Bradshaw via dev < > dev@beam.apache.org> wrote: > >>

Re: [YAML] Fileio sink parameterization (streaming, sharding, and naming)

2023-10-10 Thread Robert Bradshaw via dev
"real" SDKs. >>> >>> For dynamic destinations, I think just making the "path" component >>> support a lambda that is parameterized by the input should be adequate >>> since this allows customers to direct files written to different >>> destination

Re: [YAML] Fileio sink parameterization (streaming, sharding, and naming)

2023-10-10 Thread Robert Bradshaw via dev
>> >>> For dynamic destinations, I think just making the "path" component >>> support a lambda that is parameterized by the input should be adequate >>> since this allows customers to direct files written to different >>> destination directories. >>>

Re: [YAML] Fileio sink parameterization (streaming, sharding, and naming)

2023-10-10 Thread Robert Bradshaw via dev
ould be the best way to specify a lambda here though. > Maybe a regex or the name of a Python callable? > I'd rather not require Python for a pure Java pipeline, but some kind of a pattern template may be sufficient here. > On Mon, Oct 9, 2023 at 2:06 PM Robert Bradshaw via dev <

Re: [YAML] Fileio sink parameterization (streaming, sharding, and naming)

2023-10-10 Thread Robert Bradshaw via dev
hould be adequate >>>> since this allows customers to direct files written to different >>>> destination directories. >>>> >>>> sink: >>>> type: WriteToParquet >>>> config: >>>> path: >>

Re: [YAML] Fileio sink parameterization (streaming, sharding, and naming)

2023-10-10 Thread Robert Bradshaw via dev
>> type: WriteToParquet >>> config: >>> path: >>> prefix: >>> suffix: >>> >>> I'm not sure what would be the best way to specify a lambda here though. >>> Maybe a regex or the name of a Python

Re: [YAML] Fileio sink parameterization (streaming, sharding, and naming)

2023-10-10 Thread Robert Bradshaw via dev
>>>> destination directories. >>>> >>>> sink: >>>> type: WriteToParquet >>>> config: >>>> path: >>>> prefix: >>>> suffix: >>>> >>>> I'

Re: [YAML] Fileio sink parameterization (streaming, sharding, and naming)

2023-10-10 Thread Robert Bradshaw via dev
ty of use-cases I believe. Fully >>>>>> customizing the file pattern sounds like a more advanced use case that >>>>>> can >>>>>> be left for "real" SDKs. >>>>>> >>>>> >>>>> Yea, we don

Re: [QUESTION] Why no auto labels?

2023-10-10 Thread Robert Bradshaw via dev
I would definitely support a PR making this an option. Changing the default would be a rather big change that would require more thought. On Tue, Oct 10, 2023 at 4:24 PM Joey Tran wrote: > Bump on this. Sorry to pester - I'm trying to get a few teams to adopt > Apache Beam at my company and I'm

Re: Proposal for pyproject.toml Support in Apache Beam Python

2023-10-12 Thread Robert Bradshaw via dev
Does this change any development practices? E.g. if I clone the repo, I'm assuming I couldn't run "setup.py test" anymore. What about the generated files (like protos, or the yaml definitions copied from other parts of the repo)? On Thu, Oct 12, 2023 at 12:27 PM Anand Inguva via dev wrote: > The

Re: Proposal for pyproject.toml Support in Apache Beam Python

2023-10-12 Thread Robert Bradshaw via dev
On Thu, Oct 12, 2023 at 2:04 PM Anand Inguva wrote: > I am in the process of updating the documentation at > https://cwiki.apache.org/confluence/display/BEAM/Python+Tips related to > setup.py/pyproject.toml changes, but yes you can't call setup.py directly > because it might fail due to the lack

Re: [Question] Read Parquet Schema from S3 Directory

2023-10-12 Thread Robert Bradshaw via dev
You'll probably need to resolve "s3a:///*.parquet" out into a concrete non-glob filepattern to inspect it this way. Presumably any individual shard will do. match and open from https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/FileSystems.html may be useful. On Wed, Oct 11, 2

Re: [YAML] Fileio sink parameterization (streaming, sharding, and naming)

2023-10-12 Thread Robert Bradshaw via dev
ame and > (file extension, for example) which can be useful for some downstream > use-cases. Rest of the filename will be filled out by the SDK (window, pane > etc.) to make sure that the files written by different workers do not > conflict. > > Thanks, > Cham > > >>

Re: [QUESTION] Why no auto labels?

2023-10-13 Thread Robert Bradshaw via dev
Thanks for the PR. I think we should follow Java and allow non-unique labels, but not provide automatic uniquification. In particular, the danger of using a counter is that one can get accidental (and potentially hard to check) off-by-one collisions. As a concrete example, imagine one partitions a
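
The off-by-one hazard mentioned in this message can be sketched in plain Python. This is only an illustration under stated assumptions: `label_map`, the `_1`/`_2` suffix scheme, and the step identities such as `branch_a` are hypothetical and are not Beam's API or the scheme proposed in the PR.

```python
from collections import Counter


def label_map(steps):
    """Hypothetical counter-based uniquifier (not Beam's implementation).

    `steps` is a list of (name, identity) pairs in construction order.
    Repeated names get numeric suffixes; returns {label: identity}.
    """
    seen = Counter()
    labels = {}
    for name, identity in steps:
        # First use keeps the bare name; later uses get _1, _2, ...
        label = name if seen[name] == 0 else f"{name}_{seen[name]}"
        seen[name] += 1
        labels[label] = identity
    return labels


# Version 1: two branches both apply a transform named "Process".
v1 = label_map([
    ("Read", "read"),
    ("Process", "branch_a"),
    ("Process", "branch_b"),
    ("Write", "write"),
])

# Version 2: a new "Process" step is added ahead of the existing ones.
v2 = label_map([
    ("Read", "read"),
    ("Process", "new_step"),
    ("Process", "branch_a"),
    ("Process", "branch_b"),
    ("Write", "write"),
])

# Labels that exist in both versions but now point at a different step.
mismatched = {
    label: (v1[label], v2[label])
    for label in v1
    if label in v2 and v1[label] != v2[label]
}
print(mismatched)
```

Every label from the first version still exists in the second, so nothing flags an error, yet `Process` and `Process_1` have silently shifted to different steps. Anything keyed on those labels (state carried across a pipeline update, for instance) would be matched to the wrong transform.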

Re: [QUESTION] Why no auto labels?

2023-10-13 Thread Robert Bradshaw via dev
On Fri, Oct 13, 2023 at 10:08 AM Joey Tran wrote: > That makes sense. Would you suggest the new option simply suppress the > RuntimeError and use the non-unique label? > Yes. (Or, rather, not raise it.) > Are there places on the SDK side that expect unique labels? Or in > non-updateable runner
