This is a great idea. I would like to approach this from the
perspective of making it easy to provide a catalog of well-defined
transforms for use in expansion services from typical SDKs and also
elsewhere (e.g. for documentation purposes, GUIs, etc.). Ideally
everything about what a transform is (i
+1 (binding).
I verified the release artifacts and signatures, and ran a couple of
simple Python pipelines.
On Mon, Aug 15, 2022 at 12:40 PM Chamikara Jayalath via dev
wrote:
>
>
>
> On Mon, Aug 15, 2022 at 11:37 AM Kiley Sok wrote:
>>
>> Thanks everyone!
>>
>> @Chamikara Jayalath The Spark iss
Thanks. I added some comments to the doc.
I agree with Brian that it makes sense to figure out how this
interacts with batched DoFns, as we'll want to migrate to that.
(Perhaps they're already ready to migrate to as a first step?)
On Fri, Aug 12, 2022 at 1:03 PM Brian Hulette via dev
wrote:
>
>
On Wed, Aug 17, 2022 at 5:15 AM Alexey Romanenko
wrote:
>
> Good point about unanswered SO questions. +1 that we need to improve a
> situation there.
>
> Yes, we may try to stream them to a new dedicated list but it will require
> people here to subscribe to and check regularly one more list whi
If one of your inputs fits into memory, using side inputs is
definitely the way to go. If neither side fits into memory, the cross
product may be prohibitively large to compute even on a distributed
computing platform (a billion times a billion is big, though I suppose
one may hit memory limits wit
On Mon, Sep 19, 2022 at 1:53 PM Stephan Hoyer wrote:
>>
>> > > My team has an internal implementation of a CartesianProduct transform,
>> > > based on using hashing to split a pcollection into a finite number of
>> > > groups and CoGroupByKey.
>> >
>> > Could this be contributed to Beam?
>
>
> I
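The hashing approach quoted above can be sketched in plain Python. This is not the internal implementation mentioned in the thread, only an illustration under assumptions: the names shard_of and sharded_cartesian_product are hypothetical, and the grouping that CoGroupByKey would do across workers is modelled here with in-memory dictionaries.

```python
import itertools
import zlib
from collections import defaultdict


def shard_of(element, num_shards):
    """Deterministically assign an element to one of num_shards groups."""
    return zlib.crc32(repr(element).encode("utf-8")) % num_shards


def sharded_cartesian_product(left, right, num_shards=4):
    """Yield every (x, y) pair by joining hashed groups of both inputs."""
    # Step 1: split each input into a finite number of hashed groups.
    left_groups = defaultdict(list)
    right_groups = defaultdict(list)
    for x in left:
        left_groups[shard_of(x, num_shards)].append(x)
    for y in right:
        right_groups[shard_of(y, num_shards)].append(y)

    # Step 2: every (i, j) group pair is an independent unit of work.
    # In a distributed setting each pair would be one join key, so no
    # single worker needs to hold a whole input in memory.
    for i, j in itertools.product(range(num_shards), repeat=2):
        for x in left_groups.get(i, []):
            for y in right_groups.get(j, []):
                yield (x, y)
```

Each element lands in exactly one group per side, so every pair is produced exactly once. The number of groups trades off per-key memory against the number of keys (num_shards squared).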
+1 (binding)
Validated release artifacts and signatures. Tested a Python pipeline
on a clean install.
On Thu, Oct 13, 2022 at 1:22 PM Ritesh Ghorse via dev
wrote:
>
> +1 (non-binding)
> Validated Go SDK Quickstart on Direct and Dataflow runner.
>
> Thanks,
> Ritesh Ghorse
>
> On Thu, Oct 13, 202
I've got one I use in Python too (including drilling down into
composites). It's a portable runner. I should clean it up and make it
generally available.
On Mon, Nov 7, 2022 at 9:25 AM Robert Burke wrote:
>
> The Go SDK has a "dot" runner to visualize pipeline protos as a dot graph but
> it's it
OK, I put this together as a PR:
https://github.com/apache/beam/pull/24037 (I think I've written this
beam proto -> dot code half a dozen times by now...it'll be good to
have it checked in and reusable.)
On Mon, Nov 7, 2022 at 11:06 AM Matt Casters wrote:
>
> Apache Hop pipelines are actually jus
Thanks for looking into this and the careful writeup. I've read the
design doc and it looks great, but have a couple of questions.
(1) Why did you decide on having a single top-level FileWrite
transform whose config is ([common_parameters], [xml-params],
[csv-params], ...) rather than separate sch
Saving up all the breaking changes until a major release definitely
has its downsides (look at Python 3). The migration path is often as
important as (if not more so than) the final destination.
As for this particular change, I would question how the benefit (it's
unclear what the exact benefit is--b
While Beam provides powerful APIs for authoring sophisticated data
processing pipelines, it often still has too high a barrier for
getting started and authoring simple pipelines. Even setting up the
environment, installing the dependencies, and setting up the project
can be an overwhelming amount o
On Thu, Dec 15, 2022 at 3:37 AM Steven van Rossum
wrote:
>
> This is great! I developed a similar template a year or two ago as a
> reference for a customer to speed up their development process and
> unsurprisingly it did speed up their development.
> Here's an example of the config layout I ca
On Wed, Dec 28, 2022 at 4:56 AM Danny McCormick via dev
wrote:
>
> > Given the increasing importance of multi language pipelines, it does seem
> > that we should expand the capabilities of the DirectRunner or just go all
> > in on FlinkRunner for testing and local / small scale development
>
> +
On Wed, Dec 28, 2022 at 10:09 AM Byron Ellis wrote:
>
> On Wed, Dec 28, 2022 at 9:49 AM Robert Bradshaw wrote:
>>
>> On Wed, Dec 28, 2022 at 4:56 AM Danny McCormick via dev
>> wrote:
>> >
>> > > Given the increasing importance of multi language pipelines, it does
>> > > seem that we should expa
On Thu, Jul 28, 2022 at 5:37 PM Chamikara Jayalath via dev
wrote:
>
> On Thu, Jul 28, 2022 at 4:51 PM Lina Mårtensson wrote:
>>
>> Thanks for the detailed answers!
>>
>> I totally get the points about development & maintenance cost, and,
>> from a user perspective, about getting the performance r
Hi all,
Welcome to 2023! As last year, the Google Dataflow team is kicking the
first week off with a hackathon, and one of the projects proposed this
year was to throw together a Rust SDK. If you're interested, you can
follow the progress at
https://github.com/kennknowles/beam/tree/rust/sdks/rust
rk?
>
> On Fri, Jan 6, 2023 at 11:09 AM Robert Bradshaw via dev
> wrote:
>>
>> Hi all,
>>
>> Welcome to 2023! As last year, the Google Dataflow team is kicking the
>> first week off with a hackathon, and one of the projects proposed this
>> year was
Were you ever able to get the local runner to work? If not, some more
context on the errors would be useful.
On Tue, Jan 10, 2023 at 10:00 AM Lina Mårtensson wrote:
>
> Thanks! Moving my DoFn into a new module worked, and that solved the slowness
> as well.
> I tried importing it in setup() as w
On Fri, Jan 13, 2023 at 3:54 PM Becket Qin wrote:
>
> Hi Beam devs,
>
> A few of us are currently working on Flink runner to migrate it away from the
> semi-deprecated DataSet API.
Thanks!
> Can someone help grant the following Jira IDs the permission to work on Jira
> issues?
We have migrate
This sounds reasonable to me. One question I have is why a user would
prefer to stick with the DataSet API if the DataStream API is
available. Would there be any user-visible difference?
On Wed, Feb 1, 2023 at 1:11 AM Becket Qin wrote:
>
> Hi Beam devs,
>
> I'd like to start a discussion about mi
,
>
> Jiangjie (Becket) Qin
>
>
>
> On Thu, Feb 2, 2023 at 1:23 AM Robert Bradshaw via dev
> wrote:
>>
>> This sounds reasonable to me. One question I have is why a user would
>> prefer to stick with the DataSet API if the DataStream API is
>> a
Seems reasonable to me.
On Tue, Feb 7, 2023 at 4:19 PM Luke Cwik via user wrote:
>
> As per [1], the JDK8 and JDK11 containers that Apache Beam uses have stopped
> being built and supported since July 2022. I have filed [2] to track the
> resolution of this issue.
>
> Based upon [1], almost eve
Thanks. I added some comments to the doc.
On Mon, Feb 6, 2023 at 1:33 PM Chamikara Jayalath via dev
wrote:
>
> Hi All,
>
> Beam PTransforms are currently primarily identified as operations in a
> pipeline that perform specific tasks. PTransform implementations were
> traditionally linked to spe
+1 (binding)
Validated release artifacts and signatures and tested a couple of
Python pipelines.
On Mon, Feb 13, 2023 at 8:57 AM Alexey Romanenko
wrote:
>
> +1 (binding)
>
> Tested with https://github.com/Talend/beam-samples/
> (Java SDK v8/v11/v17, Spark 3.x runner).
>
> ---
> Alexey
>
> On 13
On Tue, Feb 21, 2023 at 10:59 AM Kenneth Knowles wrote:
>
> Agree with Robert here. The human connection is important. Can we have a
> behaviorbot that reminds the reviewer to be extra welcoming up front, and
> then thankful afterwards, instead? :-)
+1
> That said, a bot comment would at least
FWIW, I'm generally in favor of such a bot. I think it really boils
down to a concrete proposal of what the content (and triggers) would
be.
On Tue, Feb 21, 2023 at 1:36 PM Austin Bennett
wrote:
>
> It is fantastic if generally able to address welcoming newcomers manually [
> @Robert Burke ! ] .
Thanks for pushing this through!
On Wed, Feb 22, 2023 at 10:38 AM Alexey Romanenko
wrote:
>
> Hi all,
>
> As a part of migration the Avro-related classes from Java SDK “core” module
> to a dedicated extension [1] (as it was discussed here [2] and here [3]), two
> important PRs has been merged [
It appears your public key is not published in
https://dist.apache.org/repos/dist/release/beam/KEYS .
On Fri, Mar 3, 2023 at 8:33 AM Anand Inguva via dev wrote:
>
> +1 (non-binding)
> Tested python wordcount quick start
> https://beam.apache.org/get-started/quickstart-py/ on Direct Runner and
>
The released artifacts seem to be missing the last commit at
https://github.com/apache/beam/commit/c528eab18b32342daed53b750fe330d30c7e5224
. Is this essential to the release, or just useful for validating it?
On Fri, Mar 3, 2023 at 11:02 AM Danny McCormick
wrote:
>
> Thanks for calling that out,
+1 (binding).
I verified that the artifacts and signatures all look good, all the
containers are pushed, and tested some pipelines with a fresh install
from one of the Python wheels.
On Fri, Mar 3, 2023 at 11:13 AM Danny McCormick
wrote:
>
> > The released artifacts seem to be missing the last c
Looks like this was taken care of. Thanks, Ritesh!
On Wed, Mar 8, 2023 at 8:48 PM Saifuddin Adenwala wrote:
>
> Greetings !
> Recently I have contributed to the open source apache beam organization and I
> wanted to ask for the review of my pull request .
> Attaching the link of my pull request
The FnApiRunner is primarily for tiny jobs (development and testing)
and holds all the data in memory. You'll likely have to run with a
"real" runner to operate over datasets of this size. If you want to
run locally, you can pass --runner=FlinkRunner and (assuming you have
Java installed) it will r
>
> Wilson(Xiaoshuang) Wang
> Sr. Software Engineer
>
>
> On Mon, Mar 13, 2023 at 12:13 PM Robert Bradshaw via dev
> wrote:
>>
>> The FnApiRunner is primarily for tiny jobs (development and testing)
>> and holds all the data in memory.
I think it'd be good if the intersection between this list and the PMC had
cardinality greater than 1. Ahmet might be a good person to keep there.
On Mon, Apr 17, 2023 at 9:25 AM Danny McCormick via dev
wrote:
> Yeah, that is part of the proposal. To be clear, our end state would be a
> single g
Well, I don't know that PMC should take precedence over release managers if
it comes to that.
On Mon, Apr 17, 2023 at 10:11 AM Danny McCormick
wrote:
> I can ask if we can keep 6 seats instead of 5 (and keep Ahmet in that
> seat). If not, my vote would be to stick with the 5 that I suggested, bu
+1
On Mon, Apr 17, 2023 at 11:20 AM Chamikara Jayalath via dev <
dev@beam.apache.org> wrote:
> +1
>
> Thanks,
> Cham
>
> On Mon, Apr 17, 2023 at 11:04 AM Kenneth Knowles wrote:
>
>> +1
>>
>> On Fri, Apr 14, 2023 at 1:30 PM Yi Hu via dev
>> wrote:
>>
>>> Please review the release of the followin
The main concern here seems to be whether using pre-release candidates
would be too disruptive to our workflow. I think this is an easy
hypothesis to test out--we can give using prerelease candidates a try, and
if that indeed turns out to be a problem we can then do the work of
trying to put togeth
+1 to not requiring details like this in the Beam model. There is, however,
the question of how to pass such implementation-detail specific hints to a
runner that requires them. Generally that's done via ResourceHints or
annotations, and while the former seems a good fit it's primarily focused
on s
The artifacts and signatures all look good, and I validated a couple of
Python pipelines in a fresh install.
Assuming all the tests (including the Dataflow ones) pass (modulo the two
mentioned above; seems a fair justification to not block on those) I'm +1
(binding) on this release.
On Wed, Apr 2
If this is acceptable per the release policy, huge +1 to automating
this step (and not limited to java artifacts alone).
On Wed, May 3, 2023 at 1:14 PM Kenneth Knowles wrote:
>
> To Robert: Good point. I didn't click through. There's always the possibility
> that the two branches of the foundati
IIRC, we are supporting Python versions until they are out of support.
This would suggest keeping 3.7 in 2.48. (Not that it matters much.) Is
there a significant gain in dropping 3.7 support before the cut?
On Thu, May 4, 2023 at 8:33 AM Jack McCluskey via dev
wrote:
>
> I'd suggest shooting for
On Fri, May 5, 2023 at 6:27 AM Anand Inguva via dev wrote:
>
> >> Is there a significant gain in dropping 3.7 support before the cut?
>
> No, I think it is just a matter of how soon we want to do it.
Absent a compelling reason otherwise, my view would be to just stick
with the statement of droppi
Thanks for catching this. This does seem severe enough that we need to fix
it before the release.
On Sat, May 6, 2023 at 10:15 PM Chamikara Jayalath via dev <
dev@beam.apache.org> wrote:
> Seems like Python SDK harness containers built for the current RC are
> broken. Please see https://github.co
+1 (binding)
Artifacts and signatures look good to me, Python wheel installs cleanly and
runs simple pipelines.
On Wed, May 10, 2023 at 8:23 AM Bruno Volpato via dev
wrote:
> A little bit late to this thread since I've seen the voting closed email,
> but +1 (non-binding) from me as well.
>
> Te
On Mon, May 15, 2023 at 8:38 AM Moritz Mack wrote:
>
> Hi all,
>
> I was just looking into an old issue again, SerializablePipelineOptions
> calling FileSystems.setDefaultPipelineOptions on deserialization [1]. This
> applies to various runners including Flink and Spark, but not Dataflow as far
Done.
On Tue, May 23, 2023 at 7:36 AM Danny McCormick via dev
wrote:
> Hey everyone, as part of automating our release process (see thread here -
> https://lists.apache.org/thread/mw9dbbdjtkqlvs0mmrh452z3jsf68sct), could
> a PMC member please add the infra supplied gpg public key to our release
Yes, with_hot_key_fanout only performs a single level of fanout. I don't
think fanning out more than this has been explored, but I would imagine
that for most cases the increased IO would negate most if not all of the
benefits.
In particular, note that we already do "combiner lifting" to do as muc
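A single level of fanout can be illustrated in plain Python. This is only a sketch of the idea, not Beam's implementation: the function name combine_with_fanout is hypothetical, and assigning values to sub-keys round-robin is an assumption made to keep the example deterministic.

```python
from collections import defaultdict


def combine_with_fanout(pairs, combine_fn, fanout):
    """Combine values per key in two stages, with one level of fanout."""
    # Stage 1: spread each key's values over `fanout` sub-keys and
    # pre-combine each sub-key independently. For a hot key this splits
    # the work into up to `fanout` parallel pieces.
    partial = defaultdict(list)
    for index, (key, value) in enumerate(pairs):
        partial[(key, index % fanout)].append(value)
    pre_combined = {
        sub_key: combine_fn(values) for sub_key, values in partial.items()
    }

    # Stage 2: drop the sub-key and combine the partial results per key.
    # This final step still sees at most `fanout` inputs per key.
    by_key = defaultdict(list)
    for (key, _), value in pre_combined.items():
        by_key[key].append(value)
    return {key: combine_fn(values) for key, values in by_key.items()}
```

This only works when combine_fn is associative and commutative (sum, max, and so on). Adding a second level would insert another shuffle between the two stages, which is the extra IO cost mentioned above.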
On Tue, May 30, 2023 at 10:37 AM Kenneth Knowles wrote:
>
> On Sat, May 27, 2023 at 4:20 PM Stephan Hoyer via dev
> wrote:
>
>> On Fri, May 26, 2023 at 2:59 PM Robert Bradshaw
>> wrote:
>>
>>> Yes, with_hot_key_fanout only performs a single level of fanout. I don't
>>> think fanning out more th
+1 to this simplification, it's a historical artifact that provides no
value.
I would love it if we also looked into ways to auto-generate the
SchemaTransformProvider (e.g. via introspection if a transform takes a
small number of arguments, or uses the standard builder pattern...),
ideally with so
you have a Schema we can auto-generate
> the associated builder on the PTransform? Either way, more DRY.
>
> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> +1 to this simplification, it's a historical artifact that provides
On Tue, May 30, 2023 at 2:01 PM Ritesh Ghorse via dev
wrote:
> Thanks Danny and Jack! Dataflow containers are up!
>
> Only PMC votes count but feel free to test your use cases and vote on this
> thread!
>
While we need at least 3 affirmative PMC votes to formally do a release, it
is definitely t
+1 (binding)
On Tue, May 30, 2023 at 5:42 PM Robert Bradshaw wrote:
> On Tue, May 30, 2023 at 2:01 PM Ritesh Ghorse via dev
> wrote:
>
>> Thanks Danny and Jack! Dataflow containers are up!
>>
>> Only PMC votes count but feel free to test your use cases and vote on
>> this thread!
>>
>
> While w
I went ahead and took care of this today. Should be good to go.
On Fri, Jun 2, 2023 at 10:29 AM Ritesh Ghorse via dev
wrote:
> The preceding steps before this (blog post, releasing on github, website
> updates are done).
> Waiting on this to be completed before sending out the official
> announc
If you absolutely cannot tolerate concurrency an external locking mechanism
is required. While a distributed system often waits for a work item to fail
before trying it, this is not always the case (e.g. backup workers may be
scheduled and whoever finishes first is determined to be the successful
a
We also presented YAML in the lightning talk session at the Beam summit
where it was well received. (No slides, mostly showed examples and talked
about the concept.)
Another useful doc is a list of proposed projects/improvements:
https://s.apache.org/beam-yaml-pipelines-improvements
Also, feel fr
ava and cross-language.
>>>>>>
>>>>>
>>>>> +1 but I think the hard part today is to convert existing PTransforms
>>>>> to be schema-aware transform compatible (for example, change input/output
>>>>> types and mak
I'm also -1 on introducing another forum, and concur with Alexey that
mailing lists are a (required) deep part of the culture for apache
projects.
If there's something qualitatively and significantly different about
discussions that makes it a better fit, I would consider it. (E.g. IMHO the
struct
On Wed, Jun 14, 2023 at 2:05 PM Austin Bennett wrote:
> A few additional thoughts:
>
> * @Anyone --> Should each starter repo allow issues? Or, better to file
> issues in https://github.com/apache/beam/issues ?
>
I'm on the fence. It would make sense to allow issues here, but I'm also
concerne
+1 (binding)
Verified all the artifacts and signatures, spot-checked docker images, and
tried a simple pipeline against one of the Python wheels in a fresh
install. Thanks for putting this together.
On Thu, Jul 13, 2023 at 2:03 AM Jan Lukavský wrote:
> +1 (binding)
>
> Tested Java SDK with Flin
Thanks. Left a few comments on the doc. Looking forward to ARM support.
On Tue, Jul 18, 2023 at 3:59 PM Valentyn Tymofieiev via dev <
dev@beam.apache.org> wrote:
> Hi Celeste,
>
> Thanks for the proposal and researching the options. Using multi-arch
> images seems like a good way to reduce the co
+1 (binding)
Verified the signatures and checksums of the artifacts and (somewhat
vacuous) source tarball. Also verified the artifact doesn't leak classes
outside apache/beam/vendor .
It'd be great if we could move this artifact creation and signing to CI,
see https://lists.apache.org/thread/y5b3
On Tue, Aug 8, 2023 at 9:50 AM Robert Burke wrote:
>
> Either we keep OWNERS and have the review bot use them, or we remove them and
> use the reviews bot config as the single source of truth.
+1. And I don't see any reason we're going to be any better at keeping
them up to date than we have in
Yes, this is a great step forward (both the automation, and the clarified
guidance). Hopefully we can automate virtually everything but the voting
away.
On Thu, Aug 24, 2023 at 8:56 AM Ismaël Mejía wrote:
> Ah excellent, I was not aware it was the case, great to know we are in
> advance !
>
> On T
Ah, I see.
Yeah, I've thought about using an iterable for the whole bundle rather than
start/finish bundle callbacks, but one of the questions is how that would
impact implicit passing of the timestamp (and other) metadata from
input elements to output elements. (You can of course attach the metad
I would like to figure out a way to get the stream-y interface to work, as
I think it's more natural overall.
One hypothesis is that if any elements are carried over loop iterations,
there will likely be some that are carried over beyond the loop (after all
the callee doesn't know when the loop is
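The hypothesis above can be made concrete with a small plain-Python sketch of a stream-style batching function. The name batch_elements is hypothetical and this is not a Beam API; it only shows that state carried across loop iterations (the partial batch) usually has to be handled once more after the loop ends, which is the work a finish-bundle callback would otherwise do.

```python
def batch_elements(bundle, batch_size):
    """Group the elements of one bundle into lists of up to batch_size."""
    batch = []
    for element in bundle:
        # The partial batch is carried over from one iteration to the next.
        batch.append(element)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Elements carried beyond the loop: the function itself cannot tell
    # when the input ends, so the leftover batch is flushed here.
    if batch:
        yield batch
```

For example, list(batch_elements(range(5), 2)) yields two full batches followed by a final batch holding the single leftover element.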
+1
I'd love for this information to be accessible programmatically as well
(both directions: extracting parameters from a transform and constructing a
transform from parameters). Making this pattern easy could encourage
compliance.
On Thu, Aug 24, 2023 at 8:54 AM Svetak Sundhar via dev
wrote:
>
On Thu, Aug 24, 2023 at 12:58 PM Chamikara Jayalath
wrote:
>
>
> On Thu, Aug 24, 2023 at 12:27 PM Robert Bradshaw
> wrote:
>
>> I would like to figure out a way to get the stream-y interface to work,
>> as I think it's more natural overall.
>>
>> One hypothesis is that if any elements are carrie
+1 (binding)
Verified the artifacts and signatures, they all look good. Tried a simple
Python pipeline in a fresh install that worked fine. Thanks for putting
this together.
On Thu, Aug 24, 2023 at 4:09 PM Robert Burke wrote:
> Hi everyone,
> Please review and vote on the release candidate #2
I think this is a great library. I'm on the fence about whether it makes sense
to include with Beam proper vs. be a library that builds on top of Beam.
(Would there be benefits of tighter integration? There is the
maintenance/loss of governance issue.) I am definitely not on the side that
the entire B
Could you clarify what you mean by annotating the transform?
On Fri, Sep 15, 2023 at 9:57 AM Joey Tran wrote:
> While implementing a runner, we tried annotating a CombineByKey transform.
> I noticed that the annotations for the CBK are then lost in the fusion
> optimization stage when the CBK is
Very clear now :). Thanks for the fix; it looks good.
On Fri, Sep 15, 2023 at 5:07 PM Joey Tran wrote:
> Ended up just filing a PR [1]
>
> [1] https://github.com/apache/beam/pull/28489
>
>
> On Fri, Sep 15, 2023 at 12:51 PM Joey Tran
> wrote:
>
>> While implementing a runner, we tried annotatin
TBH, I'm not a huge fan of the wikis either. My ideal flow would be
something like g3doc, and markdown files in github do a reasonable enough
job emulating that. (I don't think the overhead of having to do a PR for
small edits like typos is onerous, as those are super easy reviews to do as
well...)
Dataflow uses a work-stealing protocol. The FnAPI has a protocol to ask the
worker to stop at a certain element that has already been sent.
On Thu, Sep 21, 2023 at 4:24 PM Joey Tran wrote:
> Writing a runner and the first strategy for determining bundling size was
> to just start with a bundle s
On Fri, Sep 22, 2023 at 7:23 AM Byron Ellis via dev
wrote:
> I've actually wondered about this specifically for streaming... if you're
> writing a pipeline there it seems like you're often going to want to put
> high fixed cost things like database connections even outside of the bundle
> setup.
Related, I stumbled
>> across this the other day: https://github.com/apache/beam-site which
>> appears to be unused which could probably even have different review and
>> committer sets if we wanted?
>>
>> On Thu, Sep 21, 2023 at 3:19 PM Robert Bradshaw via dev <
On Fri, Sep 22, 2023 at 10:58 AM Jan Lukavský wrote:
> On 9/22/23 18:07, Robert Bradshaw via dev wrote:
>
> On Fri, Sep 22, 2023 at 7:23 AM Byron Ellis via dev
> wrote:
>
>> I've actually wondered about this specifically for streaming... if you're
>> w
Given the interest in the YAML work by multiple parties, we put together
https://s.apache.org/beam-yaml-contribute to more easily coordinate on this
effort. Nothing that surprising--we're going to continue using the standard
lists, github, etc.--but it should help for folks who want to get started.
nction
>>> <https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/streaming/api/functions/sink/TwoPhaseCommitSinkFunction.html>
>>> which
>>> waits for a checkpoint. In Beam, this is the reason we introduced
>>> RequiresSt
>> elements are not considered "processed" until FinishBundle.
>>>>>
>>>>> You are right about Flink. In many cases this is fine - if Flink rolls
>>>>> back to the last checkpoint, the watermark will also roll back, and
>>>
Beam Java and Beam Python have the exact same behavior: transform names
within must be distinct [1]. This is because we do not necessarily know at
pipeline construction time if the pipeline will be streaming or batch, or
if it will be updated in the future, so the decision was made to impose
this res
+1 (binding)
Verified artifacts and signatures and tested a simple Python pipeline in a
fresh environment with a wheel.
On Wed, Oct 4, 2023 at 8:05 AM Ritesh Ghorse via dev
wrote:
> +1 (non-binding) validated Go SDK quickstart and Python Streaming
> quickstart on Dataflow runner.
>
> Thanks!
>
Huh. This used to be a hard error in Java, but I guess it's togglable
with an option now. We should probably add the option to toggle it in Python too.
(Unclear what the default should be, but this probably ties into
re-thinking how pipeline update should work.)
On Thu, Oct 5, 2023 at 4:58 AM Joey Tran
Currently the various file writing configurations take a single parameter,
path, which indicates where the (sharded) output should be placed. In other
words, one can write something like
    pipeline:
      ...
      sink:
        type: WriteToParquet
        config:
          path: /beam/filesystem/dest
and
On Mon, Oct 9, 2023 at 1:11 PM Robert Burke wrote:
> I'll note that the file "Writes" in the Go SDK are currently an unscalable
> antipattern, because of this exact question.
>
> Aside from carefully examining other SDKs it's not clear how one authors
> a reliable, automatically shardable, wind
ing pattern, one could have a full filepattern that
includes format parameters for dynamically computed bits as well as the
shard number, windowing info, etc. (There are pros and cons to this.)
> On Mon, Oct 9, 2023 at 12:37 PM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>
"real" SDKs.
>>>
>>> For dynamic destinations, I think just making the "path" component
>>> support a lambda that is parameterized by the input should be adequate
>>> since this allows customers to direct files written to different
>>> destination
>>>
>>> For dynamic destinations, I think just making the "path" component
>>> support a lambda that is parameterized by the input should be adequate
>>> since this allows customers to direct files written to different
>>> destination directories.
>>>
ould be the best way to specify a lambda here though.
> Maybe a regex or the name of a Python callable ?
>
I'd rather not require Python for a pure Java pipeline, but some kind of a
pattern template may be sufficient here.
> On Mon, Oct 9, 2023 at 2:06 PM Robert Bradshaw via dev
hould be adequate
>>>> since this allows customers to direct files written to different
>>>> destination directories.
>>>>
>>>> sink:
>>>>   type: WriteToParquet
>>>>   config:
>>>>     path:
>>
>>>   type: WriteToParquet
>>>   config:
>>>     path:
>>>     prefix:
>>>     suffix:
>>>
>>> I'm not sure what would be the best way to specify a lambda here though.
>>> Maybe a regex or the name of a Python
>>>> destination directories.
>>>>
>>>> sink:
>>>>   type: WriteToParquet
>>>>   config:
>>>>     path:
>>>>     prefix:
>>>>     suffix:
>>>>
>>>> I
ty of use-cases I believe. Fully
>>>>>> customizing the file pattern sounds like a more advanced use case that
>>>>>> can
>>>>>> be left for "real" SDKs.
>>>>>>
>>>>>
>>>>> Yea, we don
I would definitely support a PR making this an option. Changing the default
would be a rather big change that would require more thought.
On Tue, Oct 10, 2023 at 4:24 PM Joey Tran wrote:
> Bump on this. Sorry to pester - I'm trying to get a few teams to adopt
> Apache Beam at my company and I'm
Does this change any development practices? E.g. if I clone the repo, I'm
assuming I couldn't run "setup.py test" anymore. What about the generated
files (like protos, or the yaml definitions copied from other parts of the
repo)?
On Thu, Oct 12, 2023 at 12:27 PM Anand Inguva via dev
wrote:
> The
On Thu, Oct 12, 2023 at 2:04 PM Anand Inguva wrote:
> I am in the process of updating the documentation at
> https://cwiki.apache.org/confluence/display/BEAM/Python+Tips related to
> setup.py/pyproject.toml changes, but yes you can't call setup.py directly
> because it might fail due to the lack
You'll probably need to resolve "s3a:///*.parquet" out into a
concrete non-glob filepattern to inspect it this way. Presumably any
individual shard will do. match and open from
https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/FileSystems.html
may be useful.
On Wed, Oct 11, 2
ame and
> (file extension, for example) which can be useful for some downstream
> use-cases. Rest of the filename will be filled out by the SDK (window, pane
> etc.) to make sure that the files written by different workers do not
> conflict.
>
> Thanks,
> Cham
>
>
>
Thanks for the PR.
I think we should follow Java and allow non-unique labels, but not provide
automatic uniquification. In particular, the danger of using a counter is
that one can get accidental (and potentially hard to check) off-by-one
collisions. As a concrete example, imagine one partitions a
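The off-by-one risk of counter-based uniquification can be shown with a small plain-Python sketch. This is a hypothetical example rather than the one the message goes on to describe, and assign_names_with_counter is an illustrative name, not part of any SDK.

```python
def assign_names_with_counter(steps):
    """Map counter-uniquified labels to the step each one refers to.

    `steps` is a list of (label, description) pairs in pipeline order.
    Repeated labels get a numeric suffix: Map, Map_1, Map_2, ...
    """
    seen = {}
    names = {}
    for label, description in steps:
        count = seen.get(label, 0)
        seen[label] = count + 1
        unique = label if count == 0 else f"{label}_{count}"
        names[unique] = description
    return names


# Version 1 of a pipeline has two steps that share the label "Map".
v1 = assign_names_with_counter([("Map", "parse"), ("Map", "format")])

# Version 2 adds one more "Map" step in front of the existing ones.
v2 = assign_names_with_counter(
    [("Map", "filter"), ("Map", "parse"), ("Map", "format")]
)
```

In v1 the name Map_1 refers to the "format" step, while in v2 the same name refers to the "parse" step. Both versions are internally consistent, so nothing fails at construction time, but anything keyed on the name across versions (for instance state matched up during a pipeline update) would silently be attached to the wrong step.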
On Fri, Oct 13, 2023 at 10:08 AM Joey Tran
wrote:
> That makes sense. Would you suggest the new option simply suppress the
> RuntimeError and use the non-unique label?
>
Yes. (Or, rather, not raise it.)
> Are there places on the SDK side that expect unique labels? Or in
> non-updateable runner