Re: Is it safe to cache the value of a singleton view (with a global window) in a DoFn?

2019-05-06 Thread Kenneth Knowles
A singleton view in the global window and no triggering does have just a single immutable value. (It really ought to have an updated value in the presence of triggers, but I believe instead you will receive a crash. I haven't tested.) In general, a side input yields one value per window. Dataflow

Re: Beam Summit Europe: speakers and schedule online!

2019-05-23 Thread Kenneth Knowles
Nice! What a great spread of topics. This should be an amazing event. Kenn On Thu, May 23, 2019 at 1:58 PM Joana Filipa Bernardo Carrasqueira < joanafil...@google.com> wrote: > Hi all! > > Looking forward to the conversations about Beam and to meet new people in > the community! > > Please help

Re: a fix to send RabbitMq messages

2019-05-24 Thread Kenneth Knowles
Coders are all set up by the SDK before the pipeline is given to a runner, so that sounds like a strange issue. Would you also file a Jira ticket about your experience with the coder registry and the DataflowRunner? On Fri, May 24, 2019 at 5:26 AM Nicolas Delsaux wrote: > Thanks, PR is started (

Re: How Can I access the key in subclass of CombinerFn when combining a PCollection of KV pairs?

2019-06-07 Thread Kenneth Knowles
The answer on StackOverflow looks good to me. Kenn On Fri, Jun 7, 2019 at 4:11 AM Massy Bourennani wrote: > Hi, > > here is the SO post: > > https://stackoverflow.com/questions/56451796/how-can-i-access-the-key-in-subclass-of-combinerfn-when-combining-a-pcollection > > Many thanks >

Re: Questions about the bundled PubsubIO read implementation

2019-07-10 Thread Kenneth Knowles
This is pretty surprising. Seems valuable to file separate Jiras so we can track investigation and resolution. - use gRPC: https://issues.apache.org/jira/browse/BEAM-7718 - empty message bodies: https://issues.apache.org/jira/browse/BEAM-7716 - watermark tracking: https://issues.apache.org/jira

Re: SDK support status clarification

2019-07-11 Thread Kenneth Knowles
That page is a better authority than this list. It has all the public information and is up to date. What you may be most interested in is the orange box describing the decommissioning you mention: "The new end date has yet to be finalized but is expected to happen in 2019. When decommissioning ha

Re: [Java] TextIO not reading file as expected

2019-07-11 Thread Kenneth Knowles
Doesn't sound good. TextIO has been around a long time so I'm surprised. Would you mind creating a ticket in Jira ( https://issues.apache.org/jira/projects/BEAM/) and posting some technical details, like input/output/code snippets? Kenn On Thu, Jul 11, 2019 at 9:45 AM Shannon Duncan wrote: > I

Re: [Java] TextIO not reading file as expected

2019-07-12 Thread Kenneth Knowles
> > On Thu, Jul 11, 2019 at 10:28 PM Kenneth Knowles wrote: > >> Doesn't sound good. TextIO has been around a long time so I'm surprised. >> Would you mind creating a ticket in Jira ( >> https://issues.apache.org/jira/projects/BEAM/) and posting some &

Re: Beam release 2.5.0 tag SNAPSHOT version

2019-07-17 Thread Kenneth Knowles
Looks like it is this: https://github.com/apache/beam/tree/4838ae16c172252bc0a15e3a984e085f82e25c2d I believe the release manager created the tag to point to the tip of the release branch after the maven release plugin rolled that change back (this is how the maven release plugin works since it do

Re: Beam release 2.5.0 tag SNAPSHOT version

2019-07-17 Thread Kenneth Knowles
I take that back - misclicked on 0.5.0 (which has a correct tag). On Wed, Jul 17, 2019 at 3:34 PM Kenneth Knowles wrote: > Looks like it is this: > https://github.com/apache/beam/tree/4838ae16c172252bc0a15e3a984e085f82e25c2d > > I believe the release manager created the tag to poin

Re: Beam release 2.5.0 tag SNAPSHOT version

2019-07-17 Thread Kenneth Knowles
properties: https://github.com/apache/beam/blob/v2.5.0/gradle.properties I do believe this should be 2.5.0, not 2.5.0-RC2. But anyhow I think that is the commit that was used to build 2.5.0. Kenn On Wed, Jul 17, 2019 at 3:36 PM Kenneth Knowles wrote: > I take that back - misclicked on 0.

Re: Beam release 2.5.0 tag SNAPSHOT version

2019-07-19 Thread Kenneth Knowles
will publish it as a SNAPSHOT > version. Shall I raise a PR for this? > > On Wed, Jul 17, 2019 at 3:41 PM Kenneth Knowles wrote: > >> What you have pointed to is the tip of the release-2.5.0 branch. The >> gradle release plugin copies the maven release plugin. So it has rolled

Re: applying keyed state on top of stream from co-groupByKey output

2019-07-25 Thread Kenneth Knowles
Thanks for the very detailed question! I have written up an answer and I suggest we continue discussion there. Kenn On Tue, Jul 23, 2019 at 9:11 PM Kiran Hurakadli wrote: > > Hi All, > I am trying to merge 2 data streams using coGroupByKey and applying > stateful > ParDo. Input to the cogroup

Re: How to merge two streams and perform stateful operations on merged stream using Apache Beam

2019-07-25 Thread Kenneth Knowles
Is this the same question as your other email about your StackOverflow question? If it is, then please see my answer on StackOverflow. If it is not, can you explain a little more? Kenn On Wed, Jul 24, 2019 at 10:48 PM Kiran Hurakadli wrote: > I have 2 kafka streams , i want to merge by some ke

Re: Choosing a coder for a class that contains a Row?

2019-07-26 Thread Kenneth Knowles
The most challenging part, as I understand it, surrounds automatically inferred schemas from POJOs, where Java's nondeterministic iteration order, combined with a row's inherent ordering, means that even an identical pipeline will need some metadata to plumb the right fields to the right column ind

Re: [Java] Accessing state from FinishBundle method

2019-07-31 Thread Kenneth Knowles
Because @FinishBundle is not executed in the context of a window, what state would you be accessing? (analogous to the way that outputs from finish bundle must have an explicit window specified) It may make sense to have a separate method @FinishBundleForWindow (or some better name) that can be ca

Re: Query about JdbcIO.readRows()

2019-08-21 Thread Kenneth Knowles
Hi Kishor, If you could not find a Jira, would you file one? Your contribution would be very appreciated. Kenn On Tue, Aug 20, 2019 at 10:04 PM Kishor Joshi wrote: > Hi, > > This fix is still not available in the Beam 2.15.0. Is there any Jira that > has been created for this issue ? I am inte

Re: Hackathon @BeamSummit @ApacheCon

2019-08-22 Thread Kenneth Knowles
I will be at Beam Summit / ApacheCon NA and would love to drop by a hackathon room if one is arranged. Really excited for both my first ApacheCon and Beam Summit (finally!) Kenn On Thu, Aug 22, 2019 at 10:18 AM Austin Bennett wrote: > And, for clarity, especially focused on Hackathon times on M

Re: Beam meetup Seattle!! September 26th, 6pm

2019-09-25 Thread Kenneth Knowles
Thanks for organizing. I'll be there! Kenn On Wed, Sep 25, 2019 at 2:50 PM Aizhamal Nurmamat kyzy wrote: > Gentle reminder that Seattle Apache Beam meetup is happening tomorrow! > > Here is a quick agenda: > - 18:00 - Registrations, speed networking, pizza and drinks. > - 18:30 - kick-off > - 1

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread Kenneth Knowles
I am pretty surprised that we do not have a @Category(ValidatesRunner) test in GroupByKeyTest that iterates multiple times. That is a major oversight. We should have this test, and it can be disabled by the SparkRunner's configuration. Kenn On Fri, Sep 27, 2019 at 9:24 AM Reuven Lax wrote: > Th

Re: Beam discarding massive amount of events due to Window object or inner processing

2019-10-08 Thread Kenneth Knowles
This is an unfortunate usability problem with triggers where you can accidentally close the window and drop all data. I think instead, you probably want this trigger: Repeatedly.forever( AfterProcessingTime .pastFirstElementInPane() .plusDelayOf(Dura

Re: Limited join with stop condition

2019-10-10 Thread Kenneth Knowles
Interesting! I agree with Luke that it seems not a great fit for Beam in the most rigorous sense. There are many considerations: 1. We assume ParDo has side effects by default. So the model actual *requires* eager evaluation, not lazy, in order to make all the side effects happen. But for your cas

Re: Joining PCollections to aggregates of themselves

2019-10-11 Thread Kenneth Knowles
This seems a great example of use of stateful DoFn. It has essentially the same structure as the example on the Beam blog but is more meaningful. Kenn On Fri, Oct 11, 2019 at 12:38 PM Robert Bradshaw wrote: > OK, the only way to do this would be via a non-determanistic stateful > DoFn that buff

Re: Streaming data from Pubsub to Spanner with Beam dataflow pipeline

2019-10-30 Thread Kenneth Knowles
Moving to user@beam.apache.org, the best mailing list for questions like this. Yes, this kind of workload is a core use case for Beam. If you have a problem, please write to this user list with details. Kenn On Wed, Oct 30, 2019 at 4:07 AM Taher Koitawala wrote: > Hi All, > My curren

Re: Multiple triggers contained w/in side input?

2019-11-05 Thread Kenneth Knowles
On Tue, Nov 5, 2019 at 9:29 PM Aaron Dixon wrote: > From https://beam.apache.org/documentation/programming-guide/#side-inputs > > > If the side input has multiple trigger firings, Beam uses the value > from the latest trigger firing. This is particularly useful if you use a > side input with a si

Re: Multiple triggers contained w/in side input?

2019-11-06 Thread Kenneth Knowles
ke how View.asMap() will crash if you have multiple triggers on a per-key Combine/Sum/etc. > Thanks, > Rahul > > On Wed, Nov 6, 2019 at 11:24 AM Kenneth Knowles wrote: > >> >> >> On Tue, Nov 5, 2019 at 9:29 PM Aaron Dixon wrote: >> >>> From

Re: slides?

2019-11-15 Thread Kenneth Knowles
We have a section for this: https://beam.apache.org/community/presentation-materials/. Right now "Presentation Materials" has the appearance of carefully curated stuff from a core team. That was probably true three years ago, but now it is simply out of date. A lot of the material is so old that i

Re: Fanout and OOMs on Dataflow while streaming to BQ

2019-11-22 Thread Kenneth Knowles
+Lukasz Cwik +Chamikara Jayalath It sounds like your high-fanout transform that listens for new files on Pubsub would be a good fit for Splittable DoFn (SDF). It seems like a fairly common use case that could be a useful general contribution to Beam. Kenn On Fri, Nov 22, 2019 at 7:50 AM Jeff K

Re: Scio 0.8.0 released

2020-01-09 Thread Kenneth Knowles
So fast! Excellent. On Wed, Jan 8, 2020 at 11:28 AM Robert Bradshaw wrote: > Nice! > > On Wed, Jan 8, 2020 at 10:03 AM Neville Li wrote: > >> Hi all, >> >> We just released Scio 0.8.0. This is based on the most recent Beam 2.17.0 >> release and includes a lot of new features & bug fixes over th

Re: Question about triggering

2020-01-09 Thread Kenneth Knowles
Does it have the same behavior in the direct runner? What are the sizes of intermediate PCollections? Kenn On Wed, Jan 8, 2020 at 1:05 PM Andrés Garagiola wrote: > Hi all, > > I'm doing some tests with beam and apache flink. I'm running the code > below: > > public static void main(String[] a

Re: Question about triggering

2020-01-13 Thread Kenneth Knowles
th the direct runner I will tried that. The size of the > collections is about 600 records. > > Thanks > Regards > > On Thu, Jan 9, 2020 at 11:56 PM Kenneth Knowles wrote: > >> Does it have the same behavior in the direct runner? What are the sizes >> of intermediate PC

Re: Escalating Checkpoint durations in Flink with SQS Source

2020-01-17 Thread Kenneth Knowles
Starting with just checking some basic things: What is the space of keys? Is it constantly growing? Are you explicitly clearing out state from stale keys? In the global window, you don't get any state GC for free. Can you share repro code? Kenn On Fri, Jan 17, 2020 at 8:53 AM Stephen Patel wrot

Re: Escalating Checkpoint durations in Flink with SQS Source

2020-01-17 Thread Kenneth Knowles
roups them, > and generates some metrics. The read pipeline will operate fine for. > > I've also included the flink-conf that I was using. > > Stephen > > On Fri, Jan 17, 2020 at 11:47 AM Kenneth Knowles wrote: > > > > Starting with just checking some basic th

Re: BigQuery TIMESTAMP and TimestampedValue()

2020-01-22 Thread Kenneth Knowles
Ah, that's too bad. I wonder why they chose to put " UTC" on the end instead of just a "Z". Other than that, the format is RFC3339 and the iso8601 module does have the extension to use a space instead of a T to separate the date and time. I tested and if you strip the " UTC" then parsing succeeds.

Re: Stability of Timer.withOutputTimestamp

2020-02-05 Thread Kenneth Knowles
It is definitely too new to be stable in the sense of not even tiny changes to the API / runtime compatibility. However, in my opinion it is so fundamental (and overdue) it will certainly exist in some form. Feel free to use it if you are OK with the possibility of minor compile-time adjustments

Re: Implementing custom session with max event/element count

2020-02-12 Thread Kenneth Knowles
I notice that you use the name "IntervalWindow" but you are calling methods that IntervalWindow does not have. Do you have a custom implementation of this class? Do you have a custom coder for your version of IntervalWindow? Kenn On Wed, Feb 12, 2020 at 7:30 PM Jainik Vora wrote: > Hi Everyone,

Re: Unbounded input join Unbounded input then write to Bounded Sink

2020-02-24 Thread Kenneth Knowles
I think actually it depends on the pipeline. You cannot do it all in SQL, but if you mix Java and SQL I think you can do this. If you write this: pipeline.apply(KafkaIO.read() ... ) .apply(Window.into(FixedWindows.of(1 minute)) .apply(SqlTransform("SELECT ... FROM stream1 JOIN

Re: MapReduceRunner

2020-02-26 Thread Kenneth Knowles
There's a (very old) prototype on a branch at https://github.com/apache/beam/tree/mr-runner. No one has taken up the work of finishing it up to merge it, and there's no one to maintain it. We intend to remove this incomplete runners from the website. Kenn On Wed, Feb 26, 2020 at 3:07 PM Maulik S

Re: GCS numShards doubt

2020-03-02 Thread Kenneth Knowles
For bounded data, each bundle becomes a file: https://github.com/apache/beam/blob/da9e17288e8473925674a4691d9e86252e67d7d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L356 Kenn On Mon, Mar 2, 2020 at 6:18 PM Kyle Weaver wrote: > As Luke and Robert indicated, unsetting nu

Re: A new reworked Elasticsearch 7+ IO module

2020-03-06 Thread Kenneth Knowles
Since the user provides backendVersion, here are some possible levels of things to add in expand() based on that (these are extra niceties beyond the agreed number of releases to remove) - WARN for backendVersion < n - reject for backendVersion < n with opt-in pipeline option to keep it working

Re: Using Self signed root ca for https connection in eleasticsearchIO

2020-04-07 Thread Kenneth Knowles
Hi Mohil, Thanks for the detailed report. I think most people are reduced capacity right now. Filing a Jira would be helpful for tracking this. Since I am writing, I will add a quick guess, but we should move to Jira. It seems this has more to do with Dataflow than ElasticSearch. The default for

Re: Global Window

2020-04-15 Thread Kenneth Knowles
In batch, with all bounded data, processing time timers are typically not processed. This is because the window is first fully processed and expired. Can you explain a bit more about why you want a processing time timer in your use case? Kenn On Wed, Apr 15, 2020 at 9:41 PM Aniruddh Sharma wrot

Re: Distributed Tracing in Apache Beam

2020-04-21 Thread Kenneth Knowles
+dev I don't have a ton of time to dig in to this, but I wanted to say that this is very cool and just drop a couple pointers (which you may already know about) like Explaining Outputs in Modern Data Analytics [1] which was covered by The Morning Paper [2]. This just happens to be something I rea

Re: Recommended Reading for Apache Beam

2020-04-21 Thread Kenneth Knowles
I believe Streaming Systems is the most Beam-oriented book available. Kenn On Mon, Apr 20, 2020 at 3:07 PM Rion Williams wrote: > Hi all, > > I posed this question over on the Apache Slack Community however didn't > get much of a response so I thought I'd reach out here. I've been looking > for

Re: Running NexMark Tests

2020-04-21 Thread Kenneth Knowles
We should always want to shut down sources on final watermark. All incoming data should be dropped anyhow. Kenn On Tue, Apr 21, 2020 at 1:34 PM Luke Cwik wrote: > +dev > > When would we not want --shutdownSourcesOnFinalWatermark=true ? > > On Tue, Apr 21, 2020 at 1:22 PM Ismaël Mejía wrote: >

Re: Stateful & Timely Call

2020-04-22 Thread Kenneth Knowles
The definition of batch mode for Dataflow is this: completely compute the result of one stage of computation before starting the next stage. There is no way around this. It does not have to do with using state and timers. If you are working with state & timers & triggers, and you are hoping for ou

Re: Notifying the closure of a Window Period

2020-05-01 Thread Kenneth Knowles
I am guessing you will be affected by the fact that windows processed independently for each key. Is that what you are referring to when you mention multiple isLast() windows? Kenn On Fri, May 1, 2020 at 3:36 AM Truebody, Kyle wrote: > Hi all, > > > > We are working on a streaming pipeline that

Re: Notifying the closure of a Window Period

2020-05-01 Thread Kenneth Knowles
; > createTriggerFile(/*tigger file name*/ ".trigger"); *//writes > to the same directory of the current window. This fires multiple time > depending on the number of panes that have isLast() is true/ or write > operators (not sure exactly).* > > r

Re: Non-trivial joins examples

2020-05-01 Thread Kenneth Knowles
+dev @beam and some people who I talk about joins with Interesting! It is a lot to take in and fully grok the code, so calling in reinforcements... Generally, I think there's agreement that for a lot of real use cases, you have to roll your own join using the lower level Beam primitives. So I thi

Re: Notifying the closure of a Window Period

2020-05-01 Thread Kenneth Knowles
the trigger "closing" and dropping data. What version of SDK are you using? Kenn > > .withAllowedLateness(Utilities.resolveDuration(options.getWindowLateness())) > > .discardingFiredPanes()).apply(textWriter); > > > > > > > > *From:* K

Re: Pattern for sharing a resource across all worker threads in Beam Java SDK

2020-05-01 Thread Kenneth Knowles
On Fri, May 1, 2020 at 2:01 PM Luke Cwik wrote: > All worker threads are going to see the same instance. > In Dataflow there is only one JVM containing user code per VM instance. > > One of the reasons around the portability effort is having the worker > nodes/JVMs be consistent across runners so

Re: How to import v2.20.0 to IntelliJ IDEA 2020.1.1?

2020-05-20 Thread Kenneth Knowles
This changed for me recently. I re-run my IntelliJ import regularly just to make sure setting up does not depend on painful manual configuration. My most recent re-run also ended up having no dependencies. I did not track this down or file a bug because I did not have time to confirm it was not jus

Re: DoFn Timer fire multiple times

2020-07-15 Thread Kenneth Knowles
Hello! What runner are you using? Does this reproduce on multiple runners? (it is very quick to try out your pipeline on DirectRunner and local versions of open source runners like Flink, Spark, etc) If you can produce a complete working reproduction it will be easier for someone to debug. I do n

Re: Exceptions: Attempt to deliver a timer to a DoFn, but timers are not supported in Dataflow.

2020-07-29 Thread Kenneth Knowles
Hi Mohil, It helps also to tell us what version of Beam you are using and some more details. This looks related to https://issues.apache.org/jira/browse/BEAM-6855 which claims to be resolved in 2.17.0 Kenn On Mon, Jul 27, 2020 at 11:47 PM Mohil Khare wrote: > Hello all, > > I think I found the

Re: Support of per-key state after windowing

2020-08-22 Thread Kenneth Knowles
Hi Dongwon, On Sat, Aug 22, 2020 at 2:46 PM Dongwon Kim wrote: > Hi all, > > I'm using Beam 2.23.0 with FlinkRunner and a part of my unbounded pipeline > looks like below: > >> p.apply(WithKeys.of(...).withKeyType(...)) // (A) >> .apply(Window.into(FixedWindows.of(...)))// (B) > > .a

Re: Support of per-key state after windowing

2020-08-23 Thread Kenneth Knowles
;>>> ) >>>> .apply(GroupByKey.create()) >>>> // (F) >>>> .apply(ParDo.of(new MyDoFn())) >>>> // (D) >>> >>> >>> I had to include (E1), (E2), (E3), and (F) so that MyDoFn (D) can >>> iterate

Re: Support of per-key state after windowing

2020-08-23 Thread Kenneth Knowles
Yes :-) On Sun, Aug 23, 2020 at 2:16 PM Reuven Lax wrote: > Kenn - shouldn't the Reify happen before the rewindow? > > On Sun, Aug 23, 2020 at 11:08 AM Kenneth Knowles wrote: > >> >> >> On Sun, Aug 23, 2020 at 1:12 PM Dongwon Kim >> wrote: >> &

Re: Count based triggers and latency

2020-10-12 Thread Kenneth Knowles
Another thing to keep in mind - apologies if it was already clear: triggering governs aggregation (GBK / Combine). It does not have any effect on stateful DoFn. On Mon, Oct 12, 2020 at 9:24 AM Luke Cwik wrote: > The default trigger will only fire when the global window closes which > does happen

Re: Which Solr versions should be supported by Beam

2020-10-23 Thread Kenneth Knowles
This might be a good question for u...@solr.apache.org and/or d...@solr.apache.org, too. Kenn On Fri, Oct 23, 2020 at 6:24 AM Piotr Szuberski wrote: > Beam has quite old Solr dependency (5.x.y) which has been deprecated for a > long time. > > Solr dependency has recently been updated to 8.6.y,

Re: [ANNOUNCE] Beam 2.25.0 Released

2020-10-26 Thread Kenneth Knowles
Hooray! Thanks Robin! On Mon, Oct 26, 2020 at 11:51 AM Rui Wang wrote: > Thank you Robin! > > > -Rui > > On Mon, Oct 26, 2020 at 11:44 AM Pablo Estrada wrote: > >> Thanks Robin! >> >> On Mon, Oct 26, 2020 at 11:06 AM Robin Qiu wrote: >> >>> The Apache Beam team is pleased to announce the relea

Re: Help measuring upcoming performance increase in flink runner on production systems

2020-12-21 Thread Kenneth Knowles
I really think we should make a plan to make this the default. If you test with the DirectRunner it will do mutation checking and catch pipelines that depend on the runner cloning every element. (also the DirectRunner doesn't clone). Since the cloning is similar in cost to the mutation detection, c

Re: Combine with multiple outputs case Sample and the rest

2021-01-05 Thread Kenneth Knowles
Perhaps something based on stateful DoFn so there is a simple decision point at which each element is either sampled or not so it can be output to one PCollection or the other. Without doing a little research, I don't recall if this is doable in the way you need. Kenn On Wed, Dec 23, 2020 at 3:12

Re: Side input in streaming

2021-01-05 Thread Kenneth Knowles
You have it basically right. However, there are a couple minor clarifications: 1. A particular window on the side input is not "ready" until there has been some element output to it (or it has expired, which will make it the default value). Main input elements will wait for the side input to be re

Re: Side input in streaming

2021-01-07 Thread Kenneth Knowles
; side input on setup would be the best option. > > On Tue, 5 Jan 2021 at 17:53, Kenneth Knowles wrote: > >> You have it basically right. However, there are a couple minor >> clarifications: >> >> 1. A particular window on the side input is not "ready" un

Re: Side input in streaming

2021-01-12 Thread Kenneth Knowles
ghly appreciate it. > > On Thu, 7 Jan 2021 at 18:29, Kenneth Knowles wrote: > >> Actually, if you want to actually re-read the BQ table then you need >> something more, following the pattern here: >> https://beam.apache.org/documentation/patterns/side-inputs/. There are >>

Re: Is there an array explode function/transform?

2021-01-12 Thread Kenneth Knowles
Explode is called UNNEST in Beam SQL (and I believe this is the more standard name). FlatMap(arr -> arr) is a simple and efficient implementation for straight Beam. Kenn On Tue, Jan 12, 2021 at 4:58 PM Kyle Weaver wrote: > @Reuven Lax yes I am aware of that transform, but >> that’s different

Re: Is there an array explode function/transform?

2021-01-13 Thread Kenneth Knowles
Just the fields specified, IMO. When in doubt, copy SQL. (and I mean SQL generally, not just Beam SQL) Kenn On Wed, Jan 13, 2021 at 11:17 AM Reuven Lax wrote: > Definitely could be a top-level transform. Should it automatically unnest > all arrays, or just the fields specified? > > We do have t

Re: Iterative algorithm in BEAM

2021-01-21 Thread Kenneth Knowles
Your idea works: first read the directory metadata and then downstream an SDF that jumps to file offsets. This is very much what SDF is for. Splitting within a zip entry will not be useful. If you have indeterminate nesting depth, you do need iterative computation. Beam doesn't directly support th

Do you use TimestampCombiner.EARLIEST with SlidingWindows or other overlapping windows?

2021-03-11 Thread Kenneth Knowles
Hi users, ** Do you use TimestampCombiner.EARLIEST with SlidingWindows or other overlapping windows? If no, then you can stop reading now. ** We are considering a change to simplify how timestamps are computed for aggregations in this case. Warning: this is a bit complicated and a curious corner

Re: Write to multiple IOs in linear fashion

2021-03-24 Thread Kenneth Knowles
Alex's idea sounds good and like what Vincent maybe implemented. I am just reading really quickly so sorry if I missed something... Checking out the code for the WriteFn I see a big problem: @Setup public void setup() { writer = new Mutator<>(spec, Mapper::saveAsync, "writes");

Re: Write to multiple IOs in linear fashion

2021-03-24 Thread Kenneth Knowles
4, 2021 at 5:30 PM Reuven Lax wrote: > >> I believe that the Wait transform turns this output into a side input, so >> outputting the input PCollection might be problematic. >> >> On Wed, Mar 24, 2021 at 4:49 PM Kenneth Knowles wrote: >> >>> Alex's id

Re: General guidance

2021-03-25 Thread Kenneth Knowles
This is a Beam issue indeed, though it is an issue with the FlinkRunner. So I think I will BCC the Flink list. You may be in one of the following situations: - These timers should not be viewed as distinct by the runner, but deduped, per https://issues.apache.org/jira/browse/BEAM-8212#comment-169

Re: Triggering partway through a window

2021-03-29 Thread Kenneth Knowles
That's a neat example! The trigger you have there will emit a ton of output. What is your accumulation mode? I assume it must be accumulatingFiredPanes() otherwise you would not actually have access to the prior 6 days of input. The only trigger that is based on completeness of data is the AfterW

Re: Global window + stateful transformation

2021-03-31 Thread Kenneth Knowles
Great question! Moving this to user@beam.apache.org Kenn On Wed, Mar 31, 2021 at 10:19 AM Hemali Sutaria < hsuta...@paloaltonetworks.com> wrote: > Beam Developers, > > I have a global window with per-key-and-window stateful processing > dataflow job. Do I need groupbykey in my transform ? Thank

Re: Global window + stateful transformation

2021-03-31 Thread Kenneth Knowles
On Wed, Mar 31, 2021 at 10:20 AM Kenneth Knowles wrote: > > On Wed, Mar 31, 2021 at 10:19 AM Hemali Sutaria < > hsuta...@paloaltonetworks.com> wrote: > >> I have a global window with per-key-and-window stateful processing >> dataflow job. Do I need groupbykey in my

Re: Checkpointing Dataflow Pipeline

2021-04-06 Thread Kenneth Knowles
I would assume the main issue is resuming reading from the Kinesis stream from the last read? In the case for Pubsub (just as another example of the idea) this is part of the internal state of a pre-created subscription. Kenn On Tue, Apr 6, 2021 at 1:26 PM Michael Luckey wrote: > Hi list, > > w

Re: Checkpointing Dataflow Pipeline

2021-04-06 Thread Kenneth Knowles
run "streaming-beam-sql-`date > +%Y%m%d-%H%M%S`" \ > --template-file-gcs-location "$TEMPLATE_PATH" \ > --parameters inputSubscription="$SUBSCRIPTION" \ > --parameters outputTable="$PROJECT:$DATASET.$TABLE" \ > --region "

Re: Checkpointing Dataflow Pipeline

2021-04-07 Thread Kenneth Knowles
mantics in the event of shutting down a stream and > resuming it at a later point. > > In Kafka you don't need this since as long as we ensure our offsets are > committed in finalization of a bundle, the offsets for a particular group > id are stored on the server. > >

Re: Question on late data handling in Beam streaming mode

2021-04-22 Thread Kenneth Knowles
Hello! In a streaming app, you have two choices: wait forever and never have any output OR use some method to decide that aggregation is "done". In Beam, the way you decide that aggregation is "done" is the watermark. When the watermark predicts no more data for an aggregation, then the aggregati

Re: Question on late data handling in Beam streaming mode

2021-04-23 Thread Kenneth Knowles
ri, Apr 23, 2021 at 8:14 AM Tao Li wrote: > >> Thanks @Kenneth Knowles . I understand we need to >> specify a window for groupby so that the app knowns when processing is >> “done” to output result. >> >> >> >> Is it possible to specify a event arrival/process

Re: DirectRunner, Fusion, and Triggers

2021-05-12 Thread Kenneth Knowles
On Sat, May 8, 2021 at 12:00 AM Bashir Sadjad wrote: > Hi Beam-users, > > *TL;DR;* I wonder if DirectRunner does any fusion optimization > > and whether this has any impact on triggers/panes? > > *Details* (the context for

Re: Beam/Dataflow pipeline backfill via JDBC

2021-05-12 Thread Kenneth Knowles
Can you share some more details, such as code? We may identify something that relies upon assumptions from batch execution style. Also notably the Java DirectRunner does not have separate batch/streaming mode. It always executes in a "streaming" sort of way. It is also simpler in some ways so if y

Re: How to flush window when draining a Dataflow pipeline?

2021-05-21 Thread Kenneth Knowles
+dev +Reuven Lax Advancing the watermark to infinity does have an effect on the GlobalWindow. The GlobalWindow ends a little bit before infinity :-). That is why this works to cause the output even for unbounded aggregations. Kenn On Fri, May 21, 2021 at 5:10 AM Jeff Klukas wrote: > Beam use

Re: Error with Beam/Flink Python pipeline with Kafka

2021-05-24 Thread Kenneth Knowles
Thanks for the details in the gist. I would guess the key to the error is this: 2021/05/24 13:45:34 Initializing java harness: /opt/apache/beam/boot --id=1-2 --provision_endpoint=localhost:33957 2021/05/24 13:45:44 Failed to retrieve staged files: failed to retrieve /tmp/staged in 3 attemp

Re: Windowing

2021-05-26 Thread Kenneth Knowles
You can use Window.configure() to only set the values you want to change. Is that what you mean? Kenn On Wed, May 26, 2021 at 8:42 AM Sozonoff Serge wrote: > Hi All, > > I find myself having to pepper Window transforms all over my pipeline, I > count about 9 in order to get my pipeline to run.

Re: Windowing

2021-05-26 Thread Kenneth Knowles
hanks, > Serge > > > > > On 26 May 2021 at 18:01:14, Kenneth Knowles (k...@apache.org) wrote: > > You can use Window.configure() to only set the values you want to change. > Is that what you mean? > > Kenn > > On Wed, May 26, 2021 at 8:42 AM Sozonoff Serge wrote

Re: RenameFields behaves differently in DirectRunner

2021-06-01 Thread Kenneth Knowles
On Tue, Jun 1, 2021 at 12:42 PM Brian Hulette wrote: > Hi Matthew, > > > The unit tests also seem to be disabled for this as well and so I don’t > know if the PTransform behaves as expected. > > The exclusion for NeedsRunner tests is just a quirk in our testing > framework. NeedsRunner indicates

Re: RenameFields behaves differently in DirectRunner

2021-06-02 Thread Kenneth Knowles
gt;>>- JIRA ticket - https://issues.apache.org/jira/browse/BEAM-12442 >>>>>> >>>>>> >>>>>> On Tue, Jun 1, 2021 at 5:51 PM Reuven Lax wrote: >>>>>> >>>>>>> This transform is the same across all runn

Re: RenameFields behaves differently in DirectRunner

2021-06-02 Thread Kenneth Knowles
that wouldn't catch it. We could >> insert CoderProperties.coderDecodeEncodeEqual in a subsequent ParDo, but if >> the Direct runner already does an encode/decode before that ParDo, then >> that would have fixed the problem before we could see it. >> >> On Wed, Jun 2, 2021 at 11:53 AM Kenne

Re: RenameFields behaves differently in DirectRunner

2021-06-03 Thread Kenneth Knowles
> I'm assuming a CoderTester would require manually generating inputs right? > These input Rows represent an illegal state that we wouldn't test with. > (That being said I like the idea of a CoderTester in general) > > Brian > > On Wed, Jun 2, 2021 at 12:11 PM Kennet

Re: Allyship workshops for open source contributors

2021-06-07 Thread Kenneth Knowles
Yes please! On Thu, Jun 3, 2021, 18:32 Ratnakar Malla wrote: > +1 > > > -- > *From:* Austin Bennett > *Sent:* Thursday, June 3, 2021 6:20:25 PM > *To:* user@beam.apache.org > *Cc:* dev > *Subject:* Re: Allyship workshops for open source contributors > > +1, assumin

Re: No data sinks have been created yet.

2021-06-10 Thread Kenneth Knowles
Beam doesn't use Flink's sink API. I recall from a very long time ago that we attached a noop sink to each PCollection to avoid this error. +Kyle Weaver might know something about how this applies to Python on Flink. Kenn On Wed, Jun 9, 2021 at 4:41 PM Trevor Kramer wrote: > Hello Beam communi

Re: Help: Apache Beam Session Window with limit on number of events and time elapsed from window start

2021-07-07 Thread Kenneth Knowles
Hi Chandan, I am moving this thread to user@beam.apache.org. I think that is the best place to discuss. Kenn On Wed, Jul 7, 2021 at 9:32 AM Chandan Bhattad wrote: > Hi Team, > > Hope you are doing well. > > I have a use case around session windowing with some customizations. > > We need to hav

Re: How to read data from bigtable

2021-07-12 Thread Kenneth Knowles
Hi Raja & all, I'm moving the thread to user@beam.apache.org since I think many users may be interested in the answer to the question, or have other experiences to share. Kenn On Mon, Jul 12, 2021 at 12:58 PM Chamikara Jayalath wrote: > There's also an external connector that us directly suppo

Re: CombineWithContext

2017-01-03 Thread Kenneth Knowles
Hi Matthias, It seems there is a bit of an API inconsistency here. ParDo has the .withSideInputs(Iterable>) version but also a varargs version withSideInputs(PCollectionView...) but Combine only has the Iterable version. So you have the right idea, and just need to put your PCollectionView into a

Re: KafkaIO Example

2017-01-12 Thread Kenneth Knowles
Hi Naveen, I have just successfully compiled the code from your most recent email so I suspect the error lies elsewhere. Taking a wild guess, the error you are getting would be expected if a transform were upcast to the rawtype PTransform, as its raw output type is POutput. Kenn On Thu, Jan 12,

Re: KafkaIO Example

2017-01-17 Thread Kenneth Knowles
apache.beam.sdk.runners.PipelineRunner.apply( > PipelineRunner.java:76) > > at org.apache.beam.runners.direct.DirectRunner.apply( > DirectRunner.java:295) > > at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java: > 385) > > at o

New blog post: "Stateful processing with Apache Beam"

2017-02-13 Thread Kenneth Knowles
Hi all, I've just published a blog post about Beam's new stateful processing capabilities: https://beam.apache.org/blog/2017/02/13/stateful-processing.html The blog post covers stateful processing from a few angles: how it works, how it fits into the Beam model, what you might use it for, an

Re: Documentation: Side Input and Outputs

2017-02-17 Thread Kenneth Knowles
Hi Tobias, On Fri, Feb 17, 2017 at 7:51 AM, Tobias Feldhaus < tobias.feldh...@localsearch.ch> wrote: > It seems like the documentation [0] about side in- and outputs has missed > some > refactoring of the code. I can’t find Max.MaxIntFn() for example, but > there is a > new Max.ofIntegers() metho

Re: Testing/Running a pipeline with a BigQuery Sink locally with the DirectRunner

2017-02-17 Thread Kenneth Knowles
Hi Tobias, The specific error there looks like you have a forbidden null somewhere deep inside the output of logLine.toTableRow(). Hard to say more with this information. Kenn On Fri, Feb 17, 2017 at 4:46 AM, Tobias Feldhaus < tobias.feldh...@localsearch.ch> wrote: > It seems like this is cause

  1   2   3   >