Re: Issues with Python's external ReadFromPubSub

2020-10-30 Thread Maximilian Michels
We used to run legacy sources using the old-style Read translation. Changing it to SDF might have broken ReadFromPubSub. Could you check in the Flink jobs whether it uses the SDF code or the Read translation? For Read you should be seeing the UnboundedSourceWrapper. Looking at the code, there

Re: flink runner 1.10 checkpoint timeout issue

2020-09-18 Thread Maximilian Michels
This type of stack trace occurs when the downstream operator is blocked for some reason. Flink maintains a finite number of network buffers for each network channel. If the receiving downstream operator does not process incoming network buffers, the upstream operator blocks. This is also called

Re: FlinkRunner Graphite metrics

2020-09-15 Thread Maximilian Michels
Which metrics specifically do you mean? Beam metrics (e.g. backlogBytes) or Flink metrics (e.g. numRecordsOut)? Quick look at the code (ReaderInvocationUtil) reveals that the Beam metrics should be reported correctly. The Flink native metrics are always reported, independently of the type of o

Re: Design rational behind copying via serializing in flink runner

2020-09-07 Thread Maximilian Michels
Hey Teodor, Copying is the default behavior. This is tunable via the pipeline option 'objectReuse', i.e. 'objectReuse=true'. The option is disabled by default because users may not be aware of object reuse and may recycle objects in their process functions, which will have unexpected side effects
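A minimal sketch of how the option described above could be passed when launching a pipeline on the Flink runner. The `objectReuse` flag spelling follows the message; the helper function and the other arguments are illustrative, not Beam API.

```python
# Sketch: enabling Flink object reuse through a Beam pipeline option.
# "objectReuse" is the option named in the message above; the helper
# itself and the surrounding flags are illustrative placeholders.
def flink_args(object_reuse):
    args = ["--runner=FlinkRunner"]
    if object_reuse:
        # Off by default: with reuse enabled, process functions must not
        # hold on to or recycle the objects they receive.
        args.append("--objectReuse=true")
    return args

print(flink_args(True))
```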

Re: Clearing states and timers in a Stateful Fn with Global Windows

2020-09-04 Thread Maximilian Michels
Thanks for reporting Gökhan! Please keep us updated. We'll likely merge the patch by the end of the week. -Max On 03.09.20 08:40, Gökhan Imral wrote: Thanks for the quick response. I tried with a fix applied build and can see that memory is much more stable. Gokhan On 2 Sep 2020, at 12:51 P

Re: Staged PIP package mysteriously ungzipped, non-installable inside the worker

2020-08-11 Thread Maximilian Michels
Looks like you ran into a bug. You could just run your program without specifying any arguments, since running with Python's FnApiRunner should be enough. Alternatively, how about trying to run the same pipeline with the FlinkRunner? Use: --runner=FlinkRunner and do not specify an endpoint.

Re: Program and registration for Beam Digital Summit

2020-07-29 Thread Maximilian Michels
Thanks Pedro! Great to see the program! This is going to be an exciting event. Forwarding to the dev mailing list, in case people didn't see this here. -Max On 29.07.20 20:25, Pedro Galvan wrote: Hello! Just a quick message to let everybody know that we have published the program for the Be

Re: KafkaUnboundedReader

2020-07-29 Thread Maximilian Michels
Hi Dinh, The check only de-duplicates in case the consumer processes the same offset multiple times. It ensures the offset is always increasing. If this has been fixed in Kafka, which the comment assumes, the condition will never be true. Which Kafka version are you using? -Max On 29.07.2

Re: Unbounded sources unable to recover from checkpointMark when withMaxReadTime() is used

2020-07-28 Thread Maximilian Michels
to runner implementations. -Original Message----- From: Maximilian Michels Sent: Monday, July 27, 2020 3:04 PM To: user@beam.apache.org; Sunny, Mani Kolbe Subject: Re: Unbounded sources unable to recover from checkpointMark when withMaxReadTime() is used CAUTION: This email originated from outs

Re: Unbounded sources unable to recover from checkpointMark when withMaxReadTime() is used

2020-07-27 Thread Maximilian Michels
ically and generate batched outputs. Without ability to resume from a checkpoint, it will be reading entire stream every time. Regards, Mani -Original Message- From: Maximilian Michels mailto:m...@apache.org>> Sent: Tuesday, July 21, 2020 11:38 AM To:user

Re: Unbounded sources unable to recover from checkpointMark when withMaxReadTime() is used

2020-07-21 Thread Maximilian Michels
Hi Mani, BoundedReadFromUnboundedSource was originally intended to be used in batch pipelines. In batch, runners typically do not perform checkpointing. In case of failures, they re-run the entire pipeline. Keep in mind that, even with checkpointing, reading for a finite time in the processi

Re: ReadFromKafka: UnsupportedOperationException: The ActiveBundle does not have a registered bundle checkpoint handler

2020-07-09 Thread Maximilian Michels
This used to be working but it appears @FinalizeBundle (which KafkaIO requires) was simply ignored for portable (Python) pipelines. It looks relatively easy to fix. -Max On 07.07.20 03:37, Luke Cwik wrote: The KafkaIO implementation relies on checkpointing to be able to update the last commit

Re: Beam supports Flink Async IO operator

2020-07-08 Thread Maximilian Michels
Just to clarify: We could make the AsyncIO operator also available in Beam but the operator has to be represented by a concept in Beam. Otherwise, there is no way to know when to produce it as part of the translation. On 08.07.20 11:53, Maximilian Michels wrote: Flink's AsyncIO operat

Re: Beam supports Flink Async IO operator

2020-07-08 Thread Maximilian Michels
Flink's AsyncIO operator is useful for processing I/O-bound operations, e.g. sending network requests. Like Luke mentioned, it is not available in Beam. -Max On 07.07.20 22:11, Luke Cwik wrote: Beam is a layer that sits on top of execution engines like Flink and provides its own programming mod

Re: Apache Beam a Complete Guide - Review?

2020-06-30 Thread Maximilian Michels
"Streaming Systems" is a great book to understand the concepts behind Apache Beam and other streaming systems. However, it is not a book about how to write, deploy, or monitor Beam pipelines. Such a book is yet to come out. As for the "Apache Beam" book you linked, that one seems to be a fraud

Re: Flink/Portable Runner error on AWS EMR

2020-06-24 Thread Maximilian Michels
ainer output as > well but I assume that is just receiving the error message from the flink > cluster. > > Thanks, > Jesse > > On 6/23/20, 8:11 AM, "Maximilian Michels" wrote: > > Hey Jesse, > > Could you share the context of the erro

Re: Flink/Portable Runner error on AWS EMR

2020-06-23 Thread Maximilian Michels
Hey Jesse, Could you share the context of the error? Where does it occur? In the client code or on the cluster? Cheers, Max On 22.06.20 18:01, Jesse Lord wrote: > I am trying to run the wordcount quickstart example on a flink cluster > on AWS EMR. Beam version 2.22, Flink 1.10. > >   > > I get

Re: Error restoring Flink checkpoint

2020-06-23 Thread Maximilian Michels
O which has a >>>>>> java.time.Instant field and was encoded with AvroDecoder: >>>>>> java.lang.RuntimeException: java.lang.RuntimeException: >>>>>> java.lang.NoSuchMethodException: java.time.Instant.() >>>>>> at >>>

Re: Beam/Python/Flink: Unable to deserialize UnboundedSource for PubSub source

2020-06-17 Thread Maximilian Michels
You are using a proprietary connector which only works on Dataflow. You will have to use io.external.gcp.pubsub.ReadFromPubsub. PubSub support from Python is experimental. -Max On 09.06.20 06:40, Pradip Thachile wrote: > Quick update: this test code works just fine on Dataflow as well as the > D

Re: Error restoring Flink checkpoint

2020-06-05 Thread Maximilian Michels
ema+binary >> data saved into the checkpoint restoring the object. >> >> Is not like that? >> >> Also I don't understand when you said that a reference to the Beam >> Coder is saved into the checkpoint, because the error I'm getting is >> referencing the java model class (

Re: Error restoring Flink checkpoint

2020-06-04 Thread Maximilian Michels
is generating the AVRO schema >> using >> Java reflection, and then that generated schema is saved within the >> Flink checkpoint, right? >> >> On Wed, 2020-06-03 at 18:00 +0200, Maximilian Michels wrote: >>> Hi Ivan, >>> >>> Moving to the new t

Re: Error restoring Flink checkpoint

2020-06-04 Thread Maximilian Michels
is generating the AVRO schema using > Java reflection, and then that generated schema is saved within the > Flink checkpoint, right? > > On Wed, 2020-06-03 at 18:00 +0200, Maximilian Michels wrote: >> Hi Ivan, >> >> Moving to the new type serializer snapshot interface is

Re: Error restoring Flink checkpoint

2020-06-03 Thread Maximilian Michels
> KafkaIO or Protobuf. *I meant to say "Avro or Protobuf". On 03.06.20 18:00, Maximilian Michels wrote: > Hi Ivan, > > Moving to the new type serializer snapshot interface is not going to > solve this problem because we cannot version the coder through the Beam > c

Re: Error restoring Flink checkpoint

2020-06-03 Thread Maximilian Michels
Hi Ivan, Moving to the new type serializer snapshot interface is not going to solve this problem because we cannot version the coder through the Beam coder interface. That is only possible through Flink. However, it is usually not trivial. In Beam, when you evolve your data model, the only way yo

Re: Beam First Steps Workshop - 9 June

2020-06-03 Thread Maximilian Michels
Awesome! On 02.06.20 22:09, Austin Bennett wrote: > Hi Beam Users, > > Wanted to share the Workshop that I'll give at Berlin Buzzword's next > week:   > https://berlinbuzzwords.de/session/first-steps-apache-beam-writing-portable-pipelines-using-java-python-go > > Do consider joining if you are a

Re: Issue while submitting python beam pipeline on flink - local

2020-06-01 Thread Maximilian Michels
The logs indicate that you are not running the Docker-based execution but the `LOOPBACK` mode. In this mode the Flink cluster needs to connect to the machine that started the pipeline. That will not be possible unless you are running the Flink cluster on the same machine (we bind to `localhost` whi

Re: Flink Runner with HDFS

2020-05-29 Thread Maximilian Michels
"hdfs://...">> with beam.Pipeline(options=options) as p: > (p | 'ReadMyFile' >> beam.io.ReadFromText(input_file_hdfs) > |"WriteMyFile" >> beam.io.WriteToText(output_file_hdfs)) -Max On 28.05.20 17:00, Ramanan, Buvana (Nokia - US/Murray Hill

Re: [ANNOUNCE] Beam 2.21.0 Released

2020-05-29 Thread Maximilian Michels
Thanks Kyle! On 28.05.20 13:16, Kyle Weaver wrote: > The Apache Beam team is pleased to announce the release of version 2.21.0. > > Apache Beam is an open source unified programming model to define and > execute data processing pipelines, including ETL, batch and stream > (continuous) processing.

Re: Issue while submitting python beam pipeline on flink - local

2020-05-28 Thread Maximilian Michels
You are using the LOOPBACK environment which requires that the Flink cluster can connect back to your local machine. Since the loopback environment binds to localhost by default, that will not be possible. I'd suggest using the default Docker environment. On 28.05.20 14:06, Ashish Raghav wrote:
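A sketch of how the environment choice above could be expressed when submitting a portable pipeline. The flag names follow Beam's portable pipeline options; the job endpoint value and the helper function are illustrative placeholders.

```python
# Sketch: selecting the SDK harness environment for a portable Flink
# submission. The job endpoint is a placeholder; the helper is not
# Beam API, just a way to assemble the flags.
VALID_ENVIRONMENTS = ("DOCKER", "LOOPBACK", "EXTERNAL", "PROCESS")

def portable_args(environment_type="DOCKER"):
    if environment_type not in VALID_ENVIRONMENTS:
        raise ValueError("unknown environment_type: %s" % environment_type)
    return [
        "--runner=PortableRunner",
        "--job_endpoint=localhost:8099",  # placeholder endpoint
        # LOOPBACK needs the cluster to connect back to the submitting
        # machine; DOCKER runs the harness next to each TaskManager.
        "--environment_type=%s" % environment_type,
    ]
```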

Re: Issue while submitting python beam pipeline on flink cluster - local

2020-05-28 Thread Maximilian Michels
Potentially a Windows issue. Do you have a Unix environment for testing? On 28.05.20 13:35, Ashish Raghav wrote: > Hi Guys, > >   > > I have another issue when I submit the python beam pipeline ( wordcount > example provided by apache beam team) directly on flink cluster running > local. > >  

Re: Flink Runner with HDFS

2020-05-28 Thread Maximilian Michels
The configuration looks good but the HDFS file system implementation is not intended to be used directly. Instead of: > lines = p | 'ReadMyFile' >> beam.Create(hdfs_client.open(input_file_hdfs)) Use: > lines = p | 'ReadMyFile' >> beam.io.ReadFromText(input_file_hdfs) Best, Max On 28.05.20 0

Re: Webinar: Unlocking the Power of Apache Beam with Apache Flink

2020-05-28 Thread Maximilian Michels
Thanks to everyone who joined and asked questions. Really enjoyed this new format! -Max On 28.05.20 08:09, Marta Paes Moreira wrote: > Thanks for sharing, Aizhamal - it was a great webinar! > > Marta > > On Wed, 27 May 2020 at 23:17, Aizhamal Nurmamat kyzy > mailto:aizha...@apache.org>> wrote:

Re: New Dates For Beam Summit Digital 2020

2020-05-21 Thread Maximilian Michels
> We thank you for your understanding! See you soon! > > -Griselda Cuevas, Brittany Hermann, Maximilian Michels, Austin Bennett, > Matthias Baetens, Alex Van Boxel >

Re: How Beam coders match with runner serialization

2020-05-19 Thread Maximilian Michels
Hi Ivan, Beam does not use Java serialization for checkpoint data. It uses Beam coders which are wrapped in Flink's TypeSerializers. That said, Beam does not support serializer migration yet. I'm curious, what do you consider a "backwards-compatible" change? If you are attempting to upgrade the s

Re: Running NexMark Tests

2020-05-19 Thread Maximilian Michels
gards, > Sruthi > > On Tue, May 12, 2020 at 7:21 PM Maximilian Michels <mailto:m...@apache.org>> wrote: > > A heads-up if anybody else sees this, we have removed the flag: > https://jira.apache.org/jira/browse/BEAM-9900 > > Further contributions are very

Re: TextIO. Writing late files

2020-05-19 Thread Maximilian Michels
> > Thanks > Jose > > > El lun., 18 may. 2020 a las 18:37, Reuven Lax ( <mailto:re...@google.com>>) escribió: > > This is still confusing to me - why would the messages be dropped as > late in this case? > > On Mon, May 18, 2020 at 6:14 AM

Re: TextIO. Writing late files

2020-05-18 Thread Maximilian Michels
All runners which use the Beam reference implementation drop the PaneInfo for WriteFilesResult#getPerDestinationOutputFilenames(). That's why we can observe this behavior not only in Flink but also Spark. The WriteFilesResult is returned here: https://github.com/apache/beam/blob/d773f8ca7a4d63d014

Re: Running NexMark Tests

2020-05-12 Thread Maximilian Michels
A heads-up if anybody else sees this, we have removed the flag: https://jira.apache.org/jira/browse/BEAM-9900 Further contributions are very welcome :) -Max On 11.05.20 17:05, Sruthi Sree Kumar wrote: > I have opened a PR with the documentation change. > https://github.com/apache/beam/pull/11662

Re: Beam + Flink + Docker - Write to host system

2020-05-11 Thread Maximilian Michels
Hey Robbe, The issue with a higher parallelism is likely due to the single Python process which processes the data. You may want to use the `sdk_worker_parallelism` pipeline option which brings up multiple Python workers. Best, Max On 30.04.20 23:56, Robbe Sneyders wrote: > Yes, the task

Re: GC overhead limit exceeded

2020-05-11 Thread Maximilian Michels
Generally, it is to be expected that the main input is buffered until the side input is available. We really have no other option to correctly process the data. Have you tried using RocksDB as the state backend to prevent too much GC churn? -Max On 07.05.20 06:27, Eleanore Jin wrote: > Please se

Re: Set parallelism for each operator

2020-05-11 Thread Maximilian Michels
Beam and its Flink Runner do not allow setting the parallelism at the operator level. The wish to configure parallelism per operator has come up numerous times over the years. I'm not opposed to allowing for special cases, e.g. via a pipeline option. It doesn't look like it is necessary for the use case discusse

Re: Flink Runner with RequiresStableInput fails after a certain number of checkpoints

2020-05-05 Thread Maximilian Michels
 2.17.0, 2.18.0, 2.19.0, 2.20.0)? Or I have > to wait for the 2.20.1 release?  > > Thanks a lot! > Eleanore > > On Wed, Apr 22, 2020 at 2:31 AM Maximilian Michels <mailto:m...@apache.org>> wrote: > > Hi Eleanore, > > Exactly-once is not affected b

Re: Apache beam job on Flink checkpoint size growing over time

2020-04-30 Thread Maximilian Michels
the moment. > > Many thanks, > > Steve > > > Stephen Hesketh | Client Analytics Technology > S +44 (0)7968 039848 > + stephen.hesk...@natwestmarkets.co.uk > 250 Bishopsgate | London | EC2M 4AA > The information classification of this email is Confidential unless

Re: Running Nexmark for Flink Streaming

2020-04-29 Thread Maximilian Michels
tebackend (filesystem) using the config >> file in the config directory specified by the env variable >> ENV_FLINK_CONF_DIR. >> >> Regards, >> Sruthi >> >> On Tue, Apr 28, 2020 at 11:35 AM Maximilian Michels wrote: >>> >>> Hi Sruthi,

Re: Running Nexmark for Flink Streaming

2020-04-28 Thread Maximilian Michels
Hi Sruthi, Not possible out-of-the-box at the moment. You'll have to add the RocksDB Flink dependency in flink_runner.gradle, e.g.: compile "org.apache.flink:flink-statebackend-rocksdb_2.11:$flink_version" Also in the Flink config you have to set state.backend: rocksdb Then you can run Nex

Re: Apache beam job on Flink checkpoint size growing over time

2020-04-22 Thread Maximilian Michels
Hi Steve, The Flink Runner buffers data as part of the checkpoint. This was originally due to a limitation of Flink where we weren't able to end the bundle before we persisted the state for a checkpoint. This is due to how checkpoint barriers are emitted, I spare you the details*. Does the data i

Re: Beam Digital Summit 2020 -- JUNE 2020!

2020-04-22 Thread Maximilian Michels
🎉 Looking forward to this! Cheers, Max On 22.04.20 21:09, Austin Bennett wrote: > Hi All, > > We are excited to announce the Beam Digital Summit 2020! > > This will occur for partial days during the week of 15-19 June. > > CfP is open and found: https://sessionize.com/beam-digital-summit-2020/

Re: Running NexMark Tests

2020-04-22 Thread Maximilian Michels
The flag is needed when checkpointing is enabled because Flink is unable to create a new checkpoint when not all operators are running. By default, operators shut down when all input has been read. That will trigger sending out the maximum (final) watermark at the sources. The flag name is a bit c

Re: Flink Runner with RequiresStableInput fails after a certain number of checkpoints

2020-04-22 Thread Maximilian Michels
I assume this will impact the Exactly Once Semantics that beam provided > as in the KafkaExactlyOnceSink, the processElement method is also > annotated with @RequiresStableInput? > > Thanks a lot! > Eleanore > > On Tue, Apr 21, 2020 at 12:58 AM Maximilian Michels &

Re: Flink Runner with RequiresStableInput fails after a certain number of checkpoints

2020-04-21 Thread Maximilian Michels
Hi Stephen, Thanks for reporting the issue! David, good catch! I think we have to resort to only using a single state cell for buffering on checkpoints, instead of using a new one for every checkpoint. I was under the assumption that, if the state cell was cleared, it would not be checkpointed bu

Re: FlinkStateBackendFactory

2020-03-11 Thread Maximilian Michels
nding on your use case. > > Thanks a lot! > Eleanore > > On Thu, Mar 5, 2020 at 12:46 AM Maximilian Michels <mailto:m...@apache.org>> wrote: > > Hi Eleanore, > > Good question. I think the easiest way is to configure this in the > Flink >

Re: FlinkStateBackendFactory

2020-03-05 Thread Maximilian Michels
Hi Eleanore, Good question. I think the easiest way is to configure this in the Flink configuration file, i.e. flink-conf.yaml. Then you don't need to set anything in Beam. If you want to go with your approach, then just use getClass().getClassLoader() unless you have some custom classloader

Re: Beam 2.19.0 / Flink 1.9.1 - Session cluster error when submitting job "Multiple environments cannot be created in detached mode"

2020-03-02 Thread Maximilian Michels
20 at 1:36 PM Maximilian Michels <mailto:m...@apache.org>> wrote: Hi Tobi, That makes sense to me. My argument was coming from having "exactly-once" semantics for a pipeline. In this regard, the stop functionality does not help. But I think having the option to

Re: Beam 2.19.0 / Flink 1.9.1 - Session cluster error when submitting job "Multiple environments cannot be created in detached mode"

2020-03-02 Thread Maximilian Michels
ght now 31), rollout a new Flink version and start them at the point where they left of with their last committed offset in Kafka. Does that make sense? Best, Tobi On Sun, Mar 1, 2020 at 5:23 PM Maximilian Michels <mailto:m...@apache.org>> wrote: In some sense, stop is differ

Re: Beam 2.19.0 / Flink 1.9.1 - Session cluster error when submitting job "Multiple environments cannot be created in detached mode"

2020-03-01 Thread Maximilian Michels
-Modified: Thu, 27 Feb 2020 12:43:36 GMT Connection: keep-alive content-length: 2137 Best, Tobias On Fri, Feb 28, 2020 at 10:29 AM Maximilian Michels mailto:m...@apache.org>> wrote: The stop functionality has been removed in Beam. It was semantically

Re: Beam 2.19.0 / Flink 1.9.1 - Session cluster error when submitting job "Multiple environments cannot be created in detached mode"

2020-02-28 Thread Maximilian Michels
y to spin it up and test Best, Tobi On Mon, Feb 24, 2020 at 10:13 PM Maximilian Michels mailto:m...@apache.org>> wrote: Thank you for repor

Re: Beam 2.19.0 / Flink 1.9.1 - Session cluster error when submitting job "Multiple environments cannot be created in detached mode"

2020-02-24 Thread Maximilian Michels
Thank you for reporting / filing / collecting the issues. There is a fix pending: https://github.com/apache/beam/pull/10950 As for the upgrade issues, the 1.8 and 1.9 upgrade is trivial. I will check out the Flink 1.10 PR tomorrow. Cheers, Max On 24.02.20 09:26, Ismaël Mejía wrote: We are cu

Re: [BEAM] How does BEAM translate AccumulationMode to Flink Runner implementation?

2020-01-21 Thread Maximilian Michels
Hi Tison, Beam has its own set of libraries to implement windowing. Hence, the Flink Runner does not use Flink's windowing but deploys Beam's windowing logic within a Flink operator. If you want to look in the code, have a look at WindowDoFnOperator. Cheers, Max On 21.01.20 10:35, tison wro

Re: [Interactive Beam] Changes to local pipeline executions

2019-12-18 Thread Maximilian Michels
Thanks for the heads-up, Ning! I haven't tried out interactive Beam, but this puts it back on my radar :) Cheers, Max On 04.12.19 20:45, Ning Kang wrote: *If you are not an Interactive Beam user, you can ignore this email.* Hi Interactive Beam users, We've recently made some changes to ho

Re: List Admins: Please unsubscribe me (automatic unsubscribe fails)

2019-12-16 Thread Maximilian Michels
Hi Benjamin, There is a way to do this yourself: To unsubscribe a different e-mail - e.g. you used to be subscribed as user@oldname.example - send a message to list-unsubscribe-user=oldname.exam...@apache.org Source: https://apache.org/foundation/mailinglists.html So send an email to: lis

Re: No filesystem found for scheme s3 using FileIO

2019-12-11 Thread Maximilian Michels
9, 10:34 AM, "Maximilian Michels" mailto:m...@apache.org>> wrote:     Hi Preston,     Sorry about the name mixup, of course I meant to write Preston not     Magnus :) See my reply below.     cheers,     Max     On 25.09.19 08:31, Maximi

Re: How side input will work for streaming application in apache beam.

2019-10-22 Thread Maximilian Michels
Hi Jitendra, Side inputs are materialized based on their windowing. If you assign a 10 second window to the side inputs, they can be renewed every 10 seconds. Whenever you access the side input, the newest instance of the side input will be retrieved. Cheers, Max On 22.10.19 10:42, jitendra

Re: Python Portable Runner Issues

2019-10-01 Thread Maximilian Michels
Probably the most stable is running on Dataflow still. But I’m excited to see the progress towards a Spark runner, can’t wait to try TFT on it :) That is debatable. It is also hard to compare because Dataflow is a managed service, whereas you'll have to spin up your own cluster for other Runn

Re: No filesystem found for scheme s3 using FileIO

2019-09-25 Thread Maximilian Michels
Hi Preston, Sorry about the name mixup, of course I meant to write Preston not Magnus :) See my reply below. cheers, Max On 25.09.19 08:31, Maximilian Michels wrote: Hi Magnus, Your observation seems to be correct. There is an issue with the file system registration. The two types of

Re: No filesystem found for scheme s3 using FileIO

2019-09-25 Thread Maximilian Michels
;m not too familiar about Flink sure why S3 is not properly being registered when running the Flink job. Ccing some folks who are more familiar about Flink. +Ankur Goenka <mailto:goe...@google.com> +Maximilian Michels <mailto:m...@apache.org> Thanks, Cham On Sat, Sep 21, 2019 at 9:1

Re: Flink Runner logging FAILED_TO_UNCOMPRESS

2019-09-19 Thread Maximilian Michels
That's even better. On 19.09.19 16:35, Robert Bradshaw wrote: On Thu, Sep 19, 2019 at 4:33 PM Maximilian Michels wrote: This is obviously less than ideal for the user... Should we "fix" the Java SDK? Of is the long-terms solution here to have runners do this rewrite? I th

Re: Flink Runner logging FAILED_TO_UNCOMPRESS

2019-09-19 Thread Maximilian Michels
paths for Reads. On 19.09.19 11:46, Robert Bradshaw wrote: On Thu, Sep 19, 2019 at 11:22 AM Maximilian Michels wrote: The flag is insofar relevant to the PortableRunner because it affects the translation of the pipeline. Without the flag we will generate primitive Reads which are unsupported in p

Re: Flink Runner logging FAILED_TO_UNCOMPRESS

2019-09-19 Thread Maximilian Michels
be by just setting the flag in Dataflow runner. It requires some amount of clean up. I agree that it would be good to clean this up, and I also agree to not rush this especially if this is not currently impacting users. Ahmet On Wed, Sep 18, 2019 at 12:56 PM Maximilian Michels <mailt

Re: Flink Runner logging FAILED_TO_UNCOMPRESS

2019-09-18 Thread Maximilian Michels
. Kyle Weaver | Software Engineer | github.com/ibzib <http://github.com/ibzib> | kcwea...@google.com <mailto:kcwea...@google.com> On Tue, Sep 17, 2019 at 3:38 PM Ahmet Altay <mailto:al...@google.com>> wrote: On Tue, Sep 17, 2019 at 2:26 PM Maximilian Mi

Re: Flink Runner logging FAILED_TO_UNCOMPRESS

2019-09-17 Thread Maximilian Michels
here are for the Create and Read transforms.) > On Tue, Sep 17, 2019 at 11:33 AM Maximilian Michels mailto:m...@apache.org>> wrote: >> >> +dev >> >> The beam_fn_api flag and

Re: Flink Runner logging FAILED_TO_UNCOMPRESS

2019-09-17 Thread Maximilian Michels
+dev The beam_fn_api flag and the way it is automatically set is error-prone. Is there anything that prevents us from removing it? I understand that some Runners, e.g. Dataflow Runner have two modes of executing Python pipelines (legacy and portable), but at this point it seems clear that the

Re: How do I run Beam Python pipelines using Flink deployed on Kubernetes?

2019-09-11 Thread Maximilian Michels
Hi Andrea, You could use the new worker_pool_main.py which was developed for the Kubernetes use case. It works together with the external environment factory. Cheers, Max On 11.09.19 18:51, Lukasz Cwik wrote: Yes. On Wed, Sep 11, 2019 at 3:12 AM Andrea Medeghini

Re: Hackathon @BeamSummit @ApacheCon

2019-08-29 Thread Maximilian Michels
Hey, I'm in as well! Austin and I recently talked about how we could organize the hackathon. Likely it will be an hour per day for exchanging ideas and learning about Beam. For example, there has been interest from the Apache Streams project to discuss points for collaboration. We will soon

Re: [External] Re: --state_backend PipelineOption not supported in python when running on Flink

2019-08-28 Thread Maximilian Michels
Maximilian Michels <mailto:m...@apache.org>> wrote: Hi Catlyn, This option has never worked outside the Java SDK where it originates from. For the upcoming Beam 2.16.0 release, we have replaced this option with a factory class: https://github.com/apache/

Re: --state_backend PipelineOption not supported in python when running on Flink

2019-08-27 Thread Maximilian Michels
Hi Catlyn, This option has never worked outside the Java SDK where it originates from. For the upcoming Beam 2.16.0 release, we have replaced this option with a factory class: https://github.com/apache/beam/blob/da6c1a8f435f5583811785050808a2311db94047/runners/flink/src/main/java/org/apache/be

Re: Using Beam 2.14.0 + Flink 1.8 with RocksDB state backend - perhaps missing dependency?

2019-08-14 Thread Maximilian Michels
ias > mailto:tobias.kay...@ricardo.ch>> > wrote: > > * each time :) > > On Mon, Aug 12, 2019 at 9:48 PM Kaymak, Tobias > <mailto:tobias.kay...@ricardo.ch>> wrote:

Dropping support for Flink 1.5/1.6

2019-08-13 Thread Maximilian Michels
Dear users of the Flink Runner, Just a heads-up that we may remove Flink 1.5/1.6 support for future version of Beam (>=2.16.0): https://jira.apache.org/jira/browse/BEAM-7962 Why? We support 1.5 - 1.8 and have a 1.9 version in the making. The increased number of supported versions manifests in a m

Re: Using Beam 2.14.0 + Flink 1.8 with RocksDB state backend - perhaps missing dependency?

2019-08-12 Thread Maximilian Michels
Hi Tobias! I've checked if there were any relevant changes to the RocksDB state backend in 1.8.1, but I couldn't spot anything. Could it be that an old version of RocksDB is still in the Flink cluster path? Cheers, Max On 06.08.19 16:43, Kaymak, Tobias wrote: > And of course the moment I click

Re: [python] ReadFromPubSub broken in Flink

2019-07-13 Thread Maximilian Michels
he source along the lines of the KafkaIO PRs to >work >with PubSubIO, or is it already supported via some flag? > >-chad > > >On Sat, Jul 13, 2019 at 8:43 AM Maximilian Michels >wrote: > >> Hi Chad, >> >> This stub will only be replaced by the Dataflow serv

Re: [python] ReadFromPubSub broken in Flink

2019-07-13 Thread Maximilian Michels
Hi Chad, This stub will only be replaced by the Dataflow service. It's an artifact of the pre-portability era. That said, we now have the option to replace ReadFromPubSub with an external transform which would utilize Java's PubSubIO via the new cross-language feature. Thanks, Max On 12.07.1

Re: Why is my RabbitMq message never acknowledged ?

2019-06-14 Thread Maximilian Michels
n, for such crucial behavior i would expect the pipeline >     to fail with a clear message stating the reason, like in the same >     way when you implement a new Codec and forget to override the >     verifyDeterministic method (don't recall the right name of it). > >    

Re: Why is my RabbitMq message never acknowledged ?

2019-06-14 Thread Maximilian Michels
This has come up before: https://issues.apache.org/jira/browse/BEAM-4520 The issue is that checkpoints won't be acknowledged if checkpointing is disabled in Flink. We throw a WARN when unbounded sources are used without checkpointing. Not all unbounded sources actually need to finalize checkpoin
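A sketch of the remedy implied above: enable Flink checkpointing so checkpoint finalization runs and the source can acknowledge messages. `checkpointing_interval` follows the Flink runner's option name; the helper function and the interval value are illustrative.

```python
# Sketch: turning on Flink checkpointing so checkpoint finalization runs
# and sources like the RabbitMQ reader can acknowledge messages.
# The helper is illustrative, not Beam API.
def flink_streaming_args(checkpoint_interval_ms):
    if checkpoint_interval_ms <= 0:
        raise ValueError("checkpointing must be enabled for sources that "
                         "acknowledge messages on checkpoint finalization")
    return [
        "--runner=FlinkRunner",
        "--streaming",
        "--checkpointing_interval=%d" % checkpoint_interval_ms,
    ]
```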

Re: Parallel computation of windows in Flink

2019-06-12 Thread Maximilian Michels
ferent - all > windows are considered independent and there isn't a global state. I'm > not 100% sure, but it seems like the translation from Beam to Flink > could potentially include the windows in the key of `.keyBy` to allow > parallelizing window computations. Does this

Re: Parallel computation of windows in Flink

2019-06-11 Thread Maximilian Michels
c of the data. > >     If possible, can you share your pipeline code at a high level. > >     On Mon, Jun 10, 2019 at 12:58 PM Mike Kaplinskiy >     mailto:m...@ladderlife.com>> wrote: > > >     Ladder <http://bit.ly/1VRtWfS>. The smart, modern way to insure >     y

Re: Python sdk performance

2019-06-10 Thread Maximilian Michels
Hi Mingliang, You can increase the parallelism of the Python SDK Harness via the pipeline option --experiments worker_threads= Note that the workers are Python threads which suffer from the Global Interpreter Lock. We currently do not use real processes, e.g. via multiprocessing. There is

Re: Parallel computation of windows in Flink

2019-06-10 Thread Maximilian Michels
Hi Mike, If you set the number of shards to 1, you should get one shard per window; unless you have "ignore windows" set to true. > (The way I'm checking this is via the Flink UI) I'm curious, how do you check this via the Flink UI? Cheers, Max On 09.06.19 22:33, Mike Kaplinskiy wrote: > Hi e

Re: [ANNOUNCE] Apache Beam 2.13.0 released!

2019-06-10 Thread Maximilian Michels
Thanks for managing the release, Ankur! @Chad Thanks for the feedback. I agree that we can improve our release notes. The particular issue you were looking for was part of the detailed list [1] linked in the blog post: https://jira.apache.org/jira/browse/BEAM-7029 Cheers, Max [1] https://jira

Re: Question about --environment_type argument

2019-05-29 Thread Maximilian Michels
rs/flink/FlinkJobServerDriver.java#L63 >> >> On Tue, May 28, 2019 at 11:39 AM 青雉(祁明良) <m...@xiaohongshu.com> wrote: >>> >>> Yes, I did (2). Since the job server successfully >>> created the artifact directory, I think I did it cor

Re: Question about --environment_type argument

2019-05-28 Thread Maximilian Michels
e the artifact directory is successfully created at HDFS by job server, but fails at task manager when reading. Best, Mingliang Get Outlook for iOS <https://aka.ms/o0ukef> On Tue, May 28, 2019 at 11:47 PM +0800, "Maximilian Michels" <m...@apache.org> wrote: Rece

Re: Question about --environment_type argument

2019-05-28 Thread Maximilian Michels
Recent versions of Flink do not bundle Hadoop anymore, but they are still "Hadoop compatible". You just need to include the Hadoop jars in the classpath. Beam's Hadoop modules do not bundle Hadoop either; they just provide Beam file system abstractions which are similar to Flink's "Hadoop compatibilit
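For a "Hadoop-free" Flink distribution, the usual way to put the Hadoop jars on the classpath (per Flink's own setup docs) is a hedged sketch like the following, which assumes a local Hadoop installation providing the `hadoop` CLI:

```shell
# Make Hadoop classes visible to Flink before starting the cluster.
export HADOOP_CLASSPATH=$(hadoop classpath)
# Flink picks up HADOOP_CLASSPATH at startup.
./bin/start-cluster.sh
```

The same environment variable must be set wherever TaskManagers are launched, or HDFS reads will fail there even though the JobManager works.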

Re: Question about --environment_type argument

2019-05-27 Thread Maximilian Michels
d language. Cheers, Mingliang On 27 May 2019, at 6:53 PM, Maximilian Michels wrote: Hi Mingliang, The environment is created for each TaskManager. For docker, will it create one docker per flink taskmanager? Yes. For process, does it mean start a python process to run the user co

Re: How to setup portable flink runner with remote flink cluster

2019-05-27 Thread Maximilian Michels
LOOPBACK is a special testing URN and won't work unless you start a Python SDK Harness worker. Like Kyle said, I would just use the default. You will have to have support for Docker on your deployment machines. Or work around this using a process-based environment. Thanks, Max On 24.05.19 17

Re: Question about --environment_type argument

2019-05-27 Thread Maximilian Michels
Hi Mingliang, The environment is created for each TaskManager. For docker, will it create one docker per flink taskmanager? Yes. For process, does it mean start a python process to run the user code? And it seems "command" should be set in the environment config, but what should it
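A hedged sketch of a process-based environment submission, where the "command" in the environment config points at a script that starts the Python SDK harness worker (the endpoint and script path below are placeholders and the script must exist on every TaskManager host):

```shell
python my_pipeline.py \
  --runner=PortableRunner \
  --job_endpoint=localhost:8099 \
  --environment_type=PROCESS \
  --environment_config='{"command": "/path/to/sdk_worker_startup.sh"}'
```

With `--environment_type=DOCKER` (the default), no command is needed; the runner pulls and starts an SDK harness container per TaskManager instead.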

Re: Problem running a pipeline on a remote Flink cluster

2019-05-14 Thread Maximilian Michels
when I tested it for the WordCount example it ran without problems. Also, it runs successfully if I run on [local], if I am using incompatible versions this should fail right? Regards, Jorik -Original Message- From: Maximilian Michels Sent: Monday, May 13, 2019 12:37 PM To: user

Re: Wordcount using Python with Flink runner and Kafka source

2019-05-14 Thread Maximilian Michels
Just saw that the malformed master URL was due to HTML formatting. It looks ok. Please check your Flink JobManager logs. The JobManager might not be reachable and the submission is just blocked on it becoming available. Thanks, Max On 13.05.19 20:05, Maximilian Michels wrote: Hi Averell

Re: Wordcount using Python with Flink runner and Kafka source

2019-05-13 Thread Maximilian Michels
e more time and then come back if I am still stuck with it. Again, thanks a lot for your help. Regards, Averell On Sat, 11 May 2019, 12:35 am Maximilian Michels <m...@apache.org> wrote: Hi Averell, What you want to do is possible today but at this point

Re: Problem running a pipeline on a remote Flink cluster

2019-05-13 Thread Maximilian Michels
Hi Jorik, It looks like the version of the Flink Runner and the Flink cluster version do not match. For example, if you use Flink 1.7, make sure to use the beam-runners-flink-1.7 artifact. For more information: https://beam.apache.org/documentation/runners/flink/ Thanks, Max On 12.05.19 18:
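Concretely, the runner artifact's minor version must match the cluster's Flink minor version. A hedged Maven sketch for a Flink 1.7.x cluster (the Beam version shown is from the era discussed in this thread; adjust both to your setup):

```xml
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-flink-1.7</artifactId>
  <version>2.13.0</version>
</dependency>
```

Using, say, `beam-runners-flink-1.8` against a 1.7 cluster typically fails at job submission with serialization or class-mismatch errors.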

Re: Wordcount using Python with Flink runner and Kafka source

2019-05-10 Thread Maximilian Michels
Hi Averell, What you want to do is possible today but at this point is an early experimental feature. The reason for that is that Kafka is a cross-language Java transform in a Python pipeline. We just recently enabled cross-language pipelines. 1. First of all, until 2.13.0 is released you wi
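A hedged sketch of what such a pipeline looks like on the Python side (not runnable as-is: it assumes a Beam version with the cross-language Kafka transform, a running Flink job server, and a Kafka broker; the endpoint, broker address, and topic are placeholders):

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",  # placeholder job server address
])

with beam.Pipeline(options=options) as p:
    (p
     # Expanded by the Java SDK as a cross-language external transform.
     | ReadFromKafka(
           consumer_config={"bootstrap.servers": "localhost:9092"},
           topics=["my-topic"])  # placeholder topic
     | beam.Map(print))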

Re: Apache BEAM on Flink in production

2019-05-08 Thread Maximilian Michels
Hi Stephen, Apart from Lyft there are many Beam users on Flink. I'm not sure I can publicly say their names, but I have been working with different companies running multiple production pipelines with Beam on Flink. Since you mentioned ETL, most of them are actually using streaming pipelines

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-03 Thread Maximilian Michels
x On 02.05.19 22:44, Jan Lukavský wrote: Hi Max, comments inline. On 5/2/19 3:29 PM, Maximilian Michels wrote: Couple of comments: * Flink transforms It wouldn't be hard to add a way to run arbitrary Flink operators through the Beam API. Like you said, once you go down that road, you
