Re: Running apache_beam python sdk without c/c++ libs

2020-06-10 Thread Luke Cwik
Most runners are written in Java while others are cloud offerings which wouldn't work for your use case which limits you to use the direct runner (not meant for production/high performance applications). Beam Python SDK uses cython for performance reasons but I don't believe it strictly requires it

Re: KafkaIO Read Latency

2020-06-10 Thread Alexey Romanenko
Hi Talat, Could you elaborate what do you mean by “opening and closing bundle”? Sometimes, starting a KafkaReader can take time since it will seek for a start offset for each assigned partition but it happens only once at pipeline start-up and mostly depends on network conditions. > On 9 Jun 2

Re: KafkaIO Read Latency

2020-06-10 Thread Talat Uyarer
Hi, I check the time when StartBundle is called and do the same thing for FinishBundle then take the difference between Start and Finish Bundle times and report bundle latency. I put this metric on a step(KafkaMessageExtractor) which is right after the KafkaIO step. I dont know if this is related,

Re: Running apache_beam python sdk without c/c++ libs

2020-06-10 Thread Luke Cwik
I'm not sure. It depends on whether the Spark -> Beam Python integration will interfere with the magic built into AWS Glue. On Wed, Jun 10, 2020 at 8:57 AM Noah Goodrich wrote: > I was hoping to use the Spark runner since Glue is just Spark with some > magic on top. And in our specific use case,

Re: How to safely update jobs in-flight using Apache Beam on AWS EMR?

2020-06-10 Thread Dan Hill
Hi! I found great docs about Apache Beam on Dataflow (which makes sense). I was not able to find this about AWS EMR. https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline https://medium.com/google-cloud/restarting-cloud-dataflow-in-flight-9c688c49adfd

[ANNOUNCE] Beam 2.22.0 Released

2020-06-10 Thread Brian Hulette
The Apache Beam team is pleased to announce the release of version 2.22.0. Apache Beam is an open source unified programming model to define and execute data processing pipelines, including ETL, batch and stream (continuous) processing. See https://beam.apache.org You can download the release her

Re: How to safely update jobs in-flight using Apache Beam on AWS EMR?

2020-06-10 Thread Luke Cwik
The runner needs to support it and I'm not aware of an EMR runner for Apache Beam let alone one that supports pipeline update. Have you tried reaching out to AWS? On Wed, Jun 10, 2020 at 11:14 AM Dan Hill wrote: > Hi! I found great docs about Apache Beam on Dataflow (which makes > sense). I wa

Re: BEAM-2217 NotImplementedError - DataflowRunner parsing Protos from PubSub (Python SDK)

2020-06-10 Thread Brian Hulette
+Udi Meiri - do you have any suggestions to resolve this? It looks like there are a couple things going wrong: - ProtoCoder doesn't have a definition for to_type_hint - DataflowRunner calls from_runner_api, which I believe is a workaround we can eventually remove (+Robert Bradshaw ), which in tur

Re: How to safely update jobs in-flight using Apache Beam on AWS EMR?

2020-06-10 Thread Dan Hill
No. I just sent AWS Support a message. On Wed, Jun 10, 2020 at 1:00 PM Luke Cwik wrote: > The runner needs to support it and I'm not aware of an EMR runner for > Apache Beam let alone one that supports pipeline update. Have you tried > reaching out to AWS? > > On Wed, Jun 10, 2020 at 11:14 AM D

Re: How to safely update jobs in-flight using Apache Beam on AWS EMR?

2020-06-10 Thread Austin Bennett
Hi Dan, AWS EMR generally runs Flink and/or Spark as supported Beam Runners. For EMR, you might want to check compatibility for versions of Beam/Flink can run, and the status of beam pipelines using either of those runners. On running Beam in AWS, had you seen: https://www.youtube.com/watch?v=eC

Re: BEAM-2217 NotImplementedError - DataflowRunner parsing Protos from PubSub (Python SDK)

2020-06-10 Thread Udi Meiri
I don't believe you need to register a coder unless you create your own coder class. Since the coder has a proto_message_type field, it could return that in to_type_hint, but I'll defer to Robert since I'm unfamiliar with that area. As a temporary workaround, please try overriding the output type

Re: How to safely update jobs in-flight using Apache Beam on AWS EMR?

2020-06-10 Thread Dan Hill
Sweet. I have not seen that video. Cool. I'm curious about how well AWS's managed services (like the Kinesis Data Analytics managed Flink runner) handle the updates. I'd guess it is best effort from the saved state (if enabled). If this is all delegated by Beam to Flink, then this is more of a