BEAM-2217 NotImplementedError - DataflowRunner parsing Protos from PubSub (Python SDK)

2020-06-08 Thread Lien Michiels
Hi everyone, First time writing to the email list, so please tell me if I'm doing this all wrong. I'm building a streaming pipeline, to be run on the DataflowRunner, that reads from PubSub and writes to BQ using the Python 3 SDK. I can get the pipeline started fine with the DirectRunner, but as soon
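
A minimal sketch of the pipeline shape being described, parsing the proto by hand in a Map step instead of relying on a proto coder (a common way around coder NotImplementedErrors); the message_pb2 module, field names, and resource paths are placeholders, not the actual code from the thread:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder: a protoc-generated module standing in for the real proto.
    from my_protos import message_pb2

    def parse_proto(raw_bytes):
        # ReadFromPubSub yields raw payload bytes; parsing them explicitly
        # keeps proto coders out of the pipeline graph.
        event = message_pb2.Event()
        event.ParseFromString(raw_bytes)
        return {'id': event.id, 'value': event.value}

    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         | 'Read' >> beam.io.ReadFromPubSub(
             subscription='projects/<project>/subscriptions/<sub>')
         | 'Parse' >> beam.Map(parse_proto)
         | 'Write' >> beam.io.WriteToBigQuery(
             '<project>:<dataset>.<table>',
             schema='id:STRING,value:FLOAT'))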

Re: Error restoring Flink checkpoint

2020-06-08 Thread Ivan San Jose
Finally I've managed to modify Beam's AvroCoder so that it does not serialize any Class reference of the object being encoded/decoded, and I could successfully restore a checkpoint after adding a field to the POJO model. I think it would be useful for everyone, as the current AvroCoder is not really useful wh

Re: Error restoring Flink checkpoint

2020-06-08 Thread Reuven Lax
Max, can you explain why Flink serializes the coders in the checkpoint? Dataflow on update uses the new graph, so doesn't hit this problem. On Mon, Jun 8, 2020 at 7:21 AM Ivan San Jose wrote: > Finally I've managed to modify Beam's AvroCoder in order not to > serialize any Class reference of the

Re: Error restoring Flink checkpoint

2020-06-08 Thread Ivan San Jose
This simple test breaks:

    import java.time.Instant;

    public class PacoTest {
      private static class Pojo {
        Instant instant;

        public Pojo() {
          this.instant = Instant.now();
        }
      }

      @Test
      public void paco() throws IOException {
        Coder<Pojo> coder = AvroCoder.of(Pojo.class);

Re: Error restoring Flink checkpoint

2020-06-08 Thread Ivan San Jose
Hi Reuven, as far as I've understood, Apache Beam coders are wrapped into Flink's TypeSerializers, so they are being serialized as part of the checkpoint according to https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/typeutils/TypeSerializerConfigSnapsho

Re: Error restoring Flink checkpoint

2020-06-08 Thread Reuven Lax
Maybe instead of wrapping the serialized Coder in the TypeSerializer, we could wrap the Coder URN instead? On Mon, Jun 8, 2020 at 7:37 AM Ivan San Jose wrote: > Hi Reuven, as far as I've understood, Apache Beam coders are wrapped into > Flink's TypeSerializers, so they are being serialized as part o

Re: BEAM-2217 NotImplementedError - DataflowRunner parsing Protos from PubSub (Python SDK)

2020-06-08 Thread Brian Hulette
Hi Lien,

> First time writing to the email list, so please tell me if I'm doing this all wrong.

Not at all! This is exactly the kind of question this list is for. I have a couple of questions that may help us debug:
- Can you share the full stacktrace?
- What version of Beam are you using? There wer

Re: Writing pipeline output to google sheet in google drive

2020-06-08 Thread Luke Cwik
It doesn't look like BigQuery supports exporting to Google Sheets[1], but maybe you can invoke this BQ connector directly by adding a transform that follows the BQ sink.

1: https://cloud.google.com/bigquery/docs/exporting-data#export_limitations

On Sat, Jun 6, 2020 at 8:31 PM OrielResearch Eila Arich-
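
In that spirit, a rough sketch of a DoFn that appends rows to a sheet through the Sheets v4 API as a follow-up step to the BQ write; the spreadsheet id, range, and row fields are placeholders, and credential setup (application-default credentials are assumed) is omitted:

    import apache_beam as beam
    from googleapiclient.discovery import build

    class AppendToSheet(beam.DoFn):
        def setup(self):
            # Assumes application-default credentials with Sheets access.
            self.sheets = build('sheets', 'v4')

        def process(self, row):
            # Append one pipeline element as one spreadsheet row.
            self.sheets.spreadsheets().values().append(
                spreadsheetId='<spreadsheet-id>',  # placeholder
                range='Sheet1!A1',
                valueInputOption='RAW',
                body={'values': [[row['name'], row['count']]]},
            ).execute()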

Re: Writing pipeline output to google sheet in google drive

2020-06-08 Thread Austin Bennett
@OrielResearch Eila Arich-Landkof Depending on your needs, I wonder about establishing a sheet (or sheets, as needed) that has a BQ connector for the datasource of it. If you use Dataflow to write/create a BQ table, that would then hydrate the sheet (not sure the ordering -- maybe you'd need to

Ensuring messages are processed and emitted in-order

2020-06-08 Thread Hadi Zhang
We are using the Beam 2.20 Python SDK on a Flink 1.9 runner. Our messages originate from a custom source that consumes messages from a Kafka topic and emits them in the order of their Kafka offsets to a DoFn. After this DoFn processes the messages, they are emitted to a custom sink that sends messa
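
Beam makes no ordering promises between a source and downstream transforms, so threads like this usually land on per-key buffering with state and timers. A rough sketch of that pattern, assuming elements arrive as (partition, (offset, value)) pairs in a windowed PCollection and the runner supports user state; illustrative only, not the poster's code:

    import apache_beam as beam
    from apache_beam.coders.coders import PickleCoder
    from apache_beam.transforms.timeutil import TimeDomain
    from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer

    class EmitInOffsetOrder(beam.DoFn):
        # Buffer (offset, value) pairs per key; emit values sorted by offset
        # once the watermark reaches the end of the window.
        BUFFER = BagStateSpec('buffer', PickleCoder())
        FLUSH = TimerSpec('flush', TimeDomain.WATERMARK)

        def process(self,
                    element,  # (partition, (offset, value))
                    window=beam.DoFn.WindowParam,
                    buffer=beam.DoFn.StateParam(BUFFER),
                    flush=beam.DoFn.TimerParam(FLUSH)):
            _, offset_and_value = element
            buffer.add(offset_and_value)
            flush.set(window.end)

        @on_timer(FLUSH)
        def flush_buffer(self, buffer=beam.DoFn.StateParam(BUFFER)):
            for _, value in sorted(buffer.read()):
                yield value
            buffer.clear()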

Re: Using KinesisIO to put records in Firehose

2020-06-08 Thread Alexey Romanenko
+ cross-posting to user@beam.apache.org Under the hood, KinesisIO uses IKinesisProducer to write to a Kinesis stream. I'm not very familiar with the Firehose API, so I'm not sure that KPL supports writes directly to Firehose. At the same time, a Kinesis stream can be used

Re: Python SDK ReadFromKafka: Timeout expired while fetching topic metadata

2020-06-08 Thread Piotr Filipiuk
Pasting the error inline:

    ERROR:root:severity: ERROR
    timestamp { seconds: 1591405163 nanos: 81500 }
    message: "Client failed to dequeue and process the value"
    trace: "org.apache.beam.sdk.util.UserCodeException: java.lang.NoClassDefFoundError: org/springframework/expression/EvaluationContext

Streaming Beam jobs keep restarting on Spark/Kubernetes?

2020-06-08 Thread Joseph Zack
Anybody out there running Beam on Spark? I am pulling data from a Kafka topic with KafkaIO, but the job keeps restarting. There is no error, it just:
1. creates the driver
2. creates the executors
3. runs for a few seconds
4. terminates the executors
5. terminates the driver

Re: Streaming Beam jobs keep restarting on Spark/Kubernetes?

2020-06-08 Thread Kyle Weaver
> There is no error

Are you sure? That sounds like a crash loop to me. It might take some digging through various Kubernetes logs to find the cause. Can you provide more information about how you're running the job? On Mon, Jun 8, 2020 at 1:50 PM Joseph Zack wrote: > Anybody out there running

Re: Streaming Beam jobs keep restarting on Spark/Kubernetes?

2020-06-08 Thread Joseph Zack
I'm submitting the job via an operator provided by Google Cloud Platform. Here's a rough sample showing how I do it - though this is just with the wordcount sample: https://github.com/THEjoezack/beam-on-spark-on-kubernetes The driver shows as "Completed" before it starts again, but I'll dig deepe

Re: Python SDK ReadFromKafka: Timeout expired while fetching topic metadata

2020-06-08 Thread Chamikara Jayalath
Seems like the Java dependency is not being properly set up when running the cross-language Kafka step. I don't think this was available for Beam 2.21. Can you try with the latest Beam HEAD, or Beam 2.22 when it's released? +Heejong Lee On Mon, Jun 8, 2020 at 12:39 PM Piotr Filipiuk wrote: > Pastin

Re: Python SDK ReadFromKafka: Timeout expired while fetching topic metadata

2020-06-08 Thread Chamikara Jayalath
To clarify, the Kafka dependency was already available as an embedded dependency in the Java SDK Harness, but I'm not sure if this worked for DirectRunner. Starting with 2.22, we'll be staging dependencies from the environment during pipeline submission. On Mon, Jun 8, 2020 at 3:23 PM Chamikara Jayalath wrote: > S

Beam/Python/Flink: Unable to deserialize UnboundedSource for PubSub source

2020-06-08 Thread Pradip Thachile
Hey folks, I posted this on the Flink user mailing list but didn't get any traction there (potentially since this is Beam related?). I've got a Beam/Python pipeline that works on the DirectRunner and now am trying to run this on a local dev Flink cluster. Running this yields an error out the g
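
For reference, pointing a Python pipeline at a local Flink cluster usually looks roughly like the options below (address and flags are illustrative). Note that the Python PubSub source has historically depended on runner-specific overrides, so an UnboundedSource deserialization error on the portable Flink runner may not be fixable through options alone:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        '--runner=FlinkRunner',
        '--flink_master=localhost:8081',  # illustrative JobManager address
        '--environment_type=LOOPBACK',    # run the SDK harness in-process
        '--streaming',
    ])

    with beam.Pipeline(options=options) as pipeline:
        ...  # the PubSub-reading pipeline under test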

Re: Python SDK ReadFromKafka: Timeout expired while fetching topic metadata

2020-06-08 Thread Heejong Lee
DirectRunner is not well-tested for xlang transforms, and you need to specify the jar_packages experimental flag for Java dependencies from the Python SDK. I'd recommend using 2.22 + FlinkRunner for xlang pipelines. On Mon, Jun 8, 2020 at 3:27 PM Chamikara Jayalath wrote: > To clarify, Kafka dependency w
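
Following that recommendation, a minimal sketch of the cross-language Kafka read from Python on Beam 2.22+ with the FlinkRunner; the broker address and topic are placeholders:

    import apache_beam as beam
    from apache_beam.io.kafka import ReadFromKafka
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(['--runner=FlinkRunner', '--streaming'])

    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         | ReadFromKafka(
             consumer_config={'bootstrap.servers': 'localhost:9092'},
             topics=['my-topic'])
         | beam.Map(print))  # elements arrive as (key, value) byte pairs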