Continuous Read pipeline

2020-06-12 Thread TAREK ALSALEH
Hi, I am using the Python SDK with Dataflow as my runner. I am looking at implementing a streaming pipeline that will continuously monitor a GCS bucket for incoming files and, depending on which regex each file name matches, launch a set of transforms and write the final output back to Parquet for each file.
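The poster's pipeline is not shown, but the regex-based routing step can be sketched as a plain function (the patterns and branch names below are hypothetical), analogous to what a `beam.Partition` or tagged outputs would do over file metadata; newer Python SDK releases also provide `fileio.MatchContinuously` for this kind of bucket polling:

```python
import re

# Hypothetical file-name patterns -> branch names; in the real pipeline
# each branch would map to its own set of transforms and a Parquet sink.
ROUTES = [
    (re.compile(r"^sales_\d{8}\.csv$"), "sales"),
    (re.compile(r"^events_.*\.json$"), "events"),
]

def route(file_name):
    """Return the transform branch for a file name, or 'unmatched'."""
    for pattern, branch in ROUTES:
        if pattern.match(file_name):
            return branch
    return "unmatched"
```

In a Beam pipeline this function would drive the partitioning of matched files, with each partition ending in its own `beam.io.WriteToParquet` step.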

Re: Streaming Beam jobs keep restarting on Spark/Kubernetes?

2020-06-12 Thread Joseph Zack
Getting back to this, as far as I can tell the job is exiting without error, even though it sees an unbounded dataset. Below is a link to the full logs for the job run: https://pastebin.com/Rh6vTqWU The rough steps:
* I submit the job via the GCP Spark operator
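One thing worth checking when a streaming job "keeps restarting" under the Spark operator is the `restartPolicy` in the `SparkApplication` manifest: the operator resubmits the driver whenever it exits, clean or not. A minimal, hypothetical fragment (name, image, and jar path are placeholders):

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: beam-streaming-job            # placeholder name
spec:
  type: Java
  mode: cluster
  image: example.com/beam-spark:latest            # placeholder image
  mainApplicationFile: local:///opt/app/pipeline.jar   # placeholder path
  restartPolicy:
    type: Always   # resubmits the driver on any exit, including a clean one
```

If the driver exits cleanly, as the logs above suggest, `type: Always` will restart it regardless; the underlying question is still why an unbounded job exits at all.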

Re: Python SDK ReadFromKafka: Timeout expired while fetching topic metadata

2020-06-12 Thread Piotr Filipiuk
Thank you for clarifying. I attempted to use the FlinkRunner with Beam 2.22 and I am getting the following error, which I am not sure how to debug:
ERROR:root:java.lang.UnsupportedOperationException: The ActiveBundle does not have a registered bundle checkpoint handler.
INFO:apache_beam.runners.portabilit
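On the original subject line: "Timeout expired while fetching topic metadata" typically means the client never reached the broker at all. Before digging into runner internals, a stdlib-only reachability check against the `bootstrap.servers` endpoint can rule that out (the host and port below are placeholders, not from the thread):

```python
import socket

def broker_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to the broker endpoint succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (placeholder endpoint): broker_reachable("localhost", 9092)
```

Note this only tests network reachability from the machine running the check; with a portable runner, the connection that matters is the one made from inside the SDK harness or expansion-service environment.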

Building Dataflow Worker

2020-06-12 Thread Talat Uyarer
Hi, I want to build the Dataflow worker on Apache Beam 2.19. However, I am facing a gRPC issue. I did not change anything: I just checked out the release-2.19.0 branch and ran the build command. Could you help me understand why it does not build? [1] Additional information: based on my limited knowledge, Looks like