Re: RenameFields behaves differently in DirectRunner

2021-06-02 Thread Matthew Ouyang
I have some other work-related things I need to do this week, so I will likely report back on this over the weekend. Thank you for the explanation. It makes perfect sense now. On Tue, Jun 1, 2021 at 11:18 PM Reuven Lax wrote: > Some more context - the problem is that RenameFields outputs (in t

Re: RenameFields behaves differently in DirectRunner

2021-06-02 Thread Brian Hulette
+dev > I bet the DirectRunner is encoding and decoding in between, which fixes the object. Do we need better testing of schema-aware (and potentially other built-in) transforms in the face of fusion to root out issues like this? Brian On Wed, Jun 2, 2021 at 5:13 AM Matthew Ouyang wrote: > I

unsubscribe

2021-06-02 Thread Pasan Kamburugamuwa
Please can you unsubscribe me. Thank you Pasan

Re: RenameFields behaves differently in DirectRunner

2021-06-02 Thread Reuven Lax
I don't think this bug is schema specific - we created a Java object that is inconsistent with its encoded form, which could happen to any transform. This does seem to be a gap in DirectRunner testing though. It also makes it hard to test using PAssert, as I believe that puts everything in a side

Re: Issues running Kafka streaming pipeline in Python

2021-06-02 Thread Ahmet Altay
/cc @Boyuan Zhang for Kafka and @Chamikara Jayalath for multi-language, who might be able to help. On Tue, Jun 1, 2021 at 9:39 PM Alex Koay wrote: > Hi all, > > I have created a simple snippet as such: > > import apache_beam as beam > from apache_beam.io.kafka import ReadFromKafka > from apache_beam.op

Re: Issues running Kafka streaming pipeline in Python

2021-06-02 Thread Chamikara Jayalath
What error did you run into with Dataflow? Did you observe any errors in worker logs? If you follow the steps given in the example here it should work. Make sure Dataflow workers have access to Kafka bootstrap servers you p

Re: Issues running Kafka streaming pipeline in Python

2021-06-02 Thread Alex Koay
Yeah, I figured it wasn't supported correctly on DirectRunner. Stumbled upon several threads saying so. On Dataflow, I've encountered a few different kinds of issues. 1. For the kafka_taxi example, the pipeline would start, the PubSub to Kafka would run, but nothing gets read from Kafka (this seem

Re: Issues running Kafka streaming pipeline in Python

2021-06-02 Thread Alex Koay
CC-ing Chamikara as he got omitted from the reply all I did earlier. On Thu, Jun 3, 2021 at 12:43 AM Alex Koay wrote: > Yeah, I figured it wasn't supported correctly on DirectRunner. Stumbled > upon several threads saying so. > > On Dataflow, I've encountered a few different kinds of issues. > 1

Re: Issues running Kafka streaming pipeline in Python

2021-06-02 Thread Chamikara Jayalath
Can you mention the Job Logs you see in the Dataflow Cloud Console page for your job? Can you also mention the pipeline and configs you used for Dataflow (assuming it's different from what's given in the example)? Make sure that you used Dataflow Runner v2 (as given in the example). Are you provi

Re: Is there a way (settings) to limit the number of elements per worker machine

2021-06-02 Thread Robert Bradshaw
If you want to control the total number of elements being processed across all workers at a time, you can do this by assigning random keys of the form RandomInteger() % TotalDesiredConcurrency followed by a GroupByKey. If you want to control the number of elements being processed in parallel per V
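A minimal sketch of the keying idea described in this message, written in plain Python rather than against the Beam API. The function names `assign_bounded_keys` and `group_by_key` are hypothetical stand-ins: in a Beam pipeline the first step would be a keying transform and the second would be a GroupByKey. The point it illustrates is that at most `total_desired_concurrency` distinct keys exist, so at most that many groups can be processed at once.

```python
import random
from collections import defaultdict


def assign_bounded_keys(elements, total_desired_concurrency, rng=None):
    """Pair each element with a random key in [0, total_desired_concurrency).

    Hypothetical helper: randrange(n) plays the role of
    RandomInteger() % TotalDesiredConcurrency from the message.
    """
    rng = rng or random.Random()
    return [(rng.randrange(total_desired_concurrency), e) for e in elements]


def group_by_key(keyed_elements):
    """Plain-Python stand-in for a GroupByKey step (an assumption, not Beam)."""
    groups = defaultdict(list)
    for key, value in keyed_elements:
        groups[key].append(value)
    return dict(groups)


# Seeded RNG so the sketch is reproducible.
keyed = assign_bounded_keys(range(20), total_desired_concurrency=4,
                            rng=random.Random(0))
groups = group_by_key(keyed)
```

However many elements arrive, the number of groups never exceeds the chosen concurrency, which is what bounds the downstream parallelism.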

Re: Is there a way (settings) to limit the number of elements per worker machine

2021-06-02 Thread Vincent Marquez
On Wed, Jun 2, 2021 at 11:11 AM Robert Bradshaw wrote: > If you want to control the total number of elements being processed > across all workers at a time, you can do this by assigning random keys > of the form RandomInteger() % TotalDesiredConcurrency followed by a > GroupByKey. > > If you want

Re: Is there a way (settings) to limit the number of elements per worker machine

2021-06-02 Thread Robert Bradshaw
On Wed, Jun 2, 2021 at 11:18 AM Vincent Marquez wrote: > > On Wed, Jun 2, 2021 at 11:11 AM Robert Bradshaw wrote: >> >> If you want to control the total number of elements being processed >> across all workers at a time, you can do this by assigning random keys >> of the form RandomInteger() % To

Re: RenameFields behaves differently in DirectRunner

2021-06-02 Thread Kenneth Knowles
Would it be caught by CoderProperties? Kenn On Wed, Jun 2, 2021 at 8:16 AM Reuven Lax wrote: > I don't think this bug is schema specific - we created a Java object that > is inconsistent with its encoded form, which could happen to any transform. > > This does seem to be a gap in DirectRunner t

Re: RenameFields behaves differently in DirectRunner

2021-06-02 Thread Reuven Lax
There is no bug in the Coder itself, so that wouldn't catch it. We could insert CoderProperties.coderDecodeEncodeEqual in a subsequent ParDo, but if the Direct runner already does an encode/decode before that ParDo, then that would have fixed the problem before we could see it. On Wed, Jun 2, 2021

Re: RenameFields behaves differently in DirectRunner

2021-06-02 Thread Brian Hulette
Could the DirectRunner just do an equality check whenever it does an encode/decode? It sounds like it's already effectively performing a CoderProperties.coderDecodeEncodeEqual for every element, just omitting the equality check. On Wed, Jun 2, 2021 at 12:04 PM Reuven Lax wrote: > There is no bug
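As a rough illustration of the check suggested here, the sketch below round-trips a value through an encoder and compares the result with the original. It is a generic Python analogue, not Beam's CoderProperties or DirectRunner code; `roundtrip_equal` and the JSON-based `encode`/`decode` pair are assumptions chosen only so the example is self-contained.

```python
import json


def encode(value):
    """Illustrative encoder (JSON as bytes); stands in for a real coder."""
    return json.dumps(value).encode("utf-8")


def decode(data):
    """Inverse of encode() for JSON-compatible values."""
    return json.loads(data.decode("utf-8"))


def roundtrip_equal(value):
    """Return True if decode(encode(value)) equals the in-memory value."""
    return decode(encode(value)) == value


# An object whose in-memory form matches its encoded form.
consistent = {"name": "a", "count": 1}

# An object that is inconsistent with its encoded form: the tuple is
# encoded as a JSON array and comes back as a list.
inconsistent = {"name": "a", "pair": (1, 2)}

consistent_ok = roundtrip_equal(consistent)
inconsistent_ok = roundtrip_equal(inconsistent)
```

The encoder itself is not buggy in the second case; the mismatch comes from the object handed to it, which is the distinction drawn earlier in the thread.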

Re: RenameFields behaves differently in DirectRunner

2021-06-02 Thread Kenneth Knowles
Mutability checking might catch that. I meant to suggest not putting the check in the pipeline, but offering a testing discipline that will catch such issues. One thing that's been on the back burner for a long time is making CoderProperties into a CoderTester like Guava's EqualityTester. Then it

Allyship workshops for open source contributors

2021-06-02 Thread Aizhamal Nurmamat kyzy
Hi Beamers, Would this community be interested in taking the Allyship Training? It requires a 90-minute commitment for remote session learning. If we have a good number of people who express interest in this thread, I will set up training for the Airflow community. If we don't have the critical mass,

Re: Allyship workshops for open source contributors

2021-06-02 Thread Aizhamal Nurmamat kyzy
> > If we have a good number of people who express interest in this thread, I > will set up training for the Airflow community. > I meant Beam ^^' I am organizing it for the Airflow community as well.

Re: RenameFields behaves differently in DirectRunner

2021-06-02 Thread Brian Hulette
> One thing that's been on the back burner for a long time is making CoderProperties into a CoderTester like Guava's EqualityTester. Reuven's point still applies here though. This issue is not due to a bug in SchemaCoder, it's a problem with the Row we gave SchemaCoder to encode. I'm assuming a Co

Re: Is there a way (settings) to limit the number of elements per worker machine

2021-06-02 Thread Vincent Marquez
On Wed, Jun 2, 2021 at 11:27 AM Robert Bradshaw wrote: > On Wed, Jun 2, 2021 at 11:18 AM Vincent Marquez > wrote: > > > > On Wed, Jun 2, 2021 at 11:11 AM Robert Bradshaw > wrote: > >> > >> If you want to control the total number of elements being processed > >> across all workers at a time, you

Re: Is there a way (settings) to limit the number of elements per worker machine

2021-06-02 Thread OrielResearch Eila Arich-Landkof
Hi Robert, Thank you. I usually work with the custom worker configuration options. I will configure it with a low number of cores and large memory and see if it solves my problem. Thanks so much, Eila www.orielresearch.com https://www.meetup.com/Deep-Learning-In-Production > On

Re: Issues running Kafka streaming pipeline in Python

2021-06-02 Thread Alex Koay
Finally figured out the issue. Can confirm that the kafka_taxi job is working as expected now. The issue was that I ran the Dataflow job with an invalid experiments flag (runner_v2 instead of use_runner_v2), and I was getting logging messages (on 2.29) that said that I was using Runner V2 even thou
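The mistake described in this message (passing `runner_v2` where `use_runner_v2` was intended) could be caught before launch with a small check like the one below. This is only a sketch: `check_runner_v2_experiments` is a hypothetical helper, not part of Beam or Dataflow, and it knows only the two flag names mentioned in the message.

```python
def check_runner_v2_experiments(experiments):
    """Return warnings for the runner_v2 / use_runner_v2 mix-up.

    Hypothetical pre-launch check; `experiments` is the list of
    experiment names about to be passed to the job.
    """
    warnings = []
    if "runner_v2" in experiments and "use_runner_v2" not in experiments:
        warnings.append(
            "experiment 'runner_v2' given; did you mean 'use_runner_v2'?"
        )
    return warnings


bad = check_runner_v2_experiments(["runner_v2"])
good = check_runner_v2_experiments(["use_runner_v2"])
```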