Strange graph generated by beam

2020-06-22 Thread 吴亭
Hi, we are using Beam to read a stream from Kafka with Spark as the runner, and I found that our Spark graph looks very strange. For each batch of the stream it generates 3 stages. One of them is our actual work, which I can understand. The other two look like duplicates of each other, you can take a look a

Re: Strange graph generated by beam

2020-06-22 Thread Luke Cwik
It would be helpful if the names of the transforms were visible on the graph; otherwise it is really hard to understand what each stage and step does.
On Mon, Jun 22, 2020 at 3:53 AM 吴亭 wrote:
> Hi,
>
> We are using the beam to read the stream from Kafka and using spark as the
> runner, and I found

Re: Designing an existing pipeline in Beam

2020-06-22 Thread Luke Cwik
Who reads map 1? Can it be stale? It is unclear what you are trying to do in parallel and why you wouldn't stick all this logic into a single DoFn / stateful DoFn.
On Sat, Jun 20, 2020 at 7:14 PM Praveen K Viswanathan <harish.prav...@gmail.com> wrote:
> Hello Everyone,
>
> I am in the process o

Flink/Portable Runner error on AWS EMR

2020-06-22 Thread Jesse Lord
I am trying to run the wordcount quickstart example on a Flink cluster on AWS EMR, with Beam version 2.22 and Flink 1.10. I get the following error: ERROR:root:java.util.ServiceConfigurationError: com.fasterxml.jackson.databind.Module: Provider com.fasterxml.jackson.module.jaxb.JaxbAnnotationModule no

Re: KafkaIO Exactly once vs At least Once

2020-06-22 Thread Alexey Romanenko
I think you don’t need to enable EOS in this case, since KafkaIO has a dedicated EOS-sink implementation for the Beam part (btw, it’s not supported by all runners), and it relies on setting “enable.idempotence=true” for KafkaProducer. I’m not sure that you can achieve “at least once” semantics with cur

Re: Designing an existing pipeline in Beam

2020-06-22 Thread Praveen K Viswanathan
Hi Luke, you can think of Map 2 as a kind of template that is used to enrich the data in Map 1. As I mentioned in my previous post, this is a high-level scenario. All this logic is spread across several classes (with ~4K lines of code in total). As in any Java application, 1. The code has been

Re: Designing an existing pipeline in Beam

2020-06-22 Thread Praveen K Viswanathan
Another way to put this question: how do we write a Beam pipeline for an existing pipeline (in Java) that has a dozen custom objects, where you have to work with multiple HashMaps of those custom objects in order to transform it? Currently, I am writing a Beam pipeline by using the same Custom o

Re: KafkaIO Exactly once vs At least Once

2020-06-22 Thread Eleanore Jin
Hi Alexey, thanks for the response. Below are some follow-up questions. Is the KafkaIO EOS-sink you referred to KafkaExactlyOnceSink? If so, my understanding is that it is used as KafkaIO.write().withEOS(numPartitionsForSinkTopic, "mySlinkGroupId"). Reading your response, do I need additionally c

Re: Error restoring Flink checkpoint

2020-06-22 Thread Ivan San Jose
Hi again, just replying here in case this could be useful for someone, as using Flink checkpoints on Beam is not reliable at all right now... Even though I removed the class references to the serialized object in AvroCoder, in the end I couldn't make AvroCoder work, as it infers the schema using ReflectData cla

Re: Error restoring Flink checkpoint

2020-06-22 Thread Reuven Lax
Yes, I agree that serializing coders into the checkpoint creates problems. I'm wondering whether it is possible to serialize the coder URN + args instead.
On Mon, Jun 22, 2020 at 11:00 PM Ivan San Jose wrote:
> Hi again, just replying here in case this could be useful for someone
> as using Flin

Re: Error restoring Flink checkpoint

2020-06-22 Thread Ivan San Jose
I don't really know; my knowledge of the Beam source code is not that deep. But I managed to modify AvroCoder to store a string containing the class name (and the class parameters, in case it is a parameterized class) instead of references to Class. But, as I said, AvroCoder is using Avro's ReflectDat
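
For readers following this thread, the general idea mentioned in the message above (keeping a class name string in the serialized form instead of a reference to Class) can be sketched roughly as follows. This is only an illustration using plain Java serialization; ClassNameHolder and roundTrip are hypothetical names, not Beam or Avro APIs, and this is not the actual change made to AvroCoder.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical helper (not part of Beam): serializes a class by name, not by reference. */
public class ClassNameHolder implements Serializable {
    private static final long serialVersionUID = 1L;

    // Only this string ends up in the serialized bytes.
    private final String className;

    // Transient: skipped by Java serialization, re-resolved lazily after restore.
    private transient Class<?> resolved;

    public ClassNameHolder(Class<?> type) {
        this.className = type.getName();
        this.resolved = type;
    }

    public String getClassName() {
        return className;
    }

    /** Looks the class up by name on first use after deserialization. */
    public Class<?> resolve() throws ClassNotFoundException {
        if (resolved == null) {
            resolved = Class.forName(className);
        }
        return resolved;
    }

    /** Serializes and deserializes the holder in memory, to mimic a save/restore cycle. */
    public static ClassNameHolder roundTrip(ClassNameHolder holder)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(holder);
        }
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (ClassNameHolder) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ClassNameHolder restored = roundTrip(new ClassNameHolder(java.util.ArrayList.class));
        System.out.println(restored.getClassName());
        System.out.println(restored.resolve() == java.util.ArrayList.class);
    }
}
```

Note that this only covers the reference held by the holder itself. Anything that is later derived from the resolved class by reflection is outside the scope of this sketch.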