Hi folks, I have the following use case: I need to read data from HDFS and publish it to Kafka, so that it can be reprocessed by another job.
I've searched the web and read the docs, but this has turned up no concrete examples or information on how this is achieved, or even whether it's possible at all.

Further context:
1. Flink will be deployed to Kubernetes.
2. Kerberos is active on Hadoop.
3. The data is stored on HDFS as Avro.
4. I cannot install Flink on our Hadoop environment.
5. No stateful computations will be performed.

I've noticed that the flink-avro package provides a class called AvroInputFormat<T> with a nullable path field, and I think this is my go-to. The code I have in mind looks something like this:

StreamExecutionEnvironment env = ...;
AvroInputFormat<Source> inf = new AvroInputFormat<>(null, Source.class);
DataStreamSource<Source> stream = env.readFile(inf, "hdfs://path/to/data");
// rest, + publishing to Kafka using the FlinkKafkaProducer

My major questions and concerns are:
1. Is it possible to read from HDFS using the StreamExecutionEnvironment object? I'm planning on using the DataStream API because of point (2) below.
2. Because Flink will be deployed on Kubernetes, I have a major concern that if I were to use the DataSet API, the pods will restart once the job completes and exits, causing unnecessary duplication of data. Is the pod restart a valid concern?
3. Is there anything special I need to worry about regarding Kerberos in this instance? The keytab will be materialised on the pods at start-up.
4. Is this even a valid approach? The dataset I need to read and replay is small (12 TB).

Any help, even in part, will be appreciated.

Kind regards,
Damien
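
P.S. In case it helps make the idea concrete, below is a slightly fuller sketch of the job I'm describing. It is only a sketch under my current assumptions: Source is our Avro-generated record class, the HDFS path, broker list, and topic name are placeholders, and the SerializationSchema lambda is a stand-in for whatever serializer we actually end up needing (here it just writes toString() bytes).

import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.flink.api.common.serialization.SerializationSchema;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.avro.AvroInputFormat;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class HdfsToKafkaReplay {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder HDFS path; Source is our Avro-generated record class.
        String inputPath = "hdfs://namenode/path/to/data";
        AvroInputFormat<Source> inputFormat = new AvroInputFormat<>(new Path(inputPath), Source.class);

        // readFile() runs the input format as a bounded streaming source;
        // with the default processing mode the job finishes once the files are read.
        DataStreamSource<Source> records = env.readFile(inputFormat, inputPath);

        Properties kafkaProps = new Properties();
        kafkaProps.setProperty("bootstrap.servers", "kafka-broker:9092"); // placeholder brokers

        // Placeholder serializer; the real job would publish the Avro bytes, not toString().
        SerializationSchema<Source> serializer =
                element -> element.toString().getBytes(StandardCharsets.UTF_8);

        records.addSink(new FlinkKafkaProducer<>("replay-topic", serializer, kafkaProps));

        env.execute("hdfs-to-kafka-replay");
    }
}

I'm also assuming the Kerberos side is handled outside the job code, by pointing security.kerberos.login.keytab and security.kerberos.login.principal in flink-conf.yaml at the keytab materialised on the pods; please correct me if that's not the right place for it.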