Hi,

1. Yes, StreamExecutionEnvironment.readFile can be used for files on HDFS (a rough sketch is below).
2. I think this is a valid concern. Besides that, there are plans to deprecate the DataSet API [1].
4. Yes, the approach looks good.
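For reference, here is a minimal, untested sketch of (1) and (4) combined. Source stands in for your Avro-generated class; the path, broker, and topic names are placeholders, and the string serialization at the end is just to keep the example short (in practice you would publish the Avro bytes):

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.formats.avro.AvroInputFormat;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class HdfsToKafkaReplay {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // The path can stay null here; readFile() below supplies it.
        AvroInputFormat<Source> inputFormat =
                new AvroInputFormat<>(null, Source.class);

        // Bounded read: the job finishes once all splits are consumed.
        DataStreamSource<Source> stream =
                env.readFile(inputFormat, "hdfs:///path/to/data");

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");

        stream.map(Object::toString)
              .addSink(new FlinkKafkaProducer<>(
                      "replay-topic", new SimpleStringSchema(), props));

        env.execute("hdfs-to-kafka-replay");
    }
}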
I'm pulling in Aljoscha for your 3rd question (and probably some clarifications on others).

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866741

Regards,
Roman

On Fri, Sep 25, 2020 at 12:50 PM Damien Hawes <marley.ha...@gmail.com> wrote:

> Hi folks,
>
> I've got the following use case, where I need to read data from HDFS and
> publish the data to Kafka, such that it can be reprocessed by another job.
>
> I've searched the web and read the docs, but this has turned up no
> concrete examples or information on how this is achieved, or even whether
> it's possible at all.
>
> Further context:
>
> 1. Flink will be deployed to Kubernetes.
> 2. Kerberos is active on Hadoop.
> 3. The data is stored on HDFS as Avro.
> 4. I cannot install Flink on our Hadoop environment.
> 5. No stateful computations will be performed.
>
> I've noticed that the flink-avro package provides a class called
> AvroInputFormat<T>, with a nullable path field, and I think this is my
> go-to.
>
> Apologies for the poor formatting ahead, but the code I have in mind
> looks something like this:
>
> StreamExecutionEnvironment env = ...;
> AvroInputFormat<Source> inf = new AvroInputFormat<>(null, Source.class);
> DataStreamSource<Source> stream = env.readFile(inf, "hdfs://path/to/data");
> // rest, + publishing to Kafka using the FlinkKafkaProducer
>
> My major questions and concerns are:
>
> 1. Is it possible to read from HDFS using the StreamExecutionEnvironment
> object? I'm planning on using the DataStream API because of point (2)
> below.
> 2. Because Flink will be deployed on Kubernetes, I have a major concern
> that if I were to use the DataSet API, once Flink completes and exits,
> the pods will restart, causing unnecessary duplication of data. Is the
> pod restart a valid concern?
> 3. Is there anything special I need to worry about regarding Kerberos in
> this instance? The keytab will be materialised on the pods upon start-up.
> 4. Is this even a valid approach? The dataset I need to read and replay
> is small (12 TB).
>
> Any help, even in part, will be appreciated.
>
> Kind regards,
>
> Damien
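P.S. On your 3rd question, until Aljoscha chimes in: as far as I know, the usual mechanism is to point Flink at the materialised keytab in flink-conf.yaml, along these lines (the path and principal are placeholders, and the KafkaClient context only matters if Kafka is kerberized too):

security.kerberos.login.use-ticket-cache: false
security.kerberos.login.keytab: /path/to/materialised.keytab
security.kerberos.login.principal: flink-user@EXAMPLE.COM
security.kerberos.login.contexts: Client,KafkaClient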