Hi folks,

I've got the following use case: I need to read data from HDFS and publish
it to Kafka so that it can be reprocessed by another job.

I've searched the web and read the docs, but this has turned up no concrete
examples or information on how this is achieved, or even whether it's
possible at all.

Further context:

1. Flink will be deployed to Kubernetes.
2. Kerberos is active on Hadoop.
3. The data is stored on HDFS as Avro.
4. I cannot install Flink on our Hadoop environment.
5. No stateful computations will be performed.

I've noticed that the flink-avro package provides a class called
AvroInputFormat<T>, with a nullable path field, and I think this is my go-to.

Apologies for the poor formatting ahead, but the code I have in mind looks
something like this:



StreamExecutionEnvironment env = ...;
// the path is supplied via readFile() below, so the format is built without one
AvroInputFormat<Source> inf = new AvroInputFormat<>(null, Source.class);
DataStreamSource<Source> stream = env.readFile(inf, "hdfs://path/to/data");
// rest, + publishing to Kafka using the FlinkKafkaProducer

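For the Kafka side, the sketch I have in mind looks roughly like the
following (the topic name, broker address and the String conversion via
SimpleStringSchema are placeholders for illustration, not the real job
configuration):

// continuing from env/stream above; topic, brokers and the String
// serialization are placeholders only
// (imports assumed: java.util.Properties,
//  org.apache.flink.api.common.serialization.SimpleStringSchema,
//  org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer)
Properties props = new Properties();
props.setProperty("bootstrap.servers", "kafka-broker:9092");

FlinkKafkaProducer<String> producer = new FlinkKafkaProducer<>(
        "replay-topic",               // placeholder topic name
        new SimpleStringSchema(),     // placeholder serialization
        props);

stream
        .map(Source::toString)        // placeholder conversion of the Avro records
        .addSink(producer);

env.execute("hdfs-to-kafka-replay");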


My major questions and concerns are:

1. Is it possible to read from HDFS using the StreamExecutionEnvironment
object? I'm planning on using the DataStream API because of point (2) below.
2. Because Flink will be deployed on Kubernetes, I have a major concern
that if I were to use the DataSet API, then once the job completes and the
process exits, the pods will restart and re-run the job, causing unnecessary
duplication of data. Is the pod restart a valid concern?
3. Is there anything special I need to be worried about regarding Kerberos
in this instance? The keytab will be materialised on the pods at startup;
the configuration I'm planning is sketched after this list.
4. Is this even a valid approach? The dataset I need to read and replay is
small (12 TB).

Any help, even in part will be appreciated.

Kind regards,

Damien
