Hi,

1. Yes, StreamExecutionEnvironment.readFile can be used for files on HDFS (a rough sketch is below).
2. I think this is a valid concern. Besides that, there are plans to deprecate the DataSet API [1].
4. Yes, the approach looks good.
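For reference, here is a minimal, untested sketch of (1) and (4) combined. Source stands in for your Avro-generated class; the path, broker, and topic names are placeholders, and the string serialization at the end is just to keep the example short (in practice you would publish the Avro bytes):

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.formats.avro.AvroInputFormat;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class HdfsToKafkaReplay {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // The path can stay null here; readFile() below supplies it.
        AvroInputFormat<Source> inputFormat =
                new AvroInputFormat<>(null, Source.class);

        // Bounded read: the job finishes once all splits are consumed.
        DataStreamSource<Source> stream =
                env.readFile(inputFormat, "hdfs:///path/to/data");

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");

        stream.map(Object::toString)
              .addSink(new FlinkKafkaProducer<>(
                      "replay-topic", new SimpleStringSchema(), props));

        env.execute("hdfs-to-kafka-replay");
    }
}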
I'm pulling in Aljoscha for your 3rd question (and probably some clarifications on others).

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866741

Regards,
Roman

On Fri, Sep 25, 2020 at 12:50 PM Damien Hawes <marley.ha...@gmail.com> wrote:

> Hi folks,
>
> I've got the following use case, where I need to read data from HDFS and
> publish the data to Kafka, such that it can be reprocessed by another job.
>
> I've searched the web and read the docs, but this has turned up no
> concrete examples or information on how this is achieved, or even whether
> it's possible at all.
>
> Further context:
>
> 1. Flink will be deployed to Kubernetes.
> 2. Kerberos is active on Hadoop.
> 3. The data is stored on HDFS as Avro.
> 4. I cannot install Flink on our Hadoop environment.
> 5. No stateful computations will be performed.
>
> I've noticed that the flink-avro package provides a class called
> AvroInputFormat<T>, with a nullable path field, and I think this is my
> go-to.
>
> Apologies for the poor formatting ahead, but the code I have in mind
> looks something like this:
>
> StreamExecutionEnvironment env = ...;
> AvroInputFormat<Source> inf = new AvroInputFormat<>(null, Source.class);
> DataStreamSource<Source> stream = env.readFile(inf, "hdfs://path/to/data");
> // rest, + publishing to Kafka using the FlinkKafkaProducer
>
> My major questions and concerns are:
>
> 1. Is it possible to read from HDFS using the StreamExecutionEnvironment
> object? I'm planning on using the DataStream API because of point (2)
> below.
> 2. Because Flink will be deployed on Kubernetes, I have a major concern
> that if I were to use the DataSet API, once Flink completes and exits,
> the pods will restart, causing unnecessary duplication of data. Is the
> pod restart a valid concern?
> 3. Is there anything special I need to worry about regarding Kerberos in
> this instance? The keytab will be materialised on the pods upon start-up.
> 4. Is this even a valid approach? The dataset I need to read and replay
> is small (12 TB).
>
> Any help, even in part, will be appreciated.
>
> Kind regards,
>
> Damien
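P.S. On your 3rd question, until Aljoscha chimes in: as far as I know, the usual mechanism is to point Flink at the materialised keytab in flink-conf.yaml, along these lines (the path and principal are placeholders, and the KafkaClient context only matters if Kafka is kerberized too):

security.kerberos.login.use-ticket-cache: false
security.kerberos.login.keytab: /path/to/materialised.keytab
security.kerberos.login.principal: flink-user@EXAMPLE.COM
security.kerberos.login.contexts: Client,KafkaClient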