Usually, you should use the HDFS nameservice instead of the NameNode
hostname:port, so that checkpointing keeps working across NN failover.
You can find the configured nameservice in hdfs-site.xml under the
key *dfs.nameservices*.
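
For example, if your hdfs-site.xml contains something like the following
(the nameservice name "mycluster" is only an illustration; use whatever
your cluster actually defines):

    <property>
        <name>dfs.nameservices</name>
        <value>mycluster</value>
    </property>

then you can point the checkpoint directory at the nameservice directly:

    state.checkpoints.dir: hdfs://mycluster/flink-checkpoints

and the HDFS client will resolve the currently active NameNode for you.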


Best,
Yang

On Fri, Mar 22, 2024 at 8:33 PM Sachin Mittal <sjmit...@gmail.com> wrote:

> So, when we create an EMR cluster, the NN service runs on the primary node
> of the cluster.
> Now, at the time of creating the cluster, how can we specify the name of
> this NN in the format hdfs://*namenode-host*:8020/ ?
>
> Is there a standard name by which we can identify the NN server?
>
> Thanks
> Sachin
>
>
> On Fri, Mar 22, 2024 at 12:08 PM Asimansu Bera <asimansu.b...@gmail.com>
> wrote:
>
>> Hello Sachin,
>>
>> Typically, cloud VMs are ephemeral: if the EMR cluster goes down, or VMs
>> are shut down for security updates or due to faults, new VMs will be added
>> to the cluster. As a result, any data stored in the local file system, such
>> as file:///tmp, would be lost. To ensure persistence and avoid losing
>> checkpoint or savepoint data needed for recovery, it is advisable to store
>> such data in a persistent storage solution like HDFS or S3.
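>>
>> For instance, a checkpoint path on S3 could look like the following (the
>> bucket name below is just a placeholder):
>>
>>     state.checkpoints.dir: s3://<your-bucket>/flink-checkpoints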
>>
>> Generally, the Hadoop NN on EMR runs on port 8020. You can find the NN IP
>> details from the EMR service.
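>>
>> One way to check (assuming you can SSH to the primary node) is to read the
>> default filesystem from the Hadoop configuration there, e.g.:
>>
>>     hdfs getconf -confKey fs.defaultFS
>>
>> On EMR this typically prints something like hdfs://<primary-node-dns>:8020.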
>>
>> Hope this helps.
>>
>> -A
>>
>>
>> On Thu, Mar 21, 2024 at 10:54 PM Sachin Mittal <sjmit...@gmail.com>
>> wrote:
>>
>>> Hi,
>>> We are using AWS EMR, where we submit our Flink jobs to a long-running
>>> Flink cluster on YARN.
>>>
>>> We wanted to configure RocksDBStateBackend as our state backend to store
>>> our checkpoints.
>>>
>>> So we have configured the following properties in our flink-conf.yaml:
>>>
>>>    - state.backend.type: rocksdb
>>>    - state.checkpoints.dir: file:///tmp
>>>    - state.backend.incremental: true
>>>
>>>
>>> My question here is regarding the checkpoint location: what is the
>>> difference between using a local filesystem and the Hadoop Distributed
>>> File System (HDFS)?
>>>
>>> What advantages do we get if we use:
>>>
>>> *state.checkpoints.dir*: hdfs://namenode-host:port/flink-checkpoints
>>> vs
>>> *state.checkpoints.dir*: file:///tmp
>>>
>>> Also, if we decide to use HDFS, where can we get the value for
>>> *namenode-host:port*,
>>> given we are running Flink on EMR?
>>>
>>> Thanks
>>> Sachin
>>>
>>>
>>>
