distcp can upload a directory tree of changed files; for cloud storage it
looks for differences in file timestamps.
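
For example, something like this (the bucket name, MinIO endpoint and
paths are placeholders for your own settings):

  hadoop distcp \
    -D fs.s3a.endpoint=http://minio.example.com:9000 \
    -D fs.s3a.path.style.access=true \
    -update hdfs://namenode/data/output s3a://mybucket/output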

Otherwise, the HDFS namenode has a log4j audit logger:
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit

This logs all namenode filesystem operations in a structured format
(which is consistent across releases); you can read that and use whatever
copy command you want to upload the changed files.
https://stackoverflow.com/questions/44533589/hdfs-audit-logs-format-and-explanation
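
Here's a minimal sketch of pulling changed paths out of that log, assuming
the usual key=value layout described at that link; the log path and the
set of write operations are assumptions you'd need to adapt:

  import re

  # an audit line looks roughly like:
  # 2025-03-12 10:00:00,123 INFO FSNamesystem.audit: allowed=true \
  #   ugi=alice (auth:SIMPLE) ip=/10.0.0.1 cmd=create \
  #   src=/data/output/part-0000 dst=null perm=alice:hadoop:rw-r--r-- proto=rpc

  WRITE_OPS = {"create", "rename", "append"}  # whichever operations you care about

  def changed_paths(lines):
      """Yield HDFS paths touched by successful write operations."""
      for line in lines:
          fields = dict(re.findall(r"(\w+)=(\S+)", line))  # assumes no spaces in paths
          if fields.get("allowed") == "true" and fields.get("cmd") in WRITE_OPS:
              # a rename's new location is in dst=, everything else is in src=
              yield fields["dst"] if fields["cmd"] == "rename" else fields["src"]

  with open("hdfs-audit.log") as f:  # placeholder path
      for path in changed_paths(f):
          print(path)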

AFAIK there's no tool which takes a list of changed files and uploads them;
I've played with using Spark for this, but it's not easy.
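
If you roll your own instead, here's a minimal sketch; it assumes the
hadoop CLI is on the PATH, the s3a endpoint for MinIO is set in
core-site.xml, and the bucket/namenode names are placeholders:

  import subprocess
  from concurrent.futures import ThreadPoolExecutor

  DEST = "s3a://mybucket"  # placeholder MinIO bucket

  def upload(path):
      # hadoop fs -cp copies between any two Hadoop filesystems; -f overwrites
      subprocess.run(
          ["hadoop", "fs", "-cp", "-f", f"hdfs://namenode{path}", DEST + path],
          check=True)

  with open("changed-files.txt") as f:  # one absolute HDFS path per line
      paths = [line.strip() for line in f if line.strip()]

  # a few parallel copies; tune the worker count for your cluster and store
  with ThreadPoolExecutor(max_workers=8) as pool:
      list(pool.map(upload, paths))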

The cloudstore app we use for diagnostics has a `cloudup` command to
upload a directory tree to cloud storage (or between any two Hadoop
filesystems):
https://github.com/steveloughran/cloudstore
https://github.com/steveloughran/cloudstore/blob/main/src/main/site/cloudup.md

That's a single process, but it is aggressive: optimised for cloud storage,
with rate limiting by shard.
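Invocation is roughly like the following; I'm writing from memory, so
treat the jar name and arguments as assumptions and check the cloudup.md
page above for the real usage:

  hadoop jar cloudstore-1.0.jar cloudup \
    hdfs://namenode/data/output s3a://mybucket/output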

You could probably modify it to take a list of source + destination paths
to build its upload list, then let it do the work.

You would have to do that on your own, I'm afraid, though a PR of your
changes would be welcome, especially with tests.

steve


On Wed, 12 Mar 2025 at 04:35, Arif Kamil Yilmaz
<a_k_yil...@yahoo.com.invalid> wrote:

>
> Subject: Data copy from HDFS to MinIO regularly
>
> Hello Team,
>
> There is an application that was developed a long time ago, and this
> application processes 10GB of binary data per hour using MapReduce and
> generates 100GB of data, which is then written to the HDFS file system.
>
> My goal is to move a portion of the processed data (approximately 25%) to
> a MinIO cluster that I plan to use as new object storage. I want this
> operation to be repeated every time new data is added to the HDFS cluster.
>
> What kind of solution would you suggest to complete this task?
> Additionally, I would like to remind you that I have requirements related
> to monitoring the pipeline I am developing.
>
> Thank you.
>
>
