distcp -update can upload a directory tree of only the changed files; for cloud storage it looks for differences in file timestamps.
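For example, an incremental copy to a MinIO bucket over the s3a connector could look like this (endpoint, credentials setup, paths and bucket name here are all placeholders to adapt):

    hadoop distcp -update \
      -D fs.s3a.endpoint=http://minio.example.com:9000 \
      -D fs.s3a.path.style.access=true \
      hdfs://namenode:8020/data/output \
      s3a://bucket/output

Only files which differ from what is already at the destination get copied; rerunning the same command picks up whatever changed since the last run.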
Otherwise, the HDFS namenode has a log4j audit logger, org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit. This prints out all namenode filesystem operations in a structured format (one which is consistent across releases). You can read that log and use whatever copy command you want to upload the changed files; there is a sketch of this after the quoted message below.
https://stackoverflow.com/questions/44533589/hdfs-audit-logs-format-and-explanation

AFAIK there's no tool which takes a list of changed files and uploads them; I've played with using Spark for this, but it's not easy.

The cloudstore app we use for diagnostics has a `cloudup` command to upload a directory tree to cloud storage (or from any Hadoop filesystem to any other):
https://github.com/steveloughran/cloudstore
https://github.com/steveloughran/cloudstore/blob/main/src/main/site/cloudup.md

That's a single process, but it is aggressive: optimised for cloud storage, with rate limiting by shard. You could probably modify it to take a list of source + destination paths to build its upload list, then let it do the work. You would have to do that on your own, I'm afraid, though a PR of your changes would be welcome, especially with tests.

steve

On Wed, 12 Mar 2025 at 04:35, Arif Kamil Yilmaz
<a_k_yil...@yahoo.com.invalid> wrote:
>
> Subject: Data copy from HDFS to MinIO regularly
>
> Hello Team,
>
> There is an application that was developed a long time ago, and this
> application processes 10GB of binary data per hour using MapReduce and
> generates 100GB of data, which is then written to the HDFS file system.
>
> My goal is to move a portion of the processed data (approximately 25%) to
> a MinIO cluster that I plan to use as new object storage. I want this
> operation to be repeated every time new data is added to the HDFS cluster.
>
> What kind of solution would you suggest to complete this task?
> Additionally, I would like to remind you that I have requirements related
> to monitoring the pipeline I am developing.
>
> Thank you.
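Here is the promised sketch of the audit-log approach: untested, assuming the tab-separated key=value audit record layout from the Stack Overflow link above, with the log path and bucket name as illustrative placeholders.

    # Pull the unique paths of files created since the log was last rotated.
    # /var/log/hadoop/hdfs-audit.log is an assumption; check your cluster's
    # log4j configuration for the real location.
    awk -F'\t' '/cmd=create/ {
        for (i = 1; i <= NF; i++)
          if ($i ~ /^src=/) print substr($i, 5)
      }' /var/log/hadoop/hdfs-audit.log | sort -u > /tmp/changed.txt

    # distcp's -f option reads its source list from a file, so the
    # extracted paths can all be uploaded in one run.
    hadoop distcp -f file:///tmp/changed.txt s3a://bucket/landing/

Handling files which were renamed or deleted after creation would need cmd=rename and cmd=delete entries processed too; that is where this stops being easy.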