Flink Dataset job submission very slow

2020-05-18 Thread ysnakie
I have many lzo files on HDFS in the following path format: /logs/{id}/{date}/xxx[1-100].lzo
/logs/a/ds=2018-01-01/xxx1.lzo
/logs/b/ds=2018-01-01/xxx1.lzo
...
/logs/z/ds=2018-01-02/xxx1.lzo
...
/logs/z/ds=2020-05-01/xxx100.lzo
I am using Flink Dataset to read those files by a range of {date} and a
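
To make the directory layout in this post concrete, here is a minimal sketch in plain Java that lists the /logs/{id}/ds={date} directories covered by an inclusive date range. The class name LogPaths, the method dirsFor, and the sample ids are hypothetical and are not taken from the original post; the sketch builds path strings only and does not call any Flink or HDFS API.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class LogPaths {
    // Hypothetical helper: returns one directory per (id, date) pair,
    // with both the start and end dates included in the range.
    static List<String> dirsFor(List<String> ids, LocalDate start, LocalDate end) {
        List<String> dirs = new ArrayList<>();
        for (String id : ids) {
            for (LocalDate d = start; !d.isAfter(end); d = d.plusDays(1)) {
                // LocalDate.toString() yields ISO-8601 (yyyy-MM-dd), matching ds=2018-01-01
                dirs.add("/logs/" + id + "/ds=" + d);
            }
        }
        return dirs;
    }

    public static void main(String[] args) {
        // Sample ids and dates are assumptions for illustration only.
        List<String> dirs = dirsFor(
                List.of("a", "b"),
                LocalDate.parse("2018-01-01"),
                LocalDate.parse("2018-01-03"));
        dirs.forEach(System.out::println);
    }
}
```

The number of directories grows as ids × days, so a wide date range over many ids produces a long list of input paths.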

checkpointing opening too many files

2020-04-24 Thread ysnakie
Hi everyone, we have a Flink job that writes files to different directories on HDFS. It opens many files due to its high parallelism. I also found that with the RocksDB state backend, even more files are open during checkpointing. We use yarn to schedu