Hi,
We’re trying to distribute batch input data to (N) HDFS files partitioning by
hash using DataSet API. What I’m doing is like:
env.createInput(…)
.partitionByHash(0)
.setParallelism(N)
.output(…)
This works well for small number of files. But when we need to distribute to
large number of files (say 100K), the parallelism becomes too large and we
could not afford that many TMs.
In spark we can write something like ‘rdd.partitionBy(N)’ and control the
parallelism separately (using dynamic allocation). Is there anything similar in
Flink or other way we can achieve similar result? Thank you!
Qi