Distribute Parallelism/Tasks within RichOutputFormat?

Hailu, Andreas [Engineering] Tue, 22 Dec 2020 18:54:38 -0800

Hi folks,

I've got a single RichOutputFormat which is composed of two 
HadoopOutputFormats, let's call them A and B, each writing to a different HDFS 
directory. If a Record matches a certain condition it's written using A, 
otherwise it's written using B. Currently, the parallelism set on the 
RichOutputFormat seems to propagate to both A and B: if the parallelism on the 
RichOutputFormat is 10, outputs A and B each create 10 files, even if A 
receives all the records and B receives none.


My app knows the ratio of records it expects to send to output A versus 
output B, and I would ideally like to pass that down through the 
RichOutputFormat. For example, with a parallelism of 10 and 70% of the Records 
going to A, I would like to give A a parallelism of 7 and B a parallelism of 3.
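
To make the arithmetic concrete, here is a rough sketch of the split I have in 
mind. The class and method names are hypothetical (this is not an existing 
Flink API), and the choice to keep at least one task per output is my own 
assumption so that neither writer is left without a slot.

```java
final class ParallelismSplit {
    private ParallelismSplit() {}

    /**
     * Hypothetical helper: divides a total parallelism between outputs A and B
     * according to the expected share of records going to A.
     * Returns {parallelismA, parallelismB}; each side gets at least 1 (assumption).
     */
    static int[] split(int total, double shareA) {
        if (total < 2) {
            throw new IllegalArgumentException("total parallelism must be at least 2");
        }
        if (shareA < 0.0 || shareA > 1.0) {
            throw new IllegalArgumentException("shareA must be between 0 and 1");
        }
        // Round to the nearest whole task, then clamp so both outputs keep one task.
        int a = (int) Math.round(total * shareA);
        a = Math.max(1, Math.min(total - 1, a));
        return new int[] {a, total - a};
    }

    public static void main(String[] args) {
        int[] p = split(10, 0.7);
        System.out.println("A=" + p[0] + ", B=" + p[1]); // prints "A=7, B=3"
    }
}
```

What I don't know is whether the two resulting numbers can be applied to A and 
B separately while they live inside one RichOutputFormat.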

I'm asking because the current approach can produce many redundant empty 
files, and I'd like to minimize that if possible. Is something like this 
supported?

____________

Andreas Hailu
Data Lake Engineering | Goldman Sachs & Co.


