I have been moving some old MR and Hive workflows into Flink because I'm
enjoying the APIs and the ease of development.  Things have largely worked
great until I tried to really scale some of the jobs recently.

I have, for example, one ETL job that reads in about 12B records at a time
and does a sort, some simple transformations, validation, a re-partition,
and then outputs to a Hive table.
When I built it against the sample set of ~200M records, it worked great:
it took maybe a minute and blew through it.
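In case it helps, here is a rough sketch of the job's shape using the
DataSet API. The paths, record schema, and parsing/validation logic below
are simplified placeholders, not my actual code:

import org.apache.flink.api.common.operators.Order;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class EtlJobSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read the raw records from HDFS (path is a placeholder).
        DataSet<String> raw = env.readTextFile("hdfs:///data/raw/records");

        DataSet<Tuple2<String, String>> parsed = raw
                // Simple transformation: parse each line into (key, payload).
                .map(line -> {
                    String[] parts = line.split("\t", 2);
                    return Tuple2.of(parts[0], parts.length > 1 ? parts[1] : "");
                })
                .returns(Types.TUPLE(Types.STRING, Types.STRING))
                // Validation: drop records without a key.
                .filter(t -> !t.f0.isEmpty());

        DataSet<Tuple2<String, String>> repartitioned = parsed
                // Re-partition by key, then sort within each partition.
                .partitionByHash(0)
                .sortPartition(0, Order.ASCENDING);

        // Write to the HDFS directory backing the Hive table (placeholder path).
        repartitioned.writeAsCsv("hdfs:///warehouse/my_table");

        env.execute("etl-job-sketch");
    }
}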

What I have observed is that some kind of saturation point is reached,
depending on the number of slots, the number of nodes, and the overall size
of the data to move.  When I run the 12B set, the first 1B go through in
under a minute, really fast.  But there is an extremely sharp drop-off
after that: the next 1B might take 15 minutes, and if I wait for the 1B
after that, it's well over an hour.

What I can't find are any obvious indicators or things to look at;
everything just grinds to a halt, and I don't think the job would ever
actually complete.

Is there something in the design of Flink in batch mode that is perhaps
memory bound?  Adding more nodes/tasks does not fix it, just gets me a
little further along.  I'm already running around 1,400 slots at this
point, and I'd postulate needing 10,000+ to potentially make the job run,
but that's too much of my cluster gone, and I have yet to get Flink to be
stable past 1,500 slots.

Any ideas on where to look, or what to debug?  The GUI is also very
cumbersome to use at this slot count, so other measurement ideas are
welcome too!

Thank you all.
