I have my main application updating with a blue-green deployment strategy:
a new version (always called green) starts receiving an initial fraction of
the web traffic and then, based on the error rates, we increase the
percentage of traffic until 100% of traffic is being handled by the green
version. At that point we decommission blue, and green becomes the new blue
when the next version comes along.

Applied to Flink, my initial thought is that you would run the two
topologies in parallel, but the first action of each topology would be a
filter based on the key.

Basically, you would use a consistent transformation of the key into a
number between 0 and 100, and the filter would be:

    (key) -> color == green ? f(key) < level : f(key) >= level
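To make that concrete, here is a minimal sketch in plain Java (names like
`bucket` and `accepts` are mine, not anything standard). It assumes f(key)
is just a stable hash mapped into [0, 100); `String.hashCode` is specified
by the language spec, so both topologies would compute the same bucket for
the same key:

```java
public class BlueGreenFilter {

    // f(key): map a key deterministically into [0, 100).
    // String.hashCode is specified by the JLS, so blue and green
    // agree on the bucket for any given key.
    static int bucket(String key) {
        return Math.floorMod(key.hashCode(), 100);
    }

    // Green keeps buckets below `level`; blue keeps the rest,
    // so every key is processed by exactly one topology.
    static boolean accepts(boolean isGreen, int level, String key) {
        return isGreen ? bucket(key) < level : bucket(key) >= level;
    }

    public static void main(String[] args) {
        int level = 25; // route ~25% of keys to green
        String key = "user-42";
        boolean takenByGreen = accepts(true, level, key);
        boolean takenByBlue = accepts(false, level, key);
        // The two filters are complementary for any key and level.
        System.out.println(takenByGreen != takenByBlue); // always true
    }
}
```

In Flink this predicate would simply be the body of a `FilterFunction` at
the head of each topology, with `level` read from some shared config.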

Then I can use a suitable metric to determine whether the new topology is
working, and ramp the level up or down.

One issue I foresee is what happens if the level changes mid-window: I will
have output from both topologies when the window ends.

In the case of my output, which is aggregatable, I will get the same
results from two rows as from one row *provided* that the switch from blue
to green is synchronized between the two topologies. That sounds like a
hard problem though.


Another thought I had was to let the web front-end decide, based on the
same key-vs-level approach. Rather than submit the raw event, I would add
the target topology to the event, and the filter would just select events
whose target matches its own topology. This has the advantage that I know
each event will only ever be processed by one of green or blue. Heck, I
could even use the main web application's blue-green deployment to drive
the Flink blue-green deployment. Due to the way I structure my results, I
don't care whether I get two rows of counts for a time window or one row of
counts, because I'm adding up the total counts across multiple rows and sum
is sum!
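A small sketch of that routing variant, with the "sum is sum" point spelled
out (the `Event` record and field names are illustrative, not from any real
schema):

```java
import java.util.List;

public class FrontEndRouting {

    // The web tier stamps each event with its target topology color.
    record Event(String key, String targetColor, long count) {}

    // Each topology's first operator keeps only its own events.
    static boolean accepts(String topologyColor, Event e) {
        return e.targetColor().equals(topologyColor);
    }

    public static void main(String[] args) {
        List<Event> window = List.of(
            new Event("a", "blue", 4),
            new Event("b", "green", 3),
            new Event("c", "blue", 3));

        long blueRow = window.stream()
            .filter(e -> accepts("blue", e)).mapToLong(Event::count).sum();
        long greenRow = window.stream()
            .filter(e -> accepts("green", e)).mapToLong(Event::count).sum();

        // Two partial rows (7 + 3) aggregate to the same total as one
        // row would, so a mid-window switch is harmless here.
        System.out.println(blueRow + greenRow); // 10
    }
}
```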


Anyone else had to try and deal with this type of thing?

-stephenc
