On Mon, 11 Feb 2019 at 13:26, Stephen Connolly <
stephen.alan.conno...@gmail.com> wrote:

> I have my main application updating with a blue-green deployment strategy
> whereby a new version (always called green) starts receiving an initial
> fraction of the web traffic and then - based on the error rates - we
> progress the % of traffic until 100% of traffic is being handled by the
> green version, at which point we decommission blue and green becomes the
> new blue when the next version comes along.
>
> Applied to Flink, my initial thought is that you would run the two
> topologies in parallel, but the first action of each topology would be a
> filter based on the key.
>
> You basically would use a consistent transformation of the key into a
> number between 0 and 100 and the filter would be:
>
>     (key) -> color == green ? f(key) < level : f(key) >= level
>
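
For concreteness, a rough sketch of that first filter operator (assuming a
stable hash to bucket keys into [0, 100); the Event type, the getKey()
accessor and how the level reaches the job are all illustrative, not
settled):

    import org.apache.flink.api.common.functions.FilterFunction;

    /**
     * First operator of each topology: keeps only the events this
     * colour should process. bucket(key) is stable across JVMs, so
     * blue and green always agree on which side of the level a given
     * key falls.
     */
    public class ColourFilter implements FilterFunction<Event> {

        private final boolean green; // true in the green topology
        private final int level;     // % of keys routed to green

        public ColourFilter(boolean green, int level) {
            this.green = green;
            this.level = level;
        }

        private static int bucket(String key) {
            // mask rather than Math.abs: abs(Integer.MIN_VALUE) overflows
            return (key.hashCode() & Integer.MAX_VALUE) % 100;
        }

        @Override
        public boolean filter(Event event) {
            int b = bucket(event.getKey());
            return green ? b < level : b >= level;
        }
    }

(In this sketch the level is fixed when the job starts; ramping it up or
down in a running job would mean re-reading it from somewhere, which is
exactly where the mid-window problem below comes from.)
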
> Then I can use a suitable metric to determine whether the new topology is
> working, and ramp the level up or down accordingly.
>
> One issue I foresee is what happens if the level changes mid-window: I
> will have output from both topologies when the window ends.
>
> In the case of my output, which is aggregatable, I will get the same
> results from two rows as from one row *provided* that the switch from blue
> to green is synchronized between the two topologies. That sounds like a
> hard problem, though.
>
>
> Another thought I had was to let the web front-end decide based on the
> same key vs level approach. Rather than submit the raw event, I would add
> the target topology to the event and the filter just selects based on
> whether it is the target topology. This has the advantage that I know each
> event will only ever be processed by one of green or blue. Heck, I could
> even use the main web application's blue-green deployment to drive the
> Flink blue-green deployment.
>

In other words, if a blue web node receives an event upload it adds "blue",
whereas if a green web node receives one it adds "green" (not quite those
strings, but rather the web deployment sequence number). This has the
advantage that the web nodes do not need to parse the event payload. The %
split of web traffic will result in the matching % of events being sent to
blue and green. This also means that all keys get processed at the target %
during the deployment, which can help flush out bugs.
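
With that approach the per-topology filter collapses to a tag comparison
(again a sketch, assuming a hypothetical Event type with a getDeployment()
accessor carrying the web tier's sequence number):

    import org.apache.flink.api.common.functions.FilterFunction;

    /** Keeps only events stamped with this topology's deployment number. */
    public class DeploymentFilter implements FilterFunction<Event> {

        private final long deployment; // this topology's sequence number

        public DeploymentFilter(long deployment) {
            this.deployment = deployment;
        }

        @Override
        public boolean filter(Event event) {
            // The web tier already decided blue vs green, so no hashing
            // or level is needed here; each event matches exactly one job.
            return event.getDeployment() == deployment;
        }
    }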

I can therefore stop the old topology more than one window length after the
green web nodes start getting 100% of traffic, allowing any windows still
in flight to flush all the way to the datastore...

Out-of-order events would be tagged as green once green is handling 100% of
the traffic, and so can be processed correctly...

And I can completely ignore the serialization issues of migrating state
between topology versions...

Sounding very tempting... there must be something wrong...

(or maybe my data storage plan just allows me to make this kind of
optimization!)


> as, due to the way I structure my results, I don't care whether I get two
> rows of counts for a time window or one, because I'm adding up the total
> counts across multiple rows and sum is sum!
>
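
(Illustrative numbers: if blue writes a row (window 12:00, count = 40) and
green writes a row (window 12:00, count = 60) for the same window, summing
the count column grouped by window gives 100, exactly what a single row of
100 would have given.)
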

>
> Anyone else had to try and deal with this type of thing?
>
> -stephenc
>
>
>
