On Mon, 11 Feb 2019 at 14:10, Stephen Connolly <stephen.alan.conno...@gmail.com> wrote:
> Another possibility would be injecting pseudo events into the source and having a stateful filter.
>
> The event would be something like "key X is now owned by green".
>
> I can do that because getting a list of keys seen in the past X minutes is cheap (we have it already).
>
> But it's unclear what impact adding such state to the filter would have.

Hmmm, it might not need to be quite so stateful: if the filter were implemented as a BroadcastProcessFunction or a KeyedBroadcastProcessFunction, I could run the key -> threshold function and compare the result to the level from the broadcast context... that way the broadcast events wouldn't need to be associated with any specific key and could just be {"level":56}. (Rough sketches of the plain filter, the broadcast-driven variant and the web-tagged variant are appended at the bottom of this mail.)

> On Mon 11 Feb 2019 at 13:33, Stephen Connolly <stephen.alan.conno...@gmail.com> wrote:
>
>> On Mon, 11 Feb 2019 at 13:26, Stephen Connolly <stephen.alan.conno...@gmail.com> wrote:
>>
>>> I have my main application updating with a blue-green deployment strategy, whereby a new version (always called green) starts receiving an initial fraction of the web traffic and then - based on the error rates - we increase the % of traffic until 100% of traffic is being handled by the green version. At that point we decommission blue, and green becomes the new blue when the next version comes along.
>>>
>>> Applied to Flink, my initial thought is that you would run the two topologies in parallel, but the first operation of each topology would be a filter based on the key.
>>>
>>> You would basically use a consistent transformation of the key into a number between 0 and 100, and the filter would be:
>>>
>>> (key) -> color == green ? f(key) < level : f(key) >= level
>>>
>>> Then I can use a suitable metric to determine whether the new topology is working, and ramp the level up or down accordingly.
>>>
>>> One issue I foresee is what happens if the level changes mid-window: I will have output from both topologies when the window ends.
>>>
>>> In the case of my output, which is aggregatable, I will get the same results from two rows as from one row *provided* that the switch from blue to green is synchronized between the two topologies. That sounds like a hard problem, though.
>>>
>>> Another thought I had was to let the web front-end decide, based on the same key-vs-level approach. Rather than submit the raw event, I would add the target topology to the event, and the filter would just select on whether it is the target topology. This has the advantage that I know each event will only ever be processed by one of green or blue. Heck, I could even use the main web application's blue-green deployment to drive the Flink blue-green deployment,
>>
>> In other words, if a blue web node receives an event upload it adds "blue", whereas if a green web node receives an event upload it adds "green" (not quite those strings, but rather the web deployment sequence number). This has the advantage that the web nodes do not need to parse the event payload. The % of web traffic will result in the matching % of events being sent to blue and green. It also means that all keys get processed at the target % during the deployment, which can help flush out bugs.
>>
>> I can therefore stop the old topology more than one window after the green web node started getting 100% of traffic, in order to allow any windows still in flight to flush all the way to the datastore...
>> Out-of-order events would be tagged as green once green has 100% of the traffic, and so can be processed correctly...
>>
>> And I can completely ignore topology migration serialization issues...
>>
>> Sounding very tempting... there must be something wrong...
>>
>> (or maybe my data storage plan just allows me to make this kind of optimization!)
>>
>>> as, due to the way I structure my results, I don't care whether I get two rows of counts for a time window or one, because I'm adding up the total counts across multiple rows and sum is sum!
>>>
>>> Anyone else had to try and deal with this type of thing?
>>>
>>> -stephenc
>
> --
> Sent from my phone
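
P.S. For anyone following along, here is a rough, untested sketch of the plain key-vs-level gate. The BlueGreenGate name, the Tuple2<key, payload> event shape and the hashCode-based bucketing are placeholders for illustration, not our real job code:

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.java.tuple.Tuple2;

/** Keeps only the keys owned by this topology's colour. */
public class BlueGreenGate implements FilterFunction<Tuple2<String, String>> {

    private final boolean green; // true in the green topology, false in blue
    private final int level;     // % of keys currently owned by green, 0..100

    public BlueGreenGate(boolean green, int level) {
        this.green = green;
        this.level = level;
    }

    /** Consistent key -> [0, 100) mapping; must be identical in both jobs. */
    static int bucket(String key) {
        // Math.floorMod avoids negative buckets for negative hash codes.
        return Math.floorMod(key.hashCode(), 100);
    }

    @Override
    public boolean filter(Tuple2<String, String> event) {
        int b = bucket(event.f0); // f0 = key, f1 = payload in this sketch
        return green ? b < level : b >= level;
    }
}

The obvious drawback is that the level is fixed at job submission time, which is what pushes me towards the broadcast-driven variant.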
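
And a corresponding sketch of the broadcast-driven variant, where the {"level":56} pseudo events arrive on a second stream and land in broadcast state. The class and state names, the bare Integer level stream, and the wiring at the end are again assumptions rather than tested code:

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class BroadcastBlueGreenGate
        extends BroadcastProcessFunction<Tuple2<String, String>, Integer, Tuple2<String, String>> {

    /** Single-entry broadcast state holding the current green level (0..100). */
    public static final MapStateDescriptor<String, Integer> LEVEL =
            new MapStateDescriptor<>("blue-green-level", String.class, Integer.class);

    private final boolean green;        // which colour this topology is
    private final int defaultLevel;     // used until the first level event arrives

    public BroadcastBlueGreenGate(boolean green, int defaultLevel) {
        this.green = green;
        this.defaultLevel = defaultLevel;
    }

    @Override
    public void processElement(Tuple2<String, String> event, ReadOnlyContext ctx,
                               Collector<Tuple2<String, String>> out) throws Exception {
        Integer level = ctx.getBroadcastState(LEVEL).get("level");
        int effective = level != null ? level : defaultLevel;
        int b = Math.floorMod(event.f0.hashCode(), 100); // same bucketing as the plain gate
        if (green ? b < effective : b >= effective) {
            out.collect(event);
        }
    }

    @Override
    public void processBroadcastElement(Integer newLevel, Context ctx,
                                        Collector<Tuple2<String, String>> out) throws Exception {
        // Every parallel instance sees the same level event and records it.
        ctx.getBroadcastState(LEVEL).put("level", newLevel);
    }
}

Wired up something like this (events and levelUpdates are whatever streams the job already has):

// events: DataStream<Tuple2<String, String>>, levelUpdates: DataStream<Integer>
DataStream<Tuple2<String, String>> gated = events
        .connect(levelUpdates.broadcast(BroadcastBlueGreenGate.LEVEL))
        .process(new BroadcastBlueGreenGate(true, 0));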
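
Finally, if the web front-end stamps each event with its target deployment, the Flink side collapses to a trivial stateless filter; something like the following sketch, with the event modelled as Tuple2<deployment tag, payload> purely for illustration:

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.java.tuple.Tuple2;

/** Keeps only events stamped with this topology's deployment id by the web tier. */
public class TargetTopologyFilter implements FilterFunction<Tuple2<String, String>> {

    private final String myDeploymentId; // e.g. the web deployment sequence number

    public TargetTopologyFilter(String myDeploymentId) {
        this.myDeploymentId = myDeploymentId;
    }

    @Override
    public boolean filter(Tuple2<String, String> event) {
        return myDeploymentId.equals(event.f0); // f0 = target deployment, f1 = payload
    }
}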