Side inputs don't change for the lifetime of a bundle. Only on a new bundle would you get a possibly updated view of the side input, so you may not see the priority changes as quickly as you expect. How quickly this happens depends on the runner's internal implementation details.
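To make the bundle-scoping concrete, here is a toy stdlib-only model of that behavior (this is NOT the Beam API; `store` and `start_bundle` are invented for illustration). The view is snapshotted once per bundle, so an update made while a bundle is in flight only becomes visible when the next bundle starts:

```python
# Toy model of bundle-scoped side-input caching; not the Beam API.
store = {"priority": 1}  # hypothetical mutable source of truth

def start_bundle(store):
    """Model the runner snapshotting the side input at bundle start."""
    return dict(store)

view = start_bundle(store)
store["priority"] = 99               # priority changes mid-bundle...
seen_in_bundle = view["priority"]    # ...but this bundle keeps its stale view (1)

view = start_bundle(store)           # a new bundle refreshes the view
seen_next_bundle = view["priority"]  # now 99
```

This is why a side-input-driven priority update can lag: the lag is bounded by how often the runner starts new bundles, which is runner-specific.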
Is there a reason you can't process the rows in parallel? Can you pause
in-flight work on a row arbitrarily, or must whatever work you start run to
completion?

On Sat, Aug 24, 2019 at 2:11 AM Juan Carlos Garcia <jcgarc...@gmail.com> wrote:

> Sorry, I hit send too soon. After updating the priority I would execute a
> group-by DoFn and then use an external sort (by this priority key) to
> guarantee that the next DoFn receives a sorted Iterable.
>
> JC
>
> Juan Carlos Garcia <jcgarc...@gmail.com> schrieb am Sa., 24. Aug. 2019, 11:08:
>
>> Hi,
>>
>> The main puzzle here is how to deliver the priority change of a row
>> during a given window. My best shot would be a side input
>> (PCollectionView) containing the priority changes; the slow worker
>> transform would then extract this side input and update the
>> corresponding row with the new priority.
>>
>> Again, it's just an idea.
>>
>> JC
>>
>> Chad Dombrova <chad...@gmail.com> schrieb am Sa., 24. Aug. 2019, 03:32:
>>
>>> Hi all,
>>> Our team is brainstorming how to solve a particular type of problem
>>> with Beam, and it's a bit beyond our experience level, so I thought
>>> I'd turn to the experts for some advice.
>>>
>>> Here are the pieces of our puzzle:
>>>
>>> - a data source with the following properties:
>>>   - rows represent work to do
>>>   - each row has an integer priority
>>>   - rows can be added or deleted
>>>   - the priority of a row can be changed
>>>   - <10k rows
>>> - a slow worker Beam transform (Map/ParDo) that consumes one row at
>>>   a time
>>>
>>> We want a streaming pipeline that delivers rows from our data store
>>> to the worker transform, re-sorting the source by priority each time
>>> a new row is delivered. The goal is that last-second changes in
>>> priority can affect the order of the slowly yielding read. Throughput
>>> is not a major concern since the worker is the bottleneck.
>>>
>>> I have a few questions:
>>>
>>> - Is this the sort of problem that BeamSQL can solve? I'm not sure
>>>   how sorting and re-sorting are handled there in a streaming
>>>   context...
>>> - I'm unclear on how back-pressure in Flink affects streaming reads.
>>>   It's my hope that data/messages are left in the data source until
>>>   back-pressure subsides, rather than read eagerly into memory. Can
>>>   someone clarify this for me?
>>> - Is there a combination of windowing and triggering that can solve
>>>   this continual re-sorting plus slow yielding problem? It's not
>>>   unreasonable to keep all of our rows in memory on Flink, as long as
>>>   we're snapshotting state.
>>>
>>> Any advice on how an expert Beamer would solve this is greatly
>>> appreciated!
>>>
>>> Thanks,
>>> chad
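The "continually re-sorted work queue" in the quoted thread can be modeled in plain Python independently of Beam. The sketch below (class name `ReprioritizableQueue` and the lower-number-means-more-urgent convention are my own assumptions, not anything from the thread) uses a heap with lazy invalidation: a priority change pushes a fresh entry and marks the old one stale, so the slow worker always pops the row with the current best priority without re-sorting everything:

```python
import heapq
import itertools

class ReprioritizableQueue:
    """Heap with lazy invalidation. Updating a row's priority pushes a new
    heap entry and voids the old one, so pops reflect the latest priorities.
    Convention assumed here: a lower number means higher urgency."""

    def __init__(self):
        self._heap = []                # entries: [priority, seq, row_id]
        self._entries = {}             # row_id -> its live heap entry
        self._seq = itertools.count()  # tie-breaker for equal priorities

    def put(self, row_id, priority):
        # Handles both new rows and last-second priority changes.
        if row_id in self._entries:
            self._entries[row_id][-1] = None  # mark the old entry stale
        entry = [priority, next(self._seq), row_id]
        self._entries[row_id] = entry
        heapq.heappush(self._heap, entry)

    def pop(self):
        # Skip stale entries until a live one surfaces.
        while self._heap:
            priority, _, row_id = heapq.heappop(self._heap)
            if row_id is not None:
                del self._entries[row_id]
                return row_id, priority
        raise KeyError("queue is empty")

q = ReprioritizableQueue()
q.put("row-a", 5)
q.put("row-b", 3)
q.put("row-a", 1)    # last-second priority change bumps row-a to the front
first = q.pop()      # ("row-a", 1)
second = q.pop()     # ("row-b", 3)
```

At <10k rows this structure easily fits in memory, which matches the original poster's observation that keeping all rows in memory on Flink is acceptable as long as state is snapshotted; the open question in the thread remains how to feed the priority updates into such a structure through Beam's model (side inputs, or state and timers).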