Re: [Discuss][FLINK-8297]A solution for FLINK-8297 Timebased RocksDBListState

SHI Xiaogang Tue, 16 Apr 2019 18:53:48 -0700

Hi all,

I can provide more details about the private solution mentioned by Yun.

We noticed that the re-ordering of elements due to internal clock jump will
break the semantics of LIST.  So we decide not to use timestamp in the keys.
Instead, we format the key in a list state as STATE_NAME#SEQUENCE_NUMBER,
where SEQUENCE_NUMBER is generated by the state backend.
The state backend keeps the value of the next sequence number . To avoid
unnecessary I/O access, the next sequence number is maintained in memory.

When taking checkpoints, the state backend will save the next sequence
number to checkpoints and restore from them.
When the parallelism is changed, the state backend may be assigned more
than one values at restoring.  In such cases, the state backend will
restore with the largest one. That way, we can ensure that the elements
newly added to the list are always in the list tail, hence getting rid of
reordering.

Furthermore, in my opinion, we should support different types of states for
collections:
* List state: Current implementation of list state is very good for small
lists.
* Deque state, or dispersed list state: It also supports ordered access but
stores elements in multiple RocksDB entries. It may be suitable for large
lists as it avoids large memory foot-prints. But for small lists, its
performance may be poorer than ListState due to extra I/O operations. (BTW,
i think my current implementation of dispersed list states is not very
good. We can rework on it to find a better one.)
* Set state, or bag state: It can be implemented on top of map state. It
does not support ordered access, but can achieve relatively good
performance for large collections and allow access by value.

Regards,
Xiaogang

Andrey Zagrebin <and...@ververica.com> 于2019年4月16日周二 下午4:51写道：

> Hi Faxian,
>
> True, we can resolve timestamp conflicts putting values into the same row,
> good point.
> Then re-ordering in case of internal clock jump changes behaviour comparing
> with the list state we have now.
> In this case, it can be similar to dispersing elements by hash and we can
> call it a bag, not list.
>
> Best,
> Andrey
>
> On Tue, Apr 16, 2019 at 5:29 AM faxianzhao <faxianz...@gmail.com> wrote:
>
> > Hi Yun
> > I think whether atomic increased number or timestamp, the key point is
> > disperse the elements in the different keys.
> > My point is how to design a useful key.
> > For the atomic increased number, it will array the elements one by one
> but
> > I
> > think the key is useless. Because the max key is not the elements count,
> > when we implement the remove method.
> > Currently, for the CountEvictor, TimeEvictor and TTL scenario, we should
> > iteration all of the elements to find what we want. But if we use
> timestamp
> > key, we could terminal the iteration early to save performance or start
> > from
> > the available timestamp to iteration the rest elements.
> >
> >
> >
> >
> > --
> > Sent from:
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
> >
>

Re: [Discuss][FLINK-8297]A solution for FLINK-8297 Timebased RocksDBListState

Reply via email to