[GitHub] flink pull request #3184: [5456] [docs] state intro and new interfaces

aljoscha Fri, 27 Jan 2017 08:41:54 -0800

Github user aljoscha commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3184#discussion_r98240690
  
    --- Diff: docs/dev/stream/state.md ---
    @@ -39,40 +39,247 @@ if necessary) to allow applications to hold very large state.
     This document explains how to use Flink's state abstractions when developing an application.
     
     
    -## Keyed State and Operator state
    +## Operator State and Keyed State
     
    -There are two basic state backends: `Keyed State` and `Operator State`.
    +There are two basic kinds of state in Flink: `Operator State` and `Keyed State`.
     
    -#### Keyed State
    +### Operator State
     
    -*Keyed State* is always relative to keys and can only be used in functions and operators on a `KeyedStream`.
    -Examples of keyed state are the `ValueState` or `ListState` that one can create in a function on a `KeyedStream`, as
    -well as the state of a keyed window operator.
    -
    -Keyed State is organized in so called *Key Groups*. Key Groups are the unit by which keyed state can be redistributed and
    -there are as many key groups as the defined maximum parallelism.
    -During execution each parallel instance of an operator gets one or more key groups.
    +With *Operator State* (or *non-keyed state*), each operator state is
    +bound to one parallel operator instance.
    +The Kafka source connector is a good motivating example for the use of Operator State
    +in Flink. Each parallel instance of this Kafka consumer maintains a map
    +of topic partitions and offsets as its Operator State.
     
    -#### Operator State
    +New interfaces in Flink 1.2 subsume the `Checkpointed` interface in Flink 1.0 and
    +1.1, which has been deprecated. 
    +These new Operator State interfaces support redistributing state among
    +parallel operator instances when the parallelism is changed. 
     
    -*Operator State* is state per parallel subtask. It subsumes the `Checkpointed` interface in Flink 1.0 and Flink 1.1.
    -The new `CheckpointedFunction` interface is basically a shortcut (syntactic sugar) for the Operator State.
    -
    -Operator State needs special re-distribution schemes when parallelism is changed. There can be different variations of such
    -schemes; the following are currently defined:
    +There can be different schemes for doing this redistribution; the following are currently defined:
     
       - **List-style redistribution:** Each operator returns a List of state elements. The whole state is logically a concatenation of
         all lists. On restore/redistribution, the list is evenly divided into as many sublists as there are parallel operators.
         Each operator gets a sublist, which can be empty, or contain one or more elements.
     
     
    +### Keyed State
    +
    +*Keyed State* is always relative to keys and can only be used in functions and operators on a `KeyedStream`.
    +
    +You can think of Keyed State as Operator State that has been partitioned,
    +or sharded, with exactly one state-partition per key. 
    +Each keyed-state is logically bound to a unique
    +composite of <parallel-operator-instance, key>, and since each key
    +"belongs" to exactly one parallel instance of a keyed operator, we can
    +think of this simply as <operator, key>. 
    +
    +Keyed State is further organized into so-called *Key Groups*. Key Groups are the
    +atomic unit by which Flink can redistribute Keyed State; 
    +there are exactly as many Key Groups as the defined maximum parallelism.
    +During execution each parallel instance of a keyed operator works with the keys
    +for one or more Key Groups.
    +
    +
     ## Raw and Managed State
     
     *Keyed State* and *Operator State* exist in two forms: *managed* and *raw*.
     
     *Managed State* is represented in data structures controlled by the Flink runtime, such as internal hash tables, or RocksDB.
    -Examples are "ValueState", "ListState", etc. Flink's runtime encodes the states and writes them into the checkpoints.
    +Examples are "ValueState", "ListState", etc. Flink's runtime encodes
    +the states and writes them into the checkpoints. 
     
    -*Raw State* is state that users and operators keep in their own data structures. When checkpointed, they only write a sequence of bytes into
    +*Raw State* is state that operators keep in their own data structures. When checkpointed, they only write a sequence of bytes into
     the checkpoint. Flink knows nothing about the state's data structures and sees only the raw bytes.
     
    +All datastream functions can use managed state, but the raw state interfaces can only be used when implementing operators.
    +Using managed state (rather than raw state) is recommended, since with
    +managed state Flink is able to automatically redistribute state when the parallelism is
    +changed, and also do better memory management.
    +
    +
    +## Using Managed Operator State
    +
    +A stateful function can implement either the more general `CheckpointedFunction`
    +interface, or the `ListCheckpointed<T extends Serializable>` interface (which is semantically closer to the old
    +`Checkpointed` one).
    +
    +[The Flink Function Migration documentation](../migration.html) has
    --- End diff --
    
    I think we shouldn't refer to the migration guide here, because these interfaces are not only relevant in a migration context: they are valid for any Flink job.
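
For readers of the diff, the list-style redistribution it describes (concatenate all per-instance lists, then divide the result evenly among the new parallel instances) can be sketched in plain Java. This is only an illustration under assumptions: the class `ListRedistributionSketch` and the method `redistribute` are hypothetical names, the code uses no Flink API, and Flink's actual repartitioning logic may assign elements differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of list-style redistribution; not Flink code.
public class ListRedistributionSketch {

    // Concatenates the per-instance state lists and splits the result into
    // newParallelism contiguous sublists whose sizes differ by at most one.
    public static <T> List<List<T>> redistribute(List<List<T>> oldStates, int newParallelism) {
        if (newParallelism < 1) {
            throw new IllegalArgumentException("newParallelism must be at least 1");
        }
        // The whole state is logically the concatenation of all lists.
        List<T> all = new ArrayList<>();
        for (List<T> state : oldStates) {
            all.addAll(state);
        }
        int base = all.size() / newParallelism;   // minimum elements per instance
        int extra = all.size() % newParallelism;  // the first 'extra' instances get one more
        List<List<T>> result = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < newParallelism; i++) {
            int size = base + (i < extra ? 1 : 0);
            // A sublist may be empty when there are fewer elements than instances.
            result.add(new ArrayList<>(all.subList(start, start + size)));
            start += size;
        }
        return result;
    }

    public static void main(String[] args) {
        // Two old instances holding 3 and 2 elements, rescaled to 3 instances.
        List<List<String>> oldStates = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("d", "e"));
        System.out.println(redistribute(oldStates, 3)); // prints [[a, b], [c, d], [e]]
    }
}
```

Scaling up beyond the number of elements leaves some instances with an empty sublist, which matches the "can be empty" case mentioned in the diff.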


