anishshri-db commented on code in PR #50177: URL: https://github.com/apache/spark/pull/50177#discussion_r2029505806
########## docs/streaming/structured-streaming-state-data-source.md: ########## @@ -174,6 +174,24 @@ The following configurations are optional: <td>latest committed batchId</td> <td>Represents the last batch to read in the read change feed mode. This option requires 'readChangeFeed' to be set to true.</td> </tr> +<tr> Review Comment: I think it's ok to keep in the table? Just made it explicit for the relevant options. ########## docs/streaming/structured-streaming-transform-with-state.md: ########## @@ -0,0 +1,322 @@ +--- +layout: global +displayTitle: Structured Streaming Programming Guide +title: Structured Streaming Programming Guide +license: | + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--- + +# Overview + +TransformWithState is the new arbitrary stateful operator in Structured Streaming, available since the Apache Spark 4.0 release. This operator is the next-generation replacement for the older mapGroupsWithState/flatMapGroupsWithState API for arbitrary stateful processing in Apache Spark. + +This operator supports a broad set of features such as object-oriented stateful processor definitions, composite state types, automatic TTL-based eviction, and timers, and can be used to build business-critical operational use cases. 
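As a rough illustration of how a query with this operator might look, here is a hedged Scala sketch (the `WordCountProcessor` name and the input `Dataset` are hypothetical, and exact method signatures may differ between Spark versions):

```scala
import org.apache.spark.sql.streaming.{OutputMode, TimeMode}

// inputDs: a streaming Dataset[String], e.g. from a socket or Kafka source.
// WordCountProcessor is a user-defined StatefulProcessor (sketched later).
val counts = inputDs
  .groupByKey(word => word)
  .transformWithState(
    new WordCountProcessor(),     // user-defined stateful logic
    TimeMode.ProcessingTime(),    // time semantics for timers/TTL
    OutputMode.Update())          // emit updated counts each batch
```

The grouping key, time mode, and output mode together determine how the processor sees its input and when its output is emitted.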
+ +# Language Support + +`TransformWithState` is available in Scala, Java, and Python. Note that in Python, the operator is called `transformWithStateInPandas`, similar to other operators that interact with the Pandas interface in Apache Spark. + +# Components of a TransformWithState Query + +A transformWithState query typically consists of the following components: +- Stateful Processor - A user-defined stateful processor that defines the stateful logic +- Output Mode - Output mode for the query, such as Append, Update, etc. +- Time Mode - Time mode for the query, such as EventTime, ProcessingTime, etc. +- Initial State - An optional initial state batch DataFrame used to pre-populate the state + +In the following sections, we will go through the above components in more detail. + +## Defining a Stateful Processor + +A stateful processor is the core of the user-defined logic used to operate on the input events. A stateful processor is defined by extending the StatefulProcessor class and implementing a few methods. + +A typical stateful processor deals with the following constructs: +- Input Records - Input records received by the stream +- State Variables - Zero or more class-specific members used to store user state +- Output Records - Zero or more output records produced by the processor + +A stateful processor uses the object-oriented paradigm to define the stateful logic by implementing the following methods: + - `init` - Initialize the stateful processor and define any state variables as needed Review Comment: Added a note below -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
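The processor structure described above (extend `StatefulProcessor`, declare state variables in `init`) can be sketched in Scala roughly as follows; this is a non-authoritative sketch, the `WordCountProcessor` name is hypothetical, and exact signatures (e.g. of `getValueState` and `handleInputRows`) may vary by Spark version:

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.streaming.{OutputMode, StatefulProcessor, TimeMode,
  TimerValues, TTLConfig, ValueState}

// Counts occurrences of each key across micro-batches.
class WordCountProcessor extends StatefulProcessor[String, String, (String, Long)] {

  @transient private var countState: ValueState[Long] = _

  override def init(outputMode: OutputMode, timeMode: TimeMode): Unit = {
    // State variables are declared here, via the processor's handle.
    countState = getHandle.getValueState[Long](
      "count", Encoders.scalaLong, TTLConfig.NONE)
  }

  override def handleInputRows(
      key: String,
      inputRows: Iterator[String],
      timerValues: TimerValues): Iterator[(String, Long)] = {
    // Read prior state (if any), fold in the new rows, and persist.
    val newCount = countState.getOption().getOrElse(0L) + inputRows.size
    countState.update(newCount)
    // Zero or more output records may be emitted per invocation.
    Iterator.single((key, newCount))
  }
}
```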
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org