Hi Shammon,
Sorry for the inaccurate description of my last reply. Let me restate my
question again:
Fact 1: we know that ValueState here should not
serialized/de-serialized, so it is a good practice to mark it with
"transient".
Fact 2: on the other hand, if we don't mark it with "transient", it will
be initialized to null, and this null value will be
serialized/de-serialized. I think it will occur some overhead if the
number of partitions is very large.
Given the two facts above, the program works well in both cases in terms
of accuracy. And my question is: is there any performance benchmark in
real (large) applications to compare two cases?
Feel free to point out if I've misunderstood.
On 2023/02/24 11:01:51 Shammon FY wrote:
> Hi
>
> Sorry that I don't quite understand your question. I think the above
> functions will only be deserialized when the job is submitted, do you
want
> to test the impact of this on submission throughput?
>
> Best,
> Shammon
>
>
> On Fri, Feb 24, 2023 at 3:04 PM Zhongpu Chen <ch...@gmail.com> wrote:
>
> > Hi Gen,
> >
> > Thanks for your explanation.
> >
> > Back to this code snippet, since they are not marked with "transient"
> > now, I suppose Flink will use avro to serialize them (null values). Is
> > there any benchmark to show the performance test between null values
> > serialization and "transient"? I mean, it is indeed not good to write
> > them with "transient", but it works. So is there any performance
lose here?
> >
> >
> > On 2023/02/24 06:47:21 Gen Luo wrote:
> > > Hi,
> > >
> > > ValueState is a handle rather than an actual value. So it should
never
> > be
> > > serialized. In fact, ValueState itself is not a Serializable. It
> > should be
> > > ok to always mark it as transient.
> > >
> > > In this case, I suppose it works because the ValueState is not set
> > (which
> > > happens during the runtime) when the function is serialized (while
> > > deploying). But it's not good.
> > >
> > > On Fri, Feb 24, 2023 at 10:29 AM Zhongpu Chen <ch...@gmail.com>
wrote:
> > >
> > > > Hi,
> > > >
> > > > When I am reading the code from flink-training-repo [1], I
noticed the
> > > > following code:
> > > >
> > > > ```java
> > > >
> > > > public static class EnrichmentFunction
> > > > extends RichCoFlatMapFunction<TaxiRide, TaxiFare, RideAndFare> {
> > > >
> > > > private ValueState<TaxiRide> rideState; private
> > ValueState<TaxiFare> fareState;
> > > > ...
> > > > }
> > > >
> > > > ```
> > > >
> > > > From my understanding, since ValueState variables here are scoped
> > to each
> > > > instance, they should not be serialized for the performance sake.
> > Thus, we
> > > > should always mark them with "transient". Similar discussion can be
> > found
> > > > here [2].
> > > >
> > > > Should we always mark ValueState as "transient", and why? Please
> > help me
> > > > to figure it out.
> > > >
> > > > [1]
> > > >
> >
> >
https://github.com/apache/flink-training/blob/master/rides-and-fares/src/solution/java/org/apache/flink/training/solutions/ridesandfares/RidesAndFaresSolution.java
> > > >
> > > > [2]
> > > >
> >
> >
https://stackoverflow.com/questions/72556202/flink-managed-state-as-transient
> > > >
> > >
> >
>