We have a Beam/Dataflow pipeline that reads from Pub/Sub. As the first step in
the pipeline we run DeduplicatePerKey on the incoming records, keyed by a
record key, with an event_time_duration of 4 hours. We notice that this
increases the wall time of this step a lot (it runs into days), and the
Dataflow runner autoscales heavily: it easily reaches 100 workers and stays in
the 60-100 range. We tried reducing the deduping window to 30 minutes (and
also tried a 1-hour window), but the wall time is still in days and the worker
count stays in the 60-100 range. At the end of each window the worker count
drops, but not by much. We also see high sustained CPU (over 90%) on all
workers and low memory usage. This happens with just over 20M Pub/Sub messages
read. Is DeduplicatePerKey just an expensive operation, or is this due to the
window size?
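
For reference, the relevant head of the pipeline looks roughly like this
(Python SDK; the topic name and the key-extraction logic are placeholders,
not our real code):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.deduplicate import DeduplicatePerKey
from apache_beam.utils.timestamp import Duration

TOPIC = "projects/my-project/topics/my-topic"  # placeholder

def key_by_record_key(msg):
    # Placeholder: the real key is derived from the record payload.
    return (msg, msg)

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    deduped = (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic=TOPIC)
        | "KeyByRecordKey" >> beam.Map(key_by_record_key)
        # Drop any element whose key was already seen within the
        # last 4 hours of event time.
        | "Dedup" >> DeduplicatePerKey(
            event_time_duration=Duration(seconds=4 * 60 * 60))
    )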

Also, we can't determine whether the autoscaling is due to (a) the increased
wall time / data-freshness lag, (b) workers being held to store state, so the
same workers can't be used for other transforms, or (c) deduping being
inherently very costly. There isn't much API documentation on how
DeduplicatePerKey works. I assume it passes through the first event with a
given key and stores that key at the same time, so that if another event with
the same key arrives within the event_time_duration, it drops the new event
and doesn't let it pass through. I'm not sure why that lookup would be
so expensive.
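
To make the assumption concrete, here is a minimal stateful-DoFn sketch of
the semantics I have in mind; this is illustrative only, not the actual
DeduplicatePerKey implementation. It keeps a per-key "seen" flag in state
plus an event-time timer that clears the flag once the window expires:

import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms import userstate
from apache_beam.transforms.timeutil import TimeDomain

class DedupSketchFn(beam.DoFn):
    # Per-key flag: set once the first element for a key is emitted.
    SEEN = userstate.ReadModifyWriteStateSpec('seen', VarIntCoder())
    # Event-time timer used to garbage-collect the flag.
    EXPIRY = userstate.TimerSpec('expiry', TimeDomain.WATERMARK)

    def __init__(self, window_secs):
        self.window_secs = window_secs

    def process(self,
                element,
                ts=beam.DoFn.TimestampParam,
                seen=beam.DoFn.StateParam(SEEN),
                expiry=beam.DoFn.TimerParam(EXPIRY)):
        if seen.read():
            return  # duplicate key within the window: drop it
        seen.write(1)
        expiry.set(ts + self.window_secs)  # clear state after the window
        yield element

    @userstate.on_timer(EXPIRY)
    def expire(self, seen=beam.DoFn.StateParam(SEEN)):
        seen.clear()

If that's roughly what the real transform does, each element costs one state
read (plus a write and a timer set for first-seen keys), so I'd expect the
cost to scale with element count and key cardinality rather than with the
window size, which makes the sustained 90% CPU even more puzzling.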
