Happy Holidays everyone! tl;dr: I need to aggregate movie rental information that is stored in one DynamoDB table and keep a running total of the aggregation in another table. How do I ensure exactly-once aggregation?
I currently store movie rental information in a DynamoDB table named MovieRentals: {movie_title, rental_period_in_days, order_date, rent_amount}. We have millions of movie rentals happening on any given day, and our web application needs to display the aggregated rental amount for any given movie title.

I am planning to use Flink to aggregate rental amounts by movie_title on the MovieRentals DynamoDB stream and store the aggregated amounts in another DynamoDB table named RentalAmountsByMovie: {movie_title, total_rental_amount}.

How do I ensure that the RentalAmountsByMovie amounts are accurate? In other words, how do I prevent the results from any checkpoint from updating the RentalAmountsByMovie records more than once?

1) Do I need to store checkpoint IDs in the RentalAmountsByMovie table and do conditional updates to handle the scenario described above?

2) I could possibly implement a TwoPhaseCommitSinkFunction that talks to DynamoDB. However, according to the Flink documentation the commit function can be called more than once and hence needs to be idempotent, so even this solution requires checkpoint IDs to be stored in the target store.

3) Another pattern seems to be storing the time-window aggregation results in the RentalAmountsByMovie table and having the web app compute the running total on the fly. I don't like this solution because of its latency implications for the web app.

4) Maybe I can use Flink's Queryable State feature. However, that feature seems to be in beta: https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/state/queryable_state.html

I imagine this is a very common aggregation use case. How do folks usually handle **updating aggregated results in Flink external sinks**? I appreciate any pointers. Happy to provide more details if needed. Thanks!
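To make option 1 concrete, here is a minimal sketch of the conditional-update idea. A plain dict stands in for the RentalAmountsByMovie table, and `last_checkpoint_id` is a hypothetical attribute name I made up; in real DynamoDB this would be an `UpdateItem` guarded by a `ConditionExpression` comparing the stored checkpoint ID against the incoming one. This is just to illustrate the idempotence guarantee I'm after, not an actual sink implementation.

```python
# Simulated RentalAmountsByMovie table:
# movie_title -> {"total_rental_amount", "last_checkpoint_id"}
rental_amounts_by_movie = {}

def commit_aggregate(movie_title, delta, checkpoint_id):
    """Apply a per-checkpoint aggregation delta idempotently.

    Each sink record carries the checkpoint ID that produced it; the
    update is applied only if that checkpoint has not already been
    committed for this movie, so a replayed commit is a no-op.
    """
    row = rental_amounts_by_movie.setdefault(
        movie_title, {"total_rental_amount": 0, "last_checkpoint_id": -1}
    )
    if checkpoint_id <= row["last_checkpoint_id"]:
        return False  # checkpoint already applied; ignore the replay
    row["total_rental_amount"] += delta
    row["last_checkpoint_id"] = checkpoint_id
    return True

commit_aggregate("Inception", 100, checkpoint_id=1)
commit_aggregate("Inception", 100, checkpoint_id=1)  # replayed commit, ignored
commit_aggregate("Inception", 50, checkpoint_id=2)
print(rental_amounts_by_movie["Inception"]["total_rental_amount"])  # 150
```

Since Flink checkpoint IDs are monotonically increasing, a single stored `last_checkpoint_id` per key should be enough to reject replays, which I believe is also what makes the commit phase of a TwoPhaseCommitSinkFunction (option 2) safely re-callable.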