[jira] [Comment Edited] (FLINK-31946) DynamoDB Sink Allow Multiple Item Writes

Curtis Jensen (Jira) Tue, 16 May 2023 10:54:03 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-31946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723237#comment-17723237
 ]


Curtis Jensen edited comment on FLINK-31946 at 5/16/23 5:52 PM:
----------------------------------------------------------------

Hello [~liangtl] 
Thank you for the reply.

To better describe a use case, I have aggregation data for how many times a 
user logs in from a specific IP address.  I also have aggregations for how many 
times any user logs in from that IP address.  These are two separate DynamoDB 
items with different partition keys.  For two different accounts logging in, I 
might have DynamoDB items like:

{{partition_key | sort_key   | count}}
{{--------------|------------|------}}
{{accountid-xxx | ip-1.1.1.1 | 1      # number of times account xxx logged in from ip 1}}
{{accountid-xxx | ip-1.1.1.2 | 1      # number of times account xxx logged in from ip 2}}
{{accountid-yyy | ip-1.1.1.1 | 1}}
{{ip-1.1.1.1    | total      | 2      # number of times any account logged in from ip 1}}
{{ip-1.1.1.2    | total      | 1      # number of times any account logged in from ip 2}}

 

When querying counts by account id, I also need the total stats for each IP 
address that account logs in from, so I have to make an additional query for 
each IP address.  I would like to optimize this by duplicating the IP total 
entries under the account partition_key, making a table like:

{{partition_key | sort_key         | count}}
{{--------------|------------------|------}}
{{accountid-xxx | ip-1.1.1.1       | 1}}
{{accountid-xxx | ip-1.1.1.2       | 1}}
{{accountid-xxx | ip-1.1.1.1-total | 2      # duplicate from pk: ip-1.1.1.1}}
{{accountid-xxx | ip-1.1.1.2-total | 1      # duplicate from pk: ip-1.1.1.2}}
{{accountid-yyy | ip-1.1.1.1       | 1}}
{{accountid-yyy | ip-1.1.1.1-total | 2      # duplicate from pk: ip-1.1.1.1}}
{{ip-1.1.1.1    | total            | 2}}
{{ip-1.1.1.2    | total            | 1}}

 

This would allow me to get all the aggregation data for the account and its IP 
addresses with one query (by accountid-xxx) instead of three queries (by 
accountid-xxx, ip-1.1.1.1, and ip-1.1.1.2).
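
To make the query-count difference concrete, here is a minimal in-memory sketch of the two layouts. It does not call DynamoDB; `build_table`, `query`, `stats_original`, and `stats_denormalized` are hypothetical helpers, and `query` only stands in for a DynamoDB Query on a single partition key.

```python
from collections import defaultdict


def build_table(items):
    """Group (partition_key, sort_key, count) tuples by partition key."""
    table = defaultdict(dict)
    for pk, sk, count in items:
        table[pk][sk] = count
    return table


def query(table, pk):
    """Stand-in for one Query on a partition key; returns {sort_key: count}."""
    return dict(table.get(pk, {}))


# Layout 1: account items and IP totals live under different partition keys.
original = build_table([
    ("accountid-xxx", "ip-1.1.1.1", 1),
    ("accountid-xxx", "ip-1.1.1.2", 1),
    ("accountid-yyy", "ip-1.1.1.1", 1),
    ("ip-1.1.1.1", "total", 2),
    ("ip-1.1.1.2", "total", 1),
])

# Layout 2: IP totals are duplicated under each account partition key.
denormalized = build_table([
    ("accountid-xxx", "ip-1.1.1.1", 1),
    ("accountid-xxx", "ip-1.1.1.2", 1),
    ("accountid-xxx", "ip-1.1.1.1-total", 2),
    ("accountid-xxx", "ip-1.1.1.2-total", 1),
    ("accountid-yyy", "ip-1.1.1.1", 1),
    ("accountid-yyy", "ip-1.1.1.1-total", 2),
    ("ip-1.1.1.1", "total", 2),
    ("ip-1.1.1.2", "total", 1),
])

SUFFIX = "-total"


def stats_original(table, account_pk):
    """One query for the account, plus one query per IP for its total."""
    per_ip = query(table, account_pk)
    queries = 1
    totals = {}
    for ip_key in per_ip:
        totals[ip_key] = query(table, ip_key)["total"]
        queries += 1
    return per_ip, totals, queries


def stats_denormalized(table, account_pk):
    """A single query returns both the per-account counts and the IP totals."""
    rows = query(table, account_pk)
    per_ip = {sk: c for sk, c in rows.items() if not sk.endswith(SUFFIX)}
    totals = {sk[:-len(SUFFIX)]: c for sk, c in rows.items() if sk.endswith(SUFFIX)}
    return per_ip, totals, 1
```

Both functions return the same counts and totals for accountid-xxx; only the third element, the number of queries issued, differs.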

I could accomplish this with a GSI, but that would increase my DynamoDB cost.

I have been able to accomplish this using a FlatMap function.  However, this 
complicates my code and increases the number of tasks in my Flink Application.
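
As a rough illustration of the FlatMap-style fan-out (not the actual job code, and not Flink's API), the sketch below expands one aggregation result into several items before they would reach a sink. `LoginAggregate`, `Item`, `fan_out`, and `flat_map` are hypothetical names, and the shape of the aggregate (per-account count plus IP total in one record) is an assumption.

```python
from typing import Callable, Iterable, Iterator, NamedTuple


class LoginAggregate(NamedTuple):
    """Assumed shape of one aggregation result (hypothetical)."""
    account_id: str
    ip: str
    account_ip_count: int  # logins by this account from this IP
    ip_total: int          # logins by any account from this IP


class Item(NamedTuple):
    """One row to be written: partition key, sort key, count."""
    partition_key: str
    sort_key: str
    count: int


def fan_out(agg: LoginAggregate) -> Iterator[Item]:
    """Expand one aggregate into the three items of the denormalized layout."""
    account_pk = f"accountid-{agg.account_id}"
    ip_key = f"ip-{agg.ip}"
    yield Item(account_pk, ip_key, agg.account_ip_count)
    yield Item(account_pk, f"{ip_key}-total", agg.ip_total)  # duplicated total
    yield Item(ip_key, "total", agg.ip_total)


def flat_map(fn: Callable[[LoginAggregate], Iterable[Item]],
             elements: Iterable[LoginAggregate]) -> Iterator[Item]:
    """Plain-Python stand-in for a flat-map step: one element in, many out."""
    for element in elements:
        yield from fn(element)
```

Note that this sketch only writes the duplicated total for the account in the current aggregate; keeping the duplicates under other accounts fresh is outside its scope.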

The simplest and most cost-effective solution would be the ability to insert 
multiple DynamoDB items from a single aggregation.

 

 


> DynamoDB Sink Allow Multiple Item Writes
> ----------------------------------------
>
>                 Key: FLINK-31946
>                 URL: https://issues.apache.org/jira/browse/FLINK-31946
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / DynamoDB
>            Reporter: Curtis Jensen
>            Priority: Minor
>
> In some cases, it is desirable to be able to write aggregation data to 
> multiple partition keys.  This supports the case of denormalizing data to 
> facilitate more efficient read operations.
> However, the DynamoDBSink allows for only a single DynamoDB item to be 
> generated for each Flink Element.  This appears to be a limitation of the 
> ElementConverter more than DynamoDBSink.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
