Hi, Dian Fu. 

      I think John's requirement is like a cdc source that the source needs the 
ability to know which of datas should be deleted and then notify the framework, 
and that is why I recommendation John to use the UDTF.




And hi, John. 
      I'm not sure this doc [1] is enough. BTW, I think you can also try to 
customize a connector[2] to send `DELETE` RowData to downstream by java and use 
it in PyFlink SQL, and maybe it's more easy.




[1] 
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/python/table/udfs/python_udfs/#table-functions

[2] 
https://nightlies.apache.org/flink/flink-docs-master/zh/docs/dev/table/sourcessinks/#user-defined-sources--sinks




--

    Best!
    Xuyang




在 2022-06-09 08:53:36,"Dian Fu" <dian0511...@gmail.com> 写道:

Hi John,

If you are using Table API & SQL, the framework is handling the RowKind and 
it's transparent for you. So usually you don't need to handle RowKind in Table 
API & SQL.

Regards,
Dian


On Thu, Jun 9, 2022 at 6:56 AM John Tipper <john_tip...@hotmail.com> wrote:

Hi Xuyang,


Thank you very much, I’ll experiment tomorrow. Do you happen to know whether 
there is a Python example of udtf() with a RowKind being set (or whether it’s 
supported)?


Many thanks,


John


Sent from my iPhone

On 8 Jun 2022, at 16:41, Xuyang <xyzhong...@163.com> wrote:



Hi, John.
What about use udtf [1]?
In your UDTF, all resources are saved as a set or map as s1. When t=2 arrives, 
the new resources as s2 will be collected by crawl. I think what you want is 
the deletion data that means 's1' - 's2'.
So just use loop to find out the deletion data and send RowData in function 
'eval' in UDTF, and the RowData can be sent with a RowKind 'DELETE'[2]. The 
'DELETE' means tell the downstream that this value is deleted.

I will be glad if it can help you.

[1] 
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/udfs/#table-functions
[2] 
https://github.com/apache/flink/blob/44f73c496ed1514ea453615b77bee0486b8998db/flink-core/src/main/java/org/apache/flink/types/RowKind.java#L52







--

    Best!
    Xuyang




At 2022-06-08 20:06:17, "John Tipper" <john_tip...@hotmail.com> wrote:

Hi all,


I have some reference data that is periodically emitted by a crawler mechanism 
into an upstream Kinesis data stream, where those rows are used to populate a 
sink table (and where I am using Flink 1.13 PyFlink SQL within AWS Kinesis Data 
Analytics).  What is the best pattern to handle deletion of upstream data, such 
that the downstream table remains in sync with upstream?


For example, at t=1, rows R1, R2, R3 are processed from the stream, resulting 
in a DB with 3 rows.  At some point between t=1 and t=2, the resource 
corresponding to R2 was deleted, such that at t=2 when the next crawl was 
carried out only rows R1 and R2 were emitted into the upstream stream.  How 
should I process the stream of events so that when I have finished processing 
the events from t=2 my downstream table also has just rows R1 and R3?


Many thanks,


John

Reply via email to