Hi all, just wondering: what is the status at this point?
On Thu, Sep 19, 2019 at 4:38 PM Hequn Cheng <chenghe...@gmail.com> wrote:

Hi,

Fabian is totally right. Big thanks for the detailed answers and nice examples above.

As for the PR, I am very sorry about the delay. It is mainly due to the merge of Blink and my recent switch to working on Flink Python. However, I think a later version of Blink will cover this feature natively with further merges.

Until then, I think we can use the solution Fabian provided above.

There are some examples here [1][2] which may be helpful to you, @Casado @Maatary. In [1], the test case closely matches your scenario (a join performed after groupby + last_value). It also provides the UDAF you want and shows how to register it. In [2], the test shows how to use the built-in last_value in SQL. Note that the built-in last_value UDAF is only supported in the blink-planner from Flink 1.9.0 on. If you are using the flink-planner (or an earlier version), you can register the last_value UDAF with the TableEnvironment as shown in [1].

Feel free to ask if there are other problems.

Best, Hequn

[1] https://github.com/apache/flink/blob/master/flink-table/flink-table-planner-blink/src/test/scala/org/apache/flink/table/planner/plan/batch/sql/SubplanReuseTest.scala#L207
[2] https://github.com/apache/flink/blob/master/flink-table/flink-table-planner-blink/src/test/scala/org/apache/flink/table/planner/runtime/stream/sql/SplitAggregateITCase.scala#L228

On Thu, Sep 19, 2019 at 9:40 PM Casado Tejedor, Rubén <ruben.casado.teje...@accenture.com> wrote:

Thanks Fabian. @Hequn Cheng <chenghe...@gmail.com> Could you share the status? Thanks for your amazing work!

From: Fabian Hueske <fhue...@gmail.com>
Date: Friday, August 16, 2019, 9:30
To: "Casado Tejedor, Rubén" <ruben.casado.teje...@accenture.com>
CC: Maatary Okouya <maatarioko...@gmail.com>, miki haiat <miko5...@gmail.com>, user <user@flink.apache.org>, Hequn Cheng <chenghe...@gmail.com>
Subject: Re: [External] Re: From Kafka Stream to Flink

Hi Ruben,

Work on this feature has already started [1], but it stalled a bit (probably due to the effort of merging the new Blink query processor).

Hequn (in CC) is working on upsert table ingestion; maybe he can share what the status of this feature is.

Best, Fabian

[1] https://github.com/apache/flink/pull/6787

On Tue, Aug 13, 2019 at 11:05 AM Casado Tejedor, Rubén <ruben.casado.teje...@accenture.com> wrote:

Hi,

Do you have an expected version of Flink that will include the capability to ingest an upsert stream as a dynamic table? We have such a need in our current project. What we have done is to emulate that behavior at a low level with state (e.g. update the existing value if the key exists, create a new value if the key does not exist). But we cannot use SQL, which would help us do it faster.
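For illustration, a minimal sketch of that kind of state-based upsert in the DataStream API (a best guess only; it assumes a keyed stream of (key, value) string pairs, and the class and state names are made up for the example):

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

// Keeps only the latest value per key and re-emits it on every change,
// i.e. an upsert: update the stored value if the key exists, create it otherwise.
class UpsertLatest extends KeyedProcessFunction[String, (String, String), (String, String)] {

  private var latest: ValueState[String] = _

  override def open(parameters: Configuration): Unit = {
    latest = getRuntimeContext.getState(
      new ValueStateDescriptor[String]("latest-value", classOf[String]))
  }

  override def processElement(
      in: (String, String),
      ctx: KeyedProcessFunction[String, (String, String), (String, String)]#Context,
      out: Collector[(String, String)]): Unit = {
    // Overwrite (or create) the stored value for this key and emit the change.
    latest.update(in._2)
    out.collect(in)
  }
}

// Usage sketch: stream.keyBy(_._1).process(new UpsertLatest)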
Our use case is many small Flink jobs that have to do something like:

SELECT *some fields*
FROM *t1* INNER JOIN *t2 ON t1.id = t2.id* (maybe joining 3+ tables)
WHERE *some conditions on fields*;

We need the result of those queries to take into account only the latest values of each row. The result is inserted/updated in an in-memory K-V database for fast access.

Thanks in advance!

Best

From: Fabian Hueske <fhue...@gmail.com>
Date: Wednesday, August 7, 2019, 11:08
To: Maatary Okouya <maatarioko...@gmail.com>
CC: miki haiat <miko5...@gmail.com>, user <user@flink.apache.org>
Subject: [External] Re: From Kafka Stream to Flink

Hi,

LAST_VAL is not a built-in function, so you'd need to implement it as a user-defined aggregate function (UDAGG) and register it.

The problem with joining an append-only table with an updating table is the following.

Consider two tables: users (uid, name, zip) and orders (oid, uid, product), with users being an updating table and orders being append-only.

On January 1st, the tables look like this:

Users:
uid_1, Fred, 12345
uid_2, Mary, 67890

Orders:
oid_1, uid_1, Popcorn
oid_2, uid_2, Carrots

Joining both tables with the query SELECT oid, product, name, zip FROM users u, orders o WHERE u.uid = o.uid results in:

oid_1, Popcorn, Fred, 12345
oid_2, Carrots, Mary, 67890

Whenever a new order is appended, we look up the corresponding user data, perform the join, and emit the result.

Let's say that by July 1st we have received 100 orders from our two users and all is fine. However, on July 2nd Fred updates his ZIP code because he moved to another city. Our data now looks like this:

Users:
uid_1, Fred, 24680
uid_2, Mary, 67890

Orders:
oid_1, uid_1, Popcorn
oid_2, uid_2, Carrots
....
oid_100, uid_2, Potatoes

The result of the same query as before is:

oid_1, Popcorn, Fred, 24680
oid_2, Carrots, Mary, 67890
....
oid_100, Potatoes, Mary, 67890

Notice how the first row changed? If we strictly follow SQL semantics (which we do in Flink SQL), the query needs to update the ZIP code of the first result row. In order to do so, we need access to the original data of the orders table, which is the append-only table in our scenario. Consequently, we need to fully materialize append-only tables when they are joined with an updating table without a temporal constraint.
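Coming back to the LAST_VAL function mentioned at the top of this message: since it is not built in, a rough sketch of what such a UDAGG could look like (purely illustrative, simplified to String values and ignoring retraction, merging, and other types; all names are placeholders):

import org.apache.flink.table.functions.AggregateFunction

// Accumulator that simply remembers the most recently seen value for a group.
class LastValAcc {
  var last: String = _
}

// Best-guess LAST_VAL for String columns: keep whatever value arrived last.
class LastVal extends AggregateFunction[String, LastValAcc] {
  override def createAccumulator(): LastValAcc = new LastValAcc
  override def getValue(acc: LastValAcc): String = acc.last
  def accumulate(acc: LastValAcc, value: String): Unit = {
    acc.last = value
  }
}

// Once registered with the TableEnvironment, it can be used in SQL:
// tEnv.registerFunction("LAST_VAL", new LastVal)
// tEnv.sqlQuery("SELECT key, LAST_VAL(v1) FROM appendOnlyTable GROUP BY key")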
In many situations, the intended semantics for a query like the users/orders join above would be to join each order with the ZIP code of the user *that was valid at the time when the order was placed*.

However, this is *not* the semantics of the query in our example. For such a query, we need to model the data differently: the users table needs to store all modifications, i.e., the full history of all updates. Each update needs a timestamp, and each order needs a timestamp as well. With these timestamps, we can write a query that joins an order with the user data that was valid at the time when the order was placed.

This is the temporal constraint that I mentioned before. With this constraint, Flink can use the information about progressing time to reason about how much state it needs to keep, because a change of the users table will only affect future orders.

Flink makes this a lot easier with the concept of temporal tables [1] and temporal table joins [2].

Best,
Fabian

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/table/streaming/temporal_tables.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/table/streaming/joins.html#join-with-a-temporal-table
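To make that concrete, a rough sketch of such a temporal table join on the users/orders example (a best guess against the Flink 1.8-style Scala Table API; it assumes an existing StreamTableEnvironment tEnv, event-time attributes u_rowtime and o_rowtime, and already-registered UsersHistory and Orders tables, so the names are illustrative and [1]/[2] remain the authoritative reference):

import org.apache.flink.table.api.scala._

// Full history of user updates, including the event-time attribute.
val usersHistory = tEnv.sqlQuery(
  "SELECT uid, name, zip, u_rowtime FROM UsersHistory")

// Temporal table function: versioned by u_rowtime, keyed on uid.
val usersAsOf = usersHistory.createTemporalTableFunction('u_rowtime, 'uid)
tEnv.registerFunction("UsersAsOf", usersAsOf)

// Each order is joined with the user row that was valid when the order was placed.
val result = tEnv.sqlQuery(
  """
    |SELECT o.oid, o.product, u.name, u.zip
    |FROM Orders AS o,
    |     LATERAL TABLE (UsersAsOf(o.o_rowtime)) AS u
    |WHERE o.uid = u.uid
  """.stripMargin)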
On Tue, Aug 6, 2019 at 9:09 PM Maatary Okouya <maatarioko...@gmail.com> wrote:

Fabian,

Ultimately, I just want to perform a join on the last values for each key.

On Tue, Aug 6, 2019 at 8:07 PM Maatary Okouya <maatarioko...@gmail.com> wrote:

Fabian,

Could you please clarify the following statement:

"However joining an append-only table with this view without adding a temporal join condition means that the stream is fully materialized as state. This is because previously emitted results must be updated when the view changes. It really depends on the semantics of the join and query that you need, how much state the query will need to maintain."

I am not sure I understand the problem. If I have two append-only tables and perform some join on them, what's the issue?

On Tue, Aug 6, 2019 at 8:03 PM Maatary Okouya <maatarioko...@gmail.com> wrote:

Thank you for the clarification. Really appreciated.

Is LAST_VAL part of the API?

On Fri, Aug 2, 2019 at 10:49 AM Fabian Hueske <fhue...@gmail.com> wrote:

Hi,

Flink does not distinguish between streams and tables. For the Table API / SQL, there are only tables that are changing over time, i.e., dynamic tables.

A Stream in the Kafka Streams or KSQL sense is in Flink a Table with append-only changes, i.e., records are only inserted and never deleted or modified.

A Table in the Kafka Streams or KSQL sense is in Flink a Table that has upsert and delete changes, i.e., the table has a unique key and records are inserted, deleted, or updated per key.

In the current version, Flink does not have native support to ingest an upsert stream as a dynamic table (right now only append-only tables can be ingested; native support for upsert tables will be added soon). However, you can create a view that turns an append-only table into an upsert table with the following SQL query:

SELECT key, LAST_VAL(v1), LAST_VAL(v2), ...
FROM appendOnlyTable
GROUP BY key

Given this view, you can run all kinds of SQL queries on it. However, joining an append-only table with this view without adding a temporal join condition means that the stream is fully materialized as state. This is because previously emitted results must be updated when the view changes. How much state the query will need to maintain really depends on the semantics of the join and query that you need.

An alternative to using the Table API / SQL and its dynamic table abstraction is to use Flink's DataStream API and ProcessFunctions. These APIs are more low-level and expose access to state and timers, which are the core ingredients of stream processing. You can implement pretty much all the logic of KStreams, and more, with these APIs.

Best, Fabian

On Tue, Jul 23, 2019 at 1:06 PM Maatary Okouya <maatarioko...@gmail.com> wrote:

I would like to have a KTable, or maybe in Flink terms a dynamic table, that only contains the latest value for each keyed record. This would allow me to perform aggregations and joins based on the latest state of every record, as opposed to every record over time, or over a period of time.

On Sun, Jul 21, 2019 at 8:21 AM miki haiat <miko5...@gmail.com> wrote:

Can you elaborate more about your use case?

On Sat, Jul 20, 2019 at 1:04 AM Maatary Okouya <maatarioko...@gmail.com> wrote:

Hi,

I have been a user of Kafka Streams so far. However, I have been faced with several limitations, in particular when performing joins on KTables.

I was wondering what the approach in Flink is to achieve (1) the concept of a KTable, i.e. a table that represents a changelog, i.e. only the latest version of all keyed records, and (2) joining those.

There are currently a lot of limitations around that in Kafka Streams, and I need it for an ETL process where I need to mirror entire databases into Kafka and then join the tables to emit the logical entities into Kafka topics. I was hoping that somehow I could achieve that by using Flink as an intermediary.

I can see that you support all kinds of joins, but I just don't see the notion of a KTable.