Sorry, I forgot to paste the reference URLs.

Best,
Kurt
[1] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/streaming/joins.html#join-with-a-temporal-table-function
[2] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/streaming/joins.html#join-with-a-temporal-table

On Fri, Dec 13, 2019 at 4:37 PM Kurt Young <ykt...@gmail.com> wrote:

> Hi Krzysztof,
>
> What you raised is something we are also very interested in achieving in
> Flink. Unfortunately, there is no in-place solution in the Table/SQL API
> yet, but you have two options which both come close, though each needs
> some modification.
>
> 1. The first one is to use a temporal table function [1]. It requires you
> to write the logic of reading the Hive tables and doing the daily update
> inside the table function.
> 2. The second choice is to use a temporal table join [2], which only
> works with processing time for now (just like the simple solution you
> mentioned) and needs a table source with lookup capability (like HBase).
> Currently, the Hive connector doesn't support lookup, so to make this
> work you would need to sync the content to another storage system that
> supports lookup, such as HBase.
>
> Neither solution is ideal right now, and we also aim to improve this,
> possibly in a following release.
>
> Best,
> Kurt
>
>
> On Fri, Dec 13, 2019 at 1:44 AM Krzysztof Zarzycki <k.zarzy...@gmail.com>
> wrote:
>
>> Hello dear Flinkers,
>> If this kind of question was already asked on the list, I'm sorry for
>> the duplicate; feel free to just point me to the thread.
>> I have to solve a probably pretty common case of joining a datastream
>> to a dataset.
>> Let's say I have the following setup:
>> * I have a high-pace stream of events coming in from Kafka.
>> * I have some dimension tables stored in Hive. These tables are changed
>> daily. I can keep a snapshot for each day.
>>
>> Now, conceptually, I would like to join the stream of incoming events
>> to the dimension tables (a simple hash join).
>> We can consider two cases:
>> 1) The simpler one, where I join the stream with the most recent
>> version of the dictionaries. (So the result is accepted to be
>> nondeterministic if the job is retried.)
>> 2) The more advanced one, where I would like to do a temporal join of
>> the stream with the dictionary snapshots that were valid at the time of
>> the event. (This result should be deterministic.)
>>
>> The end goal is to do an aggregation of that joined stream and store
>> the results in Hive or in a more real-time analytical store (Druid).
>>
>> Now, could you please help me understand whether any of these cases is
>> implementable with the declarative Table/SQL API? With the use of
>> temporal joins, catalogs, Hive integration, JDBC connectors, or
>> whatever beta features there are now. (I've read quite a lot of the
>> Flink docs about each of those, but I have trouble compiling this
>> information into a final design.)
>> Could you please help me understand how these components should
>> cooperate?
>> If that is impossible with the Table API, can we come up with the
>> easiest implementation using the DataStream API?
>>
>> Thanks a lot for any help!
>> Krzysztof
>>
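
[Editor's note] As a rough illustration of Kurt's option 1, the temporal table function join described in [1]: the dimension history is registered as a temporal table function from the Table API, and the stream is joined against it via LATERAL TABLE. All table, function, and column names below (Orders, Rates, RatesHistory, o_proctime, r_currency, etc.) are placeholders, not from the thread; this is a sketch based on the linked documentation, not a tested job.

```sql
-- Sketch of a temporal table function join (option 1).
-- `Rates` would first be registered from the Table API, e.g. in Java:
--   TemporalTableFunction rates =
--       ratesHistory.createTemporalTableFunction("r_proctime", "r_currency");
--   tableEnv.registerFunction("Rates", rates);
-- Reading the daily Hive snapshots and feeding them into RatesHistory
-- is the user-written logic Kurt mentions.
SELECT
  o.order_id,
  o.amount * r.rate AS converted_amount
FROM Orders AS o,
  LATERAL TABLE (Rates(o.o_proctime)) AS r
WHERE o.currency = r.r_currency;
```

With a processing-time attribute, this resolves each event against the latest known version of the dimension rows, which matches Krzysztof's simpler, nondeterministic case 1.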
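
[Editor's note] Kurt's option 2, the processing-time temporal table join from [2], would look roughly as below. DimTable must be backed by a source with lookup capability (e.g. the HBase connector); since the Hive connector does not support lookup, the daily snapshots would first be synced into HBase, as Kurt suggests. The names (Events, DimTable, dim_key, proctime) are illustrative placeholders, and the exact syntax support depends on the Flink version and planner in use.

```sql
-- Sketch of a processing-time temporal table (lookup) join (option 2).
-- Each event is enriched with the dimension row visible at processing
-- time, so results are nondeterministic on retry (Krzysztof's case 1);
-- the event-time variant (case 2) was not yet supported at the time of
-- this thread.
SELECT
  e.event_id,
  e.dim_key,
  d.attribute
FROM Events AS e
JOIN DimTable FOR SYSTEM_TIME AS OF e.proctime AS d
  ON e.dim_key = d.dim_key;
```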