Sorry, I forgot to paste the reference URLs.

Best,
Kurt

[1]
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/streaming/joins.html#join-with-a-temporal-table-function
[2]
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/streaming/joins.html#join-with-a-temporal-table

On Fri, Dec 13, 2019 at 4:37 PM Kurt Young <ykt...@gmail.com> wrote:

> Hi Krzysztof,
>
> What you raised is something we are also very interested in achieving in
> Flink. Unfortunately, there is no in-place solution in the Table/SQL API
> yet, but you have two options which both come close and need some
> modifications.
>
> 1. The first is to use a temporal table function [1]. It requires you to
> write the logic of reading the Hive tables and performing the daily update
> inside the table function.
> 2. The second choice is to use a temporal table join [2], which currently
> only works with processing time (just like the simple solution you
> mentioned) and requires the table source to have lookup capability (like
> HBase). At the moment the Hive connector doesn't support lookup, so to
> make this work you would need to sync the content to another storage
> system that supports lookup, such as HBase.
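>
> For example, the join itself would look roughly like this (a sketch only;
> "DimTable" is a hypothetical temporal table function you would register
> from your Hive snapshot via the Table API, and the table and column names
> are made up):
>
> ```sql
> -- Join each event with the dictionary version valid at its event time.
> -- DimTable must be registered beforehand in the Table API, e.g. in Java:
> --   tEnv.registerFunction("DimTable",
> --       dimTable.createTemporalTableFunction("update_time", "dim_key"));
> SELECT e.event_id, e.dim_key, d.dim_value
> FROM events AS e,
>      LATERAL TABLE (DimTable(e.event_time)) AS d
> WHERE e.dim_key = d.dim_key;
> ```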
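>
> With a lookup-capable source, the processing-time join could be sketched
> like this (again only a sketch; "dim_hbase" stands for a table you would
> sync to HBase, and the column names are made up):
>
> ```sql
> -- Processing-time temporal table join against a lookup source.
> -- e.proc_time is a processing-time attribute declared on the event stream.
> SELECT e.event_id, e.dim_key, d.dim_value
> FROM events AS e
> JOIN dim_hbase FOR SYSTEM_TIME AS OF e.proc_time AS d
>   ON e.dim_key = d.dim_key;
> ```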
>
> Neither solution is ideal right now, and we aim to improve this, possibly
> in the next release.
>
> Best,
> Kurt
>
>
> On Fri, Dec 13, 2019 at 1:44 AM Krzysztof Zarzycki <k.zarzy...@gmail.com>
> wrote:
>
>> Hello dear Flinkers,
>> If this kind of question has already been asked on the list, I'm sorry
>> for the duplicate; feel free to just point me to the thread.
>> I have to solve a probably pretty common case of joining a datastream to
>> a dataset.
>> Let's say I have the following setup:
>> * I have a high pace stream of events coming in Kafka.
>> * I have some dimension tables stored in Hive. These tables are changed
>> daily. I can keep a snapshot for each day.
>>
>> Now conceptually, I would like to join the stream of incoming events to
>> the dimension tables (a simple hash join). We can consider two cases:
>> 1) the simpler one, where I join the stream with the most recent version
>> of the dictionaries (so the result is accepted to be nondeterministic if
>> the job is retried);
>> 2) the more advanced one, where I would like to do a temporal join of the
>> stream with the dictionary snapshots that were valid at the time of the
>> event (this result should be deterministic).
>>
>> The end goal is to aggregate that joined stream and store the results in
>> Hive or a more real-time analytical store (Druid).
>>
>> Now, could you please help me understand whether either of these cases is
>> implementable with the declarative Table/SQL API? With the use of temporal
>> joins, catalogs, Hive integration, JDBC connectors, or whatever beta
>> features exist now. (I've read quite a lot of the Flink docs about each of
>> those, but I have trouble compiling this information into a final design.)
>> Could you please help me understand how these components should
>> cooperate?
>> If that is impossible with the Table API, can we come up with the easiest
>> implementation using the DataStream API?
>>
>> Thanks a lot for any help!
>> Krzysztof
>>
>
