Thanks for all the ideas!!

From: Steven Wu [mailto:stevenz...@gmail.com]
Sent: Tuesday, February 06, 2018 3:46 AM
To: Stefan Richter <s.rich...@data-artisans.com>
Cc: Marchant, Hayden [ICG-IT] <hm97...@imceu.eu.ssmb.com>; user@flink.apache.org; Aljoscha Krettek <aljos...@apache.org>
Subject: Re: Joining data in Streaming
There is also a discussion of side inputs: https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API

I would load the smaller data set as a static reference data set. Then you can just do single-source streaming of the larger data set.

On Wed, Jan 31, 2018 at 1:09 AM, Stefan Richter <s.rich...@data-artisans.com> wrote:

Hi,

if the workarounds that Xingcan and I mentioned are no options for your use case, then I think this might currently be the better option. But I would expect some better support for stream joins in the near future.

Best,
Stefan

> On 31.01.2018, at 07:04, Marchant, Hayden <hayden.march...@citi.com> wrote:
>
> Stefan,
>
> So are we essentially saying that in this case, for now, I should stick to the DataSet / Batch Table API?
>
> Thanks,
> Hayden
>
> -----Original Message-----
> From: Stefan Richter [mailto:s.rich...@data-artisans.com]
> Sent: Tuesday, January 30, 2018 4:18 PM
> To: Marchant, Hayden [ICG-IT] <hm97...@imceu.eu.ssmb.com>
> Cc: user@flink.apache.org; Aljoscha Krettek <aljos...@apache.org>
> Subject: Re: Joining data in Streaming
>
> Hi,
>
> as far as I know, this is not easily possible. What would be required is something like a CoFlatMap function, where one input stream blocks until the second stream is fully consumed to build up the state to join against. Maybe Aljoscha (in CC) can comment on future plans to support this.
>
> Best,
> Stefan
>
>> On 30.01.2018, at 12:42, Marchant, Hayden <hayden.march...@citi.com> wrote:
>>
>> We have a use case with two data sets - one reasonably large data set (a few million entities), and a smaller set of data. We want to do a join between these data sets, and we will only do the join after both data sets are available. In the world of batch processing this is pretty straightforward - we'd load both data sets into an application and execute a join operator on them through a common key. Is it possible to do such a join using the DataStream API? I would assume I'd use the connect operator, though I'm not sure exactly how I should do the join - do I need the 'smaller' set to be completely loaded into state before I start flowing the large set? My concern is that if I read both data sets from streaming sources, since I can't be guaranteed of the order in which the data is loaded, I may lose lots of potential joined entities since their pairs might not have been read yet.
>>
>> Thanks,
>> Hayden Marchant
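To make the connect-based idea from this thread a bit more concrete, here is a minimal sketch: the smaller (reference) stream is absorbed into keyed state, and large-stream records that arrive before their reference record are buffered rather than dropped, which addresses the ordering concern. The types RefRecord, LargeRecord and JoinedRecord, and their key accessors, are placeholders for illustration, not anything taken from the thread.

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

// Joins a small "reference" stream against a large stream on a common key.
// Input 1 = reference (small) side, input 2 = large side.
public class ReferenceJoin
        extends RichCoFlatMapFunction<RefRecord, LargeRecord, JoinedRecord> {

    private transient ValueState<RefRecord> refState;       // reference record for the current key
    private transient ListState<LargeRecord> pendingState;  // large-side records seen before the reference

    @Override
    public void open(Configuration parameters) {
        refState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("ref", RefRecord.class));
        pendingState = getRuntimeContext().getListState(
                new ListStateDescriptor<>("pending", LargeRecord.class));
    }

    @Override
    public void flatMap1(RefRecord ref, Collector<JoinedRecord> out) throws Exception {
        refState.update(ref);
        // Flush any large-side records that were waiting for this key.
        for (LargeRecord pending : pendingState.get()) {
            out.collect(new JoinedRecord(pending, ref));
        }
        pendingState.clear();
    }

    @Override
    public void flatMap2(LargeRecord rec, Collector<JoinedRecord> out) throws Exception {
        RefRecord ref = refState.value();
        if (ref != null) {
            out.collect(new JoinedRecord(rec, ref));
        } else {
            // Reference side not loaded yet for this key -- buffer instead of dropping.
            pendingState.add(rec);
        }
    }
}

Wired up along these lines (key selectors again placeholders):

smallStream.keyBy(r -> r.getKey())
           .connect(largeStream.keyBy(r -> r.getKey()))
           .flatMap(new ReferenceJoin())
           .print();

The trade-off is that buffered large-side records are only released once the matching reference record arrives, so for keys that never match you would still want some cleanup, e.g. a timer in a CoProcessFunction instead of the plain CoFlatMap.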