Thanks for all the ideas!!

From: Steven Wu [mailto:stevenz...@gmail.com]
Sent: Tuesday, February 06, 2018 3:46 AM
To: Stefan Richter <s.rich...@data-artisans.com>
Cc: Marchant, Hayden [ICG-IT] <hm97...@imceu.eu.ssmb.com>; user@flink.apache.org; Aljoscha Krettek <aljos...@apache.org>
Subject: Re: Joining data in Streaming
There is also a discussion of side inputs: https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API

I would load the smaller data set as a static reference data set. Then you can just do single-source streaming of the larger data set.

On Wed, Jan 31, 2018 at 1:09 AM, Stefan Richter <s.rich...@data-artisans.com> wrote:

Hi,

if the workarounds that Xingcan and I mentioned are no options for your use case, then I think this might currently be the better option. But I would expect some better support for stream joins in the near future.

Best,
Stefan

> On 31.01.2018, at 07:04, Marchant, Hayden <hayden.march...@citi.com> wrote:
>
> Stefan,
>
> So are we essentially saying that in this case, for now, I should stick to the DataSet / Batch Table API?
>
> Thanks,
> Hayden
>
> -----Original Message-----
> From: Stefan Richter [mailto:s.rich...@data-artisans.com]
> Sent: Tuesday, January 30, 2018 4:18 PM
> To: Marchant, Hayden [ICG-IT] <hm97...@imceu.eu.ssmb.com>
> Cc: user@flink.apache.org; Aljoscha Krettek <aljos...@apache.org>
> Subject: Re: Joining data in Streaming
>
> Hi,
>
> as far as I know, this is not easily possible. What would be required is something like a CoFlatMap function, where one input stream blocks until the second stream is fully consumed to build up the state to join against. Maybe Aljoscha (in CC) can comment on future plans to support this.
>
> Best,
> Stefan
>
>> On 30.01.2018, at 12:42, Marchant, Hayden <hayden.march...@citi.com> wrote:
>>
>> We have a use case with two data sets - one reasonably large data set (a few million entities), and a smaller set of data. We want to do a join between these data sets, and we will only do the join after both data sets are available. In the world of batch processing this is pretty straightforward - we'd load both data sets into an application and execute a join operator on them through a common key. Is it possible to do such a join using the DataStream API? I would assume I'd use the connect operator, though I'm not sure exactly how I should do the join - do I need the 'smaller' set to be completely loaded into state before I start flowing the large set? My concern is that if I read both data sets from streaming sources, since I can't be guaranteed of the order in which the data is loaded, I may lose lots of potential joined entities since their pairs might not have been read yet.
>>
>> Thanks,
>> Hayden Marchant
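To make the connect-based idea from this thread a bit more concrete, here is a minimal sketch: the smaller (reference) stream is absorbed into keyed state, and large-stream records that arrive before their reference record are buffered rather than dropped, which addresses the ordering concern. The types RefRecord, LargeRecord and JoinedRecord, and their key accessors, are placeholders for illustration, not anything taken from the thread.

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

// Joins a small "reference" stream against a large stream on a common key.
// Input 1 = reference (small) side, input 2 = large side.
public class ReferenceJoin
        extends RichCoFlatMapFunction<RefRecord, LargeRecord, JoinedRecord> {

    private transient ValueState<RefRecord> refState;       // reference record for the current key
    private transient ListState<LargeRecord> pendingState;  // large-side records seen before the reference

    @Override
    public void open(Configuration parameters) {
        refState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("ref", RefRecord.class));
        pendingState = getRuntimeContext().getListState(
                new ListStateDescriptor<>("pending", LargeRecord.class));
    }

    @Override
    public void flatMap1(RefRecord ref, Collector<JoinedRecord> out) throws Exception {
        refState.update(ref);
        // Flush any large-side records that were waiting for this key.
        for (LargeRecord pending : pendingState.get()) {
            out.collect(new JoinedRecord(pending, ref));
        }
        pendingState.clear();
    }

    @Override
    public void flatMap2(LargeRecord rec, Collector<JoinedRecord> out) throws Exception {
        RefRecord ref = refState.value();
        if (ref != null) {
            out.collect(new JoinedRecord(rec, ref));
        } else {
            // Reference side not loaded yet for this key -- buffer instead of dropping.
            pendingState.add(rec);
        }
    }
}

Wired up along these lines (key selectors again placeholders):

smallStream.keyBy(r -> r.getKey())
           .connect(largeStream.keyBy(r -> r.getKey()))
           .flatMap(new ReferenceJoin())
           .print();

The trade-off is that buffered large-side records are only released once the matching reference record arrives, so for keys that never match you would still want some cleanup, e.g. a timer in a CoProcessFunction instead of the plain CoFlatMap.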