In Flink 0.9.1, keyBy() is called groupBy(). We reworked the DataStream API between 0.9 and 0.10, which is why the method had to be renamed.
On Wed, Nov 11, 2015 at 9:37 AM, Stephan Ewen <se...@apache.org> wrote:

> I would encourage you to use the 0.10 version of Flink. Streaming has made
> some major improvements there.
>
> The release is voted on now, you can refer to these repositories for the
> release candidate code:
>
> http://people.apache.org/~mxm/flink-0.10.0-rc8/
> https://repository.apache.org/content/repositories/orgapacheflink-1055/
>
> Greetings,
> Stephan
>
> On Wed, Nov 11, 2015 at 2:07 AM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>
>> Brilliant Fabian, thanks a lot! This looks exactly like what I'm after.
>> One thing: the DataStream API I'm using (0.9.1) does not have a keyBy()
>> method. Presumably this is from a newer API?
>>
>> On Tue, Nov 10, 2015 at 1:11 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>>
>>> Hi Nick,
>>>
>>> I think you can do this with Flink quite similarly to how it is
>>> explained in the Samza documentation, by using a stateful
>>> CoFlatMapFunction [1], [2].
>>>
>>> Please have a look at this snippet [3]. The code implements an
>>> updateable stream filter: the first stream is filtered by words from the
>>> second stream, and the filter operator adds or removes filter words as
>>> they are received on that second stream. Both flows are partitioned by
>>> the filter word (i.e., the join key), such that each parallel task
>>> instance is responsible for only a subset of the filter words.
>>>
>>> Please let me know if you have questions.
>>>
>>> Best,
>>> Fabian
>>>
>>> [1] https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#transformations
>>> [2] https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#checkpointing-local-variables
>>> [3] https://gist.github.com/fhueske/4ea5422edb5820915fa4
>>>
>>> 2015-11-10 19:02 GMT+01:00 Nick Dimiduk <ndimi...@gmail.com>:
>>>
>>>> Hello,
>>>>
>>>> I'm interested in implementing a table/stream join, very similar to
>>>> what is described in the "Table-stream join" section of the Samza
>>>> key-value state documentation [0]. Conceptually, this would be an
>>>> extension of the example provided in the javadocs for RichFunction#open
>>>> [1], where I have a dataset of searchStrings instead of a single one.
>>>> As per the Samza explanation, I would like to receive updates to this
>>>> dataset via an operation log (a la a Kafka topic), so as to update my
>>>> local state while the streaming job runs.
>>>>
>>>> Perhaps you can further advise on a parallelization strategy for this
>>>> operation. It seems to me that I'd want to partition the searchString
>>>> database across multiple parallel units and broadcast my input
>>>> datastream to all of those units. The idea is to maximize throughput on
>>>> the available hardware, though I would expect there to be a limit at
>>>> which the network plane becomes a bottleneck for the broadcast.
>>>>
>>>> Is there an example of how I might implement this in Flink Streaming?
>>>> I thought perhaps the DataStream#cross transformation would work, but I
>>>> haven't worked out how to use it for my purpose. Thus far, I'm using
>>>> the Java API.
>>>>
>>>> Thanks a lot!
>>>> -n
>>>>
>>>> [0]: http://samza.apache.org/learn/documentation/0.9/container/state-management.html
>>>> [1]: https://ci.apache.org/projects/flink/flink-docs-release-0.9/api/java/org/apache/flink/api/common/functions/RichFunction.html#open(org.apache.flink.configuration.Configuration)
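[Editor's note: Fabian's gist [3] is linked above rather than reproduced here. As a rough, stdlib-only sketch of the idea it describes, the two callbacks of a stateful CoFlatMapFunction could maintain a per-task set of filter words like this. The "+word"/"-word" control-message format and the class and method names are made up for illustration; the actual gist may differ.]

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the two callbacks a stateful CoFlatMapFunction would implement.
// onData mirrors flatMap1 (the data stream); onControl mirrors flatMap2
// (the filter-update stream). In Flink the word set would be checkpointed
// operator state.
class UpdateableWordFilter {
    // Per-task filter state: the subset of filter words this task owns.
    private final Set<String> words = new HashSet<>();

    // Pass a record through only if it matches a current filter word.
    List<String> onData(String value) {
        List<String> out = new ArrayList<>();
        if (words.contains(value)) {
            out.add(value);
        }
        return out;
    }

    // Apply an update from the control stream: "+w" adds w, "-w" removes w.
    void onControl(String update) {
        if (update.startsWith("+")) {
            words.add(update.substring(1));
        } else if (update.startsWith("-")) {
            words.remove(update.substring(1));
        }
    }
}
```

In the real job, both streams would be keyed by the word before being connected, so a word's records and its updates meet in the same task instance.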
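[Editor's note: on Nick's parallelization question, the point of Fabian's suggestion is that keying both streams by the filter word avoids broadcasting entirely: the same key-based routing sends a word's data records and that word's updates to the same parallel task. A toy illustration of that routing invariant follows; the modulo scheme is hypothetical, and Flink's actual key-group assignment differs.]

```java
// Toy hash router: any fixed function of the key alone guarantees that
// records and updates for the same word land on the same task index.
class KeyRouter {
    static int taskFor(String key, int parallelism) {
        // floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), parallelism);
    }
}
```

Because routing depends only on the key, no task needs the full searchString set, and the input stream is never replicated across the network.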