Brilliant Fabian, thanks a lot! This looks exactly like what I'm after. One thing: the DataStream API I'm using (0.9.1) does not have a keyBy() method. Presumably this is from a newer API?
On Tue, Nov 10, 2015 at 1:11 PM, Fabian Hueske <fhue...@gmail.com> wrote:

> Hi Nick,
>
> I think you can do this with Flink quite similarly to how it is explained
> in the Samza documentation, by using a stateful CoFlatMapFunction [1], [2].
>
> Please have a look at this snippet [3].
> This code implements an updateable stream filter. The first stream is
> filtered by words from the second stream. The filter operator adds or
> removes words to/from the filter as they are received from the second
> stream. Both flows are partitioned by the filter word (or join key) such
> that each parallel task instance is responsible for only a subset of the
> filter words.
>
> Please let me know if you have questions.
>
> Best,
> Fabian
>
> [1] https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#transformations
> [2] https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#checkpointing-local-variables
> [3] https://gist.github.com/fhueske/4ea5422edb5820915fa4
>
> <https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#using-the-keyvalue-state-interface>
>
> 2015-11-10 19:02 GMT+01:00 Nick Dimiduk <ndimi...@gmail.com>:
>
>> Hello,
>>
>> I'm interested in implementing a table/stream join, very similar to what
>> is described in the "Table-stream join" section of the Samza key-value
>> state documentation [0]. Conceptually, this would be an extension of the
>> example provided in the javadocs for RichFunction#open [1], where I have
>> a dataset of searchStrings instead of a single one. As in the Samza
>> explanation, I would like to receive updates to this dataset via an
>> operation log (a la a Kafka topic), so that I can update my local state
>> while the streaming job runs.
>>
>> Perhaps you can further advise on a parallelization strategy for this
>> operation. It seems to me that I'd want to partition the searchString
>> database across multiple parallel units and broadcast my input
>> datastream to all of those units. The idea is to maximize throughput on
>> the available hardware, though I would expect there to be a limit at
>> which the network plane becomes a bottleneck for the broadcast.
>>
>> Is there an example of how I might implement this in Flink Streaming? I
>> thought perhaps the DataStream#cross transformation would work, but I
>> haven't worked out how to use it for my purpose. Thus far, I'm using the
>> Java API.
>>
>> Thanks a lot!
>> -n
>>
>> [0]: http://samza.apache.org/learn/documentation/0.9/container/state-management.html
>> [1]: https://ci.apache.org/projects/flink/flink-docs-release-0.9/api/java/org/apache/flink/api/common/functions/RichFunction.html#open(org.apache.flink.configuration.Configuration)
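For anyone reading the thread later: the updateable-filter pattern Fabian describes can be sketched without any Flink dependency. This is just the core logic of one parallel task instance, with onRecord/onControl standing in for CoFlatMapFunction's flatMap1/flatMap2; all names here are illustrative, and the real Flink version lives in the gist [3].

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UpdateableFilterSketch {

    /**
     * Models one parallel instance of the stateful co-flatmap operator.
     * In Flink, both input streams would be partitioned by the word so
     * that a given word's records and its ADD/REMOVE control messages
     * always reach the same instance.
     */
    static class FilterTask {
        private final Set<String> filterWords = new HashSet<>();

        /** Analogous to flatMap1: emit a record only if its word is in the filter. */
        List<String> onRecord(String word) {
            List<String> out = new ArrayList<>();
            if (filterWords.contains(word)) {
                out.add(word);
            }
            return out;
        }

        /** Analogous to flatMap2: update the filter from the control stream. */
        void onControl(String command, String word) {
            if ("ADD".equals(command)) {
                filterWords.add(word);
            } else if ("REMOVE".equals(command)) {
                filterWords.remove(word);
            }
        }
    }

    public static void main(String[] args) {
        FilterTask task = new FilterTask();

        task.onControl("ADD", "flink");
        System.out.println(task.onRecord("flink")); // word is in the filter
        System.out.println(task.onRecord("samza")); // word is not in the filter
        task.onControl("REMOVE", "flink");
        System.out.println(task.onRecord("flink")); // word was removed again
    }
}
```

Because the state is partitioned by word rather than replicated, this scales out the same way the gist does: each instance only ever sees, and stores, its own subset of the filter words.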