Thank you, Stephan, for the information!

On Mon, Dec 14, 2015 at 5:20 AM Stephan Ewen <se...@apache.org> wrote:
> Hi Alex!
>
> Right now, Flink would not reuse Kafka's partitioning for joins, but shuffle/partition data by itself. Flink is very fast at shuffling and adds very little latency on shuffles, so that is usually not an issue. The reason for that design is that we view a streaming program as something dynamic: Kafka partitions may be added or removed during a program's lifetime, and the parallelism (and with that the partitioning scheme) can change as well. With Flink handling the partitioning internally, these cases are all covered.
>
> Concerning the join: the built-in join is definitely able to handle millions of records in a window, and scales well. What it does is window the streams together and join within the windows. If you want responses within a second, you should make the window small enough that it is evaluated every 500ms or so.
>
> If you want super-low-latency joins, you can look into using custom state to do that. With that, you could build your own symmetric hash join, for example. That has virtually zero latency, and you can control how long each side keeps the data.
>
> Concerning key lookups vs. shuffling: the shuffle variant is usually much faster, because it uses the network better. The shuffle is fully pipelined, many records are in the shuffle at the same time, and it is optimized for throughput while still keeping latency quite low (http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/).
> In contrast, key lookups that avoid shuffling usually take a bit of time (a millisecond or so) and severely limit throughput, because they involve many smaller messages and add even more latency (a round trip between nodes, rather than one way).
>
> Hope that this answers your question, let me know if you have more questions!
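[Editor's note: the symmetric hash join Stephan mentions can be sketched in a few lines of Python. This is purely illustrative of the algorithm, not Flink code; in Flink you would build it on custom keyed state, and the class and method names here are invented for the example. Each record is stored in a hash table for its own side and immediately probed against the other side's table, so matches are emitted with essentially zero latency.]

```python
from collections import defaultdict

class SymmetricHashJoin:
    """Illustrative symmetric hash join: one hash table per input side,
    keyed by the join key. Every arriving record is (1) inserted into
    its own side's table and (2) probed against the other side's table,
    emitting any matches immediately."""

    def __init__(self):
        self.left = defaultdict(list)   # join key -> records from the left stream
        self.right = defaultdict(list)  # join key -> records from the right stream

    def on_left(self, key, value):
        self.left[key].append(value)
        # probe the right side's table and emit every match right away
        return [(key, value, r) for r in self.right[key]]

    def on_right(self, key, value):
        self.right[key].append(value)
        # probe the left side's table and emit every match right away
        return [(key, l, value) for l in self.left[key]]

join = SymmetricHashJoin()
join.on_left("user1", "click")                      # no match yet -> []
matches = join.on_right("user1", "purchase")        # -> [("user1", "click", "purchase")]
```

A production version would also evict old entries from each table (this is the "how long each side keeps the data" knob Stephan refers to), e.g. via per-key timers, which the sketch omits for brevity.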
>
> Greetings,
> Stephan
>
>
> On Fri, Dec 11, 2015 at 4:00 PM, Alex Rovner <alex.rov...@magnetic.com> wrote:
>
>> Hello all,
>>
>> I was wondering if someone would be kind enough to enlighten me on a few topics. We are trying to join two streams of data on a key. We were thinking of partitioning the topics in Kafka by the key; however, I also saw that Flink is able to partition on its own, and I was wondering whether Flink can take advantage of Kafka's partitioning, and/or which partitioning scheme I should go for.
>>
>> As far as joins go, our two datasets are very large (millions of records in each window) and we need to perform the joins very quickly (in less than 1 sec). Would the built-in join mechanism be sufficient for this? From what I understand, it would need to shuffle data and thus be slow on large datasets? I was wondering if there is a way to join via keyed state lookups to avoid the shuffling.
>>
>> I have read the docs and the blogs so far, so I have some limited understanding of how Flink works, though no practical experience yet.
>>
>> Thanks
>>
>> *Alex Rovner*
>> *Director, Data Engineering*
>> *o:* 646.759.0052
>> *<http://www.magnetic.com/>*