Re: Kafka KeyedStream source

2017-01-18 Thread Fabian Hueske
Hi Niels, I was more talking from a theoretical point of view. Flink does not have a hook to inject a custom hash function (yet). I'm not familiar enough with the details of the implementation to assess whether this would be possible or how much work it would be. However, several users have a

Re: Kafka KeyedStream source

2017-01-18 Thread Niels Basjes
Hi, > However, if you would like to keyBy the original key attribute, Flink would need to have access to the hash function that was used to assign events to partitions. So if my producing application and my consuming application use the same source attributes AND the same hashing function to dete
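To make that condition concrete: on the producer side the partition is a pure function of the serialized key and the partition count (Kafka's default partitioner uses a murmur2 hash of the key bytes), while Flink's keyBy applies its own, unrelated hash. Below is a minimal sketch of that producer-side idea, with a placeholder hash standing in for the real partitioner:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class PartitionAssignmentSketch {

        // Simplified stand-in for a producer-side partitioner: the target partition is
        // a pure function of the serialized key and the partition count. Kafka's real
        // default partitioner uses murmur2; Arrays.hashCode is only a placeholder here.
        static int assignPartition(String sessionId, int numPartitions) {
            byte[] keyBytes = sessionId.getBytes(StandardCharsets.UTF_8);
            int hash = Arrays.hashCode(keyBytes);
            return (hash & Integer.MAX_VALUE) % numPartitions;
        }

        public static void main(String[] args) {
            // Every event of one session deterministically lands in the same partition --
            // but Flink's keyBy neither knows about nor uses this function.
            System.out.println(assignPartition("session-42", 10));
        }
    }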

Re: Kafka KeyedStream source

2017-01-16 Thread Fabian Hueske
Hi Niels, I think the biggest problem for keyed sources is that Flink must be able to co-locate key-partitioned state with the pre-partitioned data. This might work if the key is the partition ID, i.e., not the original key attribute that was hashed to assign events to partitions. Flink could need
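A rough sketch of that idea with the existing API, assuming each record already carries the id of the Kafka partition it was read from: keying on the partition id scopes keyed state per partition, though the keyBy itself still goes through Flink's own hash shuffle rather than reusing Kafka's placement of the data.

    import org.apache.flink.api.java.tuple.Tuple;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.KeyedStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class KeyByPartitionIdSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Stand-in for a Kafka source whose records carry their partition id:
            // (partitionId, payload) pairs.
            DataStream<Tuple2<Integer, String>> records = env.fromElements(
                Tuple2.of(0, "click-a"),
                Tuple2.of(1, "click-b"),
                Tuple2.of(0, "click-c"));

            // Keying on the partition id means all keyed state for one Kafka partition
            // lives on one subtask -- but keyBy still routes the records through Flink's
            // own hash partitioning, i.e. it does not reuse Kafka's placement.
            KeyedStream<Tuple2<Integer, String>, Tuple> byPartition = records.keyBy(0);

            byPartition.print();
            env.execute("keyBy on Kafka partition id (sketch)");
        }
    }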

Re: Kafka KeyedStream source

2017-01-15 Thread Tzu-Li (Gordon) Tai
Hi Niels, If it’s only for simple data filtering that does not depend on the key, a simple “flatMap” or “filter” directly after the source can be chained to the source instances. What that does is that the filter processing will be done within the same thread as the one fetching data from a Kaf
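A minimal sketch of that pattern, assuming a Kafka 0.9 consumer (as shipped with Flink at the time), a local broker, and a hypothetical "clicks" topic and group id; since operator chaining is on by default, the filter below runs in the same task, and thus the same thread, as the Kafka source:

    import java.util.Properties;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

    public class ChainedFilterJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092"); // assumption: local broker
            props.setProperty("group.id", "click-filter");            // hypothetical group id

            DataStream<String> clicks = env
                .addSource(new FlinkKafkaConsumer09<>("clicks", new SimpleStringSchema(), props))
                // Chained to the source by default: the filtering happens in the same
                // thread that fetches from Kafka, with no network shuffle in between.
                .filter(line -> !line.isEmpty());

            clicks.print();
            env.execute("chained filter after Kafka source");
        }
    }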

Re: Kafka KeyedStream source

2017-01-11 Thread Niels Basjes
Hi, Ok. I think I get it. WHAT IF: Assume we create an addKeyedSource(...) which will allow us to add a source that makes some guarantees about the data. And assume this source returns simply the Kafka partition id as the result of this 'hash' function. Then if I have 10 Kafka partitions I would r
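Purely to visualize the proposal (none of this exists in Flink, and the fragment is not runnable): the source would hand Flink a key per record, here simply the Kafka partition id, together with the guarantee that all records for one key come from a single source subtask, so no re-partitioning would be needed.

    // Hypothetical sketch of the addKeyedSource(...) idea from this thread -- the
    // method, ClickEvent, clickEventSchema and getKafkaPartition() are all made up
    // for illustration and do NOT exist in Flink's API.
    KeyedStream<ClickEvent, Integer> clicks = env.addKeyedSource(
        new FlinkKafkaConsumer09<>("clicks", clickEventSchema, props),
        record -> record.getKafkaPartition());   // the "hash" is just the partition id

    // With 10 Kafka partitions, the source would run with parallelism 10,
    // one source subtask per partition, and the stream would already be keyed.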

Re: Kafka KeyedStream source

2017-01-09 Thread Tzu-Li (Gordon) Tai
Hi Niels, Thank you for bringing this up. I recall there was some previous discussion related to this before: [1]. I don’t think this is possible at the moment, mainly because of how the API is designed. On the other hand, a KeyedStream in Flink is basically just a DataStream with a hash part
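For context, a small example of what that means in the DataStream API: keyBy hash-partitions a plain DataStream on an extracted key and returns a KeyedStream. The (sessionId, url) records below are illustrative stand-ins for the clickstream from the scenario.

    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.KeyedStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class KeyedStreamExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Illustrative (sessionId, url) click events; in the real job these would
            // come from the Kafka source.
            DataStream<Tuple2<String, String>> clicks = env.fromElements(
                Tuple2.of("session-1", "/home"),
                Tuple2.of("session-2", "/cart"),
                Tuple2.of("session-1", "/checkout"));

            // keyBy turns the DataStream into a KeyedStream by hash-partitioning it on
            // the session id: all records with the same key go to the same subtask,
            // regardless of which Kafka partition they were read from.
            KeyedStream<Tuple2<String, String>, String> bySession =
                clicks.keyBy(new KeySelector<Tuple2<String, String>, String>() {
                    @Override
                    public String getKey(Tuple2<String, String> click) {
                        return click.f0;
                    }
                });

            bySession.print();
            env.execute("KeyedStream = DataStream + hash partitioning");
        }
    }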

Kafka KeyedStream source

2017-01-05 Thread Niels Basjes
Hi, In my scenario I have clickstream data that I persist in Kafka. I use the sessionId as the key to instruct Kafka to put everything with the same sessionId into the same Kafka partition. That way I already have all events of a visitor in a single Kafka partition in a fixed order. When I read
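A minimal producer-side sketch of that setup, assuming a local broker, a hypothetical "clicks" topic, and string-encoded events; with the sessionId as the record key, Kafka's default partitioner sends all events of one session to the same partition, preserving their order.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ClickEventProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumption: local broker
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                String sessionId = "session-42";                 // hypothetical session id
                String clickEventJson = "{\"url\": \"/home\"}";  // hypothetical payload

                // sessionId is the record key: Kafka's default partitioner hashes it,
                // so all events of one session end up in the same partition, in order.
                producer.send(new ProducerRecord<>("clicks", sessionId, clickEventJson));
            }
        }
    }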