Hello Flink Users,

Any views on this question of mine?
Thanks,
Hemant

On Tue, May 12, 2020 at 7:00 PM hemant singh <hemant2...@gmail.com> wrote:

> Hello Roman,
>
> Thanks for your response.
>
>> I think the partitioning you described (event type + protocol type) is
>> subject to data skew. Including a device ID should solve this problem.
>> Also, including "protocol_type" in the key and having a topic per
>> protocol_type seems redundant.
>
> Each protocol is in a single topic, and event_type is the key used to
> distribute data to a specific partition.
>
>> Furthermore, do you have any particular reason to maintain multiple
>> topics?
>> I could imagine protocols have different speeds or other
>> characteristics, so you can tune Flink accordingly.
>> Otherwise, having a single topic partitioned only by device ID would
>> simplify deployment and reduce data skew.
>
> Yes, you are right. These protocols have separate characteristics, like
> speed and data format. If I have only one topic, with data partitioned by
> device_id, it could be that events from a faster protocol are processed
> faster, and the joins I want to do will not have enough matching data.
> I have a question here: how would you tune Flink to handle different
> stream characteristics, such as speed, given that reading from Kafka
> could result in uneven processing of data?
>
>>> By consume do you mean the downstream system?
>> Yes.
>
> My downstream is a TSDB and other DBs where the data will be written.
> All of this is time-series data.
>
> Thanks,
> Hemant
>
> On Tue, May 12, 2020 at 5:28 PM Khachatryan Roman <
> khachatryan.ro...@gmail.com> wrote:
>
>> Hello Hemant,
>>
>> Thanks for your reply.
>>
>> I think the partitioning you described (event type + protocol type) is
>> subject to data skew. Including a device ID should solve this problem.
>> Also, including "protocol_type" in the key and having a topic per
>> protocol_type seems redundant.
>>
>> Furthermore, do you have any particular reason to maintain multiple
>> topics?
>> I could imagine protocols have different speeds or other
>> characteristics, so you can tune Flink accordingly.
>> Otherwise, having a single topic partitioned only by device ID would
>> simplify deployment and reduce data skew.
>>
>>> By consume do you mean the downstream system?
>> Yes.
>>
>> Regards,
>> Roman
>>
>> On Mon, May 11, 2020 at 11:30 PM hemant singh <hemant2...@gmail.com>
>> wrote:
>>
>>> Hello Roman,
>>>
>>> PFB my response -
>>>
>>>> As I understand, each protocol has a distinct set of event types
>>>> (where event type == metrics type) and a distinct set of devices. Is
>>>> this correct?
>>> Yes, correct: distinct events and devices. Each device emits these
>>> events.
>>>
>>>>> Based on data protocol I have 4-5 topics. Currently the data for a
>>>>> single event is being pushed to a partition of the Kafka topic
>>>>> (producer key -> event_type + data_protocol).
>>>> Here you are talking about the source (to the Flink job), right?
>>> Yes, you are right.
>>>
>>>> Can you also share how you are going to consume these data?
>>> By consume do you mean the downstream system?
>>> If yes, then this data will be written to a DB; some metrics go to a
>>> TSDB (Influx) as well.
>>>
>>> Thanks,
>>> Hemant
>>>
>>> On Tue, May 12, 2020 at 2:08 AM Khachatryan Roman <
>>> khachatryan.ro...@gmail.com> wrote:
>>>
>>>> Hi Hemant,
>>>>
>>>> As I understand, each protocol has a distinct set of event types
>>>> (where event type == metrics type) and a distinct set of devices. Is
>>>> this correct?
>>>>
>>>>> Based on data protocol I have 4-5 topics. Currently the data for a
>>>>> single event is being pushed to a partition of the Kafka topic
>>>>> (producer key -> event_type + data_protocol).
>>>> Here you are talking about the source (to the Flink job), right?
>>>>
>>>> Can you also share how you are going to consume these data?
>>>>
>>>> Regards,
>>>> Roman
>>>>
>>>> On Mon, May 11, 2020 at 8:57 PM hemant singh <hemant2...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have different events from a device which constitute different
>>>>> metrics for the same device. Each of these events is produced by the
>>>>> device at an interval of a few milliseconds to a minute.
>>>>>
>>>>> Event1(Device1) -> Stream1 -> Metric1
>>>>> Event2(Device1) -> Stream2 -> Metric2
>>>>> ...
>>>>> Event100(Device1) -> Stream100 -> Metric100
>>>>>
>>>>> The number of events can go up to a few hundred for each data
>>>>> protocol, and we have around 4-5 data protocols. Metrics from
>>>>> different streams make up a record; for example, for Device1:
>>>>>
>>>>> Device1 -> Metric1, Metric2, Metric15 forms a single record for the
>>>>> device. Currently, in the development phase, I am using an interval
>>>>> join to achieve this, that is, to create a record with the latest
>>>>> data from the different streams (events).
>>>>>
>>>>> Based on data protocol I have 4-5 topics. Currently the data for a
>>>>> single event is being pushed to a partition of the Kafka topic
>>>>> (producer key -> event_type + data_protocol). So essentially one
>>>>> topic is made up of many streams, and I am filtering on the key to
>>>>> define the streams.
>>>>>
>>>>> My question is: is this the correct way to stream the data? I had
>>>>> thought of maintaining a different topic per event; however, in that
>>>>> case the number of topics could go to a few thousand, which becomes a
>>>>> little challenging to maintain, and I am not sure Kafka handles that
>>>>> well.
>>>>>
>>>>> I know there are traditional ways to do this, like pushing the data
>>>>> to a time-series DB and then joining data for different metrics, but
>>>>> that will never scale; also, this processing should be as close to
>>>>> real time as possible.
>>>>>
>>>>> Are there better ways to handle this use case, or am I on the correct
>>>>> path?
>>>>>
>>>>> Thanks,
>>>>> Hemant
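
A minimal sketch of the pattern under discussion, for concreteness: split a
per-protocol Kafka topic into per-event-type streams by filtering, key both
sides by device_id as Roman suggests, and interval-join them into a
per-device record. This is illustrative only, not the actual job from the
thread; Event, DeviceRecord, EventSchema, the topic name "protocol-a", and
the 30-second join bounds are hypothetical placeholders, and it assumes the
Flink 1.11+ DataStream API with the FlinkKafkaConsumer connector.

import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.Properties;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.AbstractDeserializationSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.util.Collector;

public class DeviceRecordJob {

    // Minimal event POJO; real per-protocol schemas would differ.
    public static class Event {
        public String deviceId;
        public String eventType;
        public long eventTime;   // epoch millis
        public double value;
    }

    // The assembled per-device record (two metrics here for brevity).
    public static class DeviceRecord {
        public String deviceId;
        public double metric1;
        public double metric2;

        public DeviceRecord(String deviceId, double metric1, double metric2) {
            this.deviceId = deviceId;
            this.metric1 = metric1;
            this.metric2 = metric2;
        }
    }

    // Assumed wire format: CSV "deviceId,eventType,eventTimeMillis,value".
    public static class EventSchema extends AbstractDeserializationSchema<Event> {
        @Override
        public Event deserialize(byte[] message) {
            String[] f = new String(message, StandardCharsets.UTF_8).split(",");
            Event e = new Event();
            e.deviceId = f[0];
            e.eventType = f[1];
            e.eventTime = Long.parseLong(f[2]);
            e.value = Double.parseDouble(f[3]);
            return e;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");
        props.setProperty("group.id", "device-metrics");

        // One topic per protocol, carrying many event types.
        DataStream<Event> protocolA = env
                .addSource(new FlinkKafkaConsumer<>("protocol-a", new EventSchema(), props))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                                .withTimestampAssigner((e, ts) -> e.eventTime));

        // "Filtering on the key to define the streams": one logical stream
        // per event type, carved out of the shared topic.
        DataStream<Event> metric1 = protocolA.filter(e -> "metric1".equals(e.eventType));
        DataStream<Event> metric2 = protocolA.filter(e -> "metric2".equals(e.eventType));

        // Key both sides by device_id and join events close in event time.
        // Wider bounds tolerate speed differences between streams at the
        // cost of keeping more state.
        DataStream<DeviceRecord> joined = metric1
                .keyBy(e -> e.deviceId)
                .intervalJoin(metric2.keyBy(e -> e.deviceId))
                .between(Time.seconds(-30), Time.seconds(30))
                .process(new ProcessJoinFunction<Event, Event, DeviceRecord>() {
                    @Override
                    public void processElement(Event left, Event right, Context ctx,
                                               Collector<DeviceRecord> out) {
                        out.collect(new DeviceRecord(left.deviceId, left.value, right.value));
                    }
                });

        joined.print(); // stand-in for the TSDB / DB sinks mentioned above

        env.execute("per-device record assembly");
    }
}

The between() bounds are one answer to the speed-mismatch question raised
above: events from the slower protocol remain joinable as long as the
bounds cover the lag between streams, at the cost of the join operator
keeping more state.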