Hello Flink Users,

Any views on this question of mine?
Thanks,
Hemant

On Tue, May 12, 2020 at 7:00 PM hemant singh <hemant2...@gmail.com> wrote:

> Hello Roman,
>
> Thanks for your response.
>
>> I think the partitioning you described (event type + protocol type) is
>> subject to data skew. Including a device ID should solve this problem.
>> Also, including "protocol_type" in the key and having a topic per
>> protocol_type seems redundant.
>
> Each protocol is in a single topic, and event_type is the key used to
> distribute data to a specific partition.
>
>> Furthermore, do you have any particular reason to maintain multiple
>> topics?
>> I could imagine protocols have different speeds or other
>> characteristics, so you can tune Flink accordingly.
>> Otherwise, having a single topic partitioned only by device ID would
>> simplify deployment and reduce data skew.
>
> Yes, you are right. These protocols have separate characteristics, like
> speed and data format. If I have only one topic, with data partitioned by
> device_id, it could be that events from a faster protocol are processed
> faster, and the joins I want to do will not have enough matching data.
> I have a question here: how would you tune Flink to handle different
> stream characteristics, such as speed, given that reading from Kafka
> could result in uneven processing of data?
>
>>> By consume do you mean the downstream system?
>> Yes.
>
> My downstream is a TSDB and other DBs where the data will be written.
> All of this is time-series data.
>
> Thanks,
> Hemant
>
> On Tue, May 12, 2020 at 5:28 PM Khachatryan Roman <
> khachatryan.ro...@gmail.com> wrote:
>
>> Hello Hemant,
>>
>> Thanks for your reply.
>>
>> I think the partitioning you described (event type + protocol type) is
>> subject to data skew. Including a device ID should solve this problem.
>> Also, including "protocol_type" in the key and having a topic per
>> protocol_type seems redundant.
>>
>> Furthermore, do you have any particular reason to maintain multiple
>> topics?
>> I could imagine protocols have different speeds or other
>> characteristics, so you can tune Flink accordingly.
>> Otherwise, having a single topic partitioned only by device ID would
>> simplify deployment and reduce data skew.
>>
>>> By consume do you mean the downstream system?
>> Yes.
>>
>> Regards,
>> Roman
>>
>> On Mon, May 11, 2020 at 11:30 PM hemant singh <hemant2...@gmail.com>
>> wrote:
>>
>>> Hello Roman,
>>>
>>> PFB my response -
>>>
>>>> As I understand, each protocol has a distinct set of event types
>>>> (where event type == metrics type) and a distinct set of devices. Is
>>>> this correct?
>>> Yes, correct: distinct events and devices. Each device emits these
>>> events.
>>>
>>>>> Based on data protocol I have 4-5 topics. Currently the data for a
>>>>> single event is being pushed to a partition of the Kafka topic
>>>>> (producer key -> event_type + data_protocol).
>>>> Here you are talking about the source (to the Flink job), right?
>>> Yes, you are right.
>>>
>>>> Can you also share how you are going to consume these data?
>>> By consume do you mean the downstream system?
>>> If yes, then this data will be written to a DB; some metrics go to a
>>> TSDB (Influx) as well.
>>>
>>> Thanks,
>>> Hemant
>>>
>>> On Tue, May 12, 2020 at 2:08 AM Khachatryan Roman <
>>> khachatryan.ro...@gmail.com> wrote:
>>>
>>>> Hi Hemant,
>>>>
>>>> As I understand, each protocol has a distinct set of event types
>>>> (where event type == metrics type) and a distinct set of devices. Is
>>>> this correct?
>>>>
>>>>> Based on data protocol I have 4-5 topics. Currently the data for a
>>>>> single event is being pushed to a partition of the Kafka topic
>>>>> (producer key -> event_type + data_protocol).
>>>> Here you are talking about the source (to the Flink job), right?
>>>>
>>>> Can you also share how you are going to consume these data?
>>>>
>>>> Regards,
>>>> Roman
>>>>
>>>> On Mon, May 11, 2020 at 8:57 PM hemant singh <hemant2...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have different events from a device which constitute different
>>>>> metrics for the same device. Each of these events is produced by the
>>>>> device at an interval of a few milliseconds to a minute.
>>>>>
>>>>> Event1(Device1) -> Stream1 -> Metric1
>>>>> Event2(Device1) -> Stream2 -> Metric2
>>>>> ...
>>>>> Event100(Device1) -> Stream100 -> Metric100
>>>>>
>>>>> The number of events can go up to a few hundred for each data
>>>>> protocol, and we have around 4-5 data protocols. Metrics from
>>>>> different streams make up a record; for example, for Device1:
>>>>>
>>>>> Device1 -> Metric1, Metric2, Metric15 forms a single record for the
>>>>> device. Currently, in the development phase, I am using an interval
>>>>> join to achieve this, that is, to create a record with the latest
>>>>> data from the different streams (events).
>>>>>
>>>>> Based on data protocol I have 4-5 topics. Currently the data for a
>>>>> single event is being pushed to a partition of the Kafka topic
>>>>> (producer key -> event_type + data_protocol). So essentially one
>>>>> topic is made up of many streams, and I am filtering on the key to
>>>>> define the streams.
>>>>>
>>>>> My question is: is this the correct way to stream the data? I had
>>>>> thought of maintaining a different topic per event; however, in that
>>>>> case the number of topics could go to a few thousand, which becomes a
>>>>> little challenging to maintain, and I am not sure Kafka handles that
>>>>> well.
>>>>>
>>>>> I know there are traditional ways to do this, like pushing the data
>>>>> to a time-series DB and then joining data for different metrics, but
>>>>> that will never scale; also, this processing should be as close to
>>>>> real time as possible.
>>>>>
>>>>> Are there better ways to handle this use case, or am I on the correct
>>>>> path?
>>>>>
>>>>> Thanks,
>>>>> Hemant
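
A minimal sketch of the pattern under discussion, for concreteness: split a
per-protocol Kafka topic into per-event-type streams by filtering, key both
sides by device_id as Roman suggests, and interval-join them into a
per-device record. This is illustrative only, not the actual job from the
thread; Event, DeviceRecord, EventSchema, the topic name "protocol-a", and
the 30-second join bounds are hypothetical placeholders, and it assumes the
Flink 1.11+ DataStream API with the FlinkKafkaConsumer connector.

import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.Properties;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.AbstractDeserializationSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.util.Collector;

public class DeviceRecordJob {

    // Minimal event POJO; real per-protocol schemas would differ.
    public static class Event {
        public String deviceId;
        public String eventType;
        public long eventTime;   // epoch millis
        public double value;
    }

    // The assembled per-device record (two metrics here for brevity).
    public static class DeviceRecord {
        public String deviceId;
        public double metric1;
        public double metric2;

        public DeviceRecord(String deviceId, double metric1, double metric2) {
            this.deviceId = deviceId;
            this.metric1 = metric1;
            this.metric2 = metric2;
        }
    }

    // Assumed wire format: CSV "deviceId,eventType,eventTimeMillis,value".
    public static class EventSchema extends AbstractDeserializationSchema<Event> {
        @Override
        public Event deserialize(byte[] message) {
            String[] f = new String(message, StandardCharsets.UTF_8).split(",");
            Event e = new Event();
            e.deviceId = f[0];
            e.eventType = f[1];
            e.eventTime = Long.parseLong(f[2]);
            e.value = Double.parseDouble(f[3]);
            return e;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");
        props.setProperty("group.id", "device-metrics");

        // One topic per protocol, carrying many event types.
        DataStream<Event> protocolA = env
                .addSource(new FlinkKafkaConsumer<>("protocol-a", new EventSchema(), props))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                                .withTimestampAssigner((e, ts) -> e.eventTime));

        // "Filtering on the key to define the streams": one logical stream
        // per event type, carved out of the shared topic.
        DataStream<Event> metric1 = protocolA.filter(e -> "metric1".equals(e.eventType));
        DataStream<Event> metric2 = protocolA.filter(e -> "metric2".equals(e.eventType));

        // Key both sides by device_id and join events close in event time.
        // Wider bounds tolerate speed differences between streams at the
        // cost of keeping more state.
        DataStream<DeviceRecord> joined = metric1
                .keyBy(e -> e.deviceId)
                .intervalJoin(metric2.keyBy(e -> e.deviceId))
                .between(Time.seconds(-30), Time.seconds(30))
                .process(new ProcessJoinFunction<Event, Event, DeviceRecord>() {
                    @Override
                    public void processElement(Event left, Event right, Context ctx,
                                               Collector<DeviceRecord> out) {
                        out.collect(new DeviceRecord(left.deviceId, left.value, right.value));
                    }
                });

        joined.print(); // stand-in for the TSDB / DB sinks mentioned above

        env.execute("per-device record assembly");
    }
}

The between() bounds are one answer to the speed-mismatch question raised
above: events from the slower protocol remain joinable as long as the
bounds cover the lag between streams, at the cost of the join operator
keeping more state.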