Hi,

sounds like a cool project.

What's the size of one data point? If one data point is 2 KB, you'll have 100 800 000 * 2048 bytes = roughly 206 gigabytes of state. That's something one or two machines (depending on the disk throughput) should be able to handle.
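
The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the 2 KB record size is the assumption made above, the per-minute rate, 7-day horizon and 10,000 sensors come from the original message, and the function name `state_size_bytes` is made up for illustration. It counts raw payload only and ignores key overhead, compression and checkpoint copies.

```python
def state_size_bytes(sensors: int, retention_days: int, bytes_per_point: int,
                     points_per_minute: int = 1) -> int:
    """Raw payload size of all data points kept for the lateness horizon."""
    points_per_sensor = retention_days * 24 * 60 * points_per_minute
    return sensors * points_per_sensor * bytes_per_point


# Assumed inputs: 10,000 sensors, 7 days of lateness, 2 KB per data point.
size = state_size_bytes(sensors=10_000, retention_days=7, bytes_per_point=2048)
print(f"{size / 1e9:.1f} GB, {size / 2**30:.1f} GiB")
```

Changing `sensors` or `bytes_per_point` gives a quick feel for how the state grows as more sensors are added.
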
If possible, I would recommend running an experiment with a prototype to see how many machines you need for your workload.

On Fri, Feb 24, 2017 at 5:41 PM, Tzu-Li (Gordon) Tai <tzuli...@apache.org> wrote:

> Hi Patrick,
>
> Thanks a lot for the feedback on your use case! At first glance, I would say
> that Flink can definitely solve the issues you are evaluating.
>
> I’ll try to explain them, and point you to some docs / articles that
> explain them in more detail:
>
> *- Lateness*
>
> The 7-day lateness shouldn’t be a problem. We definitely recommend
> using RocksDB as the state backend for such a use case since, as you
> correctly mentioned, the state would be kept for a long time.
> The heavy burst when the data buffered locally on your machines is
> sent to Kafka once they come back online shouldn’t be a problem either;
> since Flink is a pure data streaming engine, it handles backpressure
> naturally without any additional mechanisms (I would recommend
> taking a look at http://data-artisans.com/how-flink-handles-backpressure/).
>
> *- Out of Order*
>
> That’s exactly what event time processing is for :-) As long as the event
> comes in within the allowed lateness for windows, the event will still fall
> into its corresponding event time window. So, even with the heavy burst of
> your late machine data, the events will still be aggregated in the correct
> windows.
> You can read about event time in Flink in more detail in the event time
> docs:
> https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/event_time.html
>
> *- Last write wins*
>
> The operators that do the aggregations simply need to be able to
> recompute results if they see an event with the same id come in. Now, if
> results are sent out of Flink and stored in an external db, and you can
> design the db writes to be idempotent, then it’ll effectively be a
> “last write wins”. It depends mostly on your pipeline and use case.
>
> *- Computations per minute*
>
> I think you can simply do this by having two separate window operators:
> one that works on your longer window, and another on a per-minute basis.
>
> Hope this helps!
>
> - Gordon
>
> On February 24, 2017 at 10:49:14 PM, Patrick Brunmayr (j...@kpibench.com)
> wrote:
>
> Hello
>
> I've done my first steps with Flink and I am very impressed by its
> capabilities. Thank you for that :) I want to use it for a project we are
> currently working on. After reading some documentation
> I am not sure if it's the right tool for the job. We have an IoT
> application in which we are monitoring machines in production plants. The
> machines have sensors attached and they are sending
> their data to a broker (Kafka, Azure IoT Hub), currently on a per-minute
> basis.
>
> The following requirements must be fulfilled:
>
> - Lateness
>
> We have to allow lateness of 7 days because machines can have downtime
> due to network issues, maintenance or something else. If that's the case,
> the data is buffered locally on the machine, and once the machine is
> online again all data will be sent to the broker. This can result
> in some really heavy bursts.
>
> - Out of order
>
> Events come out of order due to these lateness issues.
>
> - Last write wins
>
> Machines are not stateful and cannot guarantee exactly-once sending
> of their data. It can happen that events are sometimes sent twice. In that
> case the last event wins and should override the previous one.
> Events are uniquely identified by a sensor_id and a timestamp.
>
> - Computations per minute
>
> We cannot wait until the window ends and have to do computations on
> a per-minute basis, for example aggregating data per sensor and writing it
> to a db.
>
> My biggest concern in that case is the huge lateness. Keeping data for 7
> days would result in 10080 data points for just one sensor! Multiplying
> that by 10.000 sensors would result in 100800000 data points which Flink
> would have to handle in its state.
> The number of sensors is constantly
> growing, and so will the number of data points.
>
> So my questions are:
>
> - Is Flink the right tool for the job?
> - Is that lateness an issue?
> - How can I implement last write wins?
> - How to tune Flink to handle that growing load of sensors and data
>   points?
> - Hardware requirements, storage and memory size?
>
> I don't want to maintain two code bases for batch and streaming because the
> operations are all the same. The only difference is the time range! That's the
> reason I wanted to do all this with Flink Streaming.
>
> Hope you can guide me in the right direction
>
> Thx
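
On the "last write wins" requirement in the quoted thread: one possible way to make the external db writes idempotent is to use (sensor_id, timestamp) as the primary key and upsert on it. The sketch below uses SQLite from the Python standard library purely for illustration; the table name `readings`, its columns and the helper `write_reading` are assumptions, not something from the thread, and "last" here means the last write to reach the database.

```python
import sqlite3

# In-memory database standing in for the external store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings ("
    " sensor_id TEXT NOT NULL,"
    " ts INTEGER NOT NULL,"
    " value REAL NOT NULL,"
    " PRIMARY KEY (sensor_id, ts))"
)


def write_reading(sensor_id: str, ts: int, value: float) -> None:
    """Idempotent write: a row with the same (sensor_id, ts) is overwritten."""
    conn.execute(
        "INSERT OR REPLACE INTO readings (sensor_id, ts, value) VALUES (?, ?, ?)",
        (sensor_id, ts, value),
    )


write_reading("s1", 1487952000, 20.5)
write_reading("s1", 1487952000, 21.0)  # same key sent twice: later write wins
write_reading("s1", 1487952060, 19.0)

rows = conn.execute(
    "SELECT sensor_id, ts, value FROM readings ORDER BY ts"
).fetchall()
print(rows)
```

Replaying the same event any number of times leaves exactly one row per (sensor_id, timestamp), which is what makes reprocessing after a late burst safe.
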