Hi, 100 million rows is small load, especially for 1 week. I suspect that your load would be quite evenly distributed during the day as it's plant not humans. If you look for reliability, make 2 Kafka servers at least where each topic has 6 partitions. And separate Hadoop cluster for Flink.
As for duplicate messages, it's not a problem of Flink or Cassandra. It's a logical problem, i.e. it's up to you how to achieve exactly once semantics. I advise you to use some storage anyway for reliability and failover. Sincerely yours, Timur Shenkao On Friday, February 24, 2017, Patrick Brunmayr <j...@kpibench.com> wrote: > Hi > > Yes it is and would be nice to handle this with Flink :) > > - > *Size of data point* > The size of a data point is basically just a simple case class with two > fields in a 64bit OS > > case class MachineData(sensorId: String, eventTime:Long) > > *- Last write wins* > > We have cassandra as data warehouse but i was hoping i could solve that > issue in the state level rather than in the db level. The reason beeing is > one could send me the same events > over and over again and this will cause that state to blow up until out of > memory. Secondly by doing aggregations per sensor results will be wrong due > multiple events with the same > timestamp. > > thx > > > > > > 2017-02-24 17:47 GMT+01:00 Robert Metzger <rmetz...@apache.org > <javascript:_e(%7B%7D,'cvml','rmetz...@apache.org');>>: > >> Hi, >> sounds like a cool project. >> >> What's the size of one data point? >> If one datapoint is 2 kb, you'll have 100 800 000 * 2048 bytes = 206 >> gigabytes of state. That's something one or two machines (depending on the >> disk throughput) should be able to handle. >> >> If possible, I would recommend you to do an experiment using a prototype >> to see how many machines you need for your workload. >> >> On Fri, Feb 24, 2017 at 5:41 PM, Tzu-Li (Gordon) Tai <tzuli...@apache.org >> <javascript:_e(%7B%7D,'cvml','tzuli...@apache.org');>> wrote: >> >>> Hi Patrick, >>> >>> Thanks a lot for feedback on your use case! At a first glance, I would >>> say that Flink can definitely solve the issues you are evaluating. >>> >>> I’ll try to explain them, and point you to some docs / articles that can >>> further explain in detail: >>> >>> *- Lateness* >>> >>> The 7-day lateness shouldn’t be a problem. We definitely recommend >>> using RocksDB as the state backend for such a use case, as you >>> mentioned correctly, the state would be kept for a long time. >>> The heavy burst when your locally buffered data on machines are >>> sent to Kafka once they come back online shouldn’t be a problem either; >>> since Flink is a pure data streaming engine, it handles backpressure >>> naturally without any additional mechanisms (I would recommend >>> taking a look at http://data-artisans.com/how-f >>> link-handles-backpressure/). >>> >>> *- Out of Order* >>> >>> That’s exactly what event time processing is for :-) As long as the event >>> comes in before the allowed lateness for windows, the event will still >>> fall >>> into its corresponding event time window. So, even with the heavy burst >>> of >>> the your late machine data, they will still be aggregated in the correct >>> windows. >>> You can look into event time in Flink with more detail in the event time >>> docs: >>> https://ci.apache.org/projects/flink/flink-docs-release-1.3/ >>> dev/event_time.html >>> >>> *- Last write wins* >>> >>> Your operators that does the aggregations simply need to be able to >>> reprocess >>> results if it sees an event with the same id come in. Now, if results >>> are sent out >>> of Flink and stored in an external db, if you can design the db writes >>> to be idempotent, >>> then it’ll effectively be a “last write wins”. It depends mostly on your >>> pipeline and >>> use case. >>> >>> *- Computations per minute* >>> I think you can simply do this by having two separate window operators. >>> One that works on your longer window, and another on a per-minute basis. >>> >>> Hope this helps! >>> >>> - Gordon >>> >>> On February 24, 2017 at 10:49:14 PM, Patrick Brunmayr (j...@kpibench.com >>> <javascript:_e(%7B%7D,'cvml','j...@kpibench.com');>) wrote: >>> >>> Hello >>> >>> I've done my first steps with Flink and i am very impressed of its >>> capabilities. Thank you for that :) I want to use it for a project we are >>> currently working on. After reading some documentation >>> i am not sure if it's the right tool for the job. We have an IoT >>> application in which we are monitoring machines in production plants. The >>> machines have sensors attached and they are sending >>> their data to a broker ( Kafka, Azure Iot Hub ) currently on a per >>> minute basis. >>> >>> Following requirements must be fulfilled >>> >>> >>> - Lateness >>> >>> We have to allow lateness for 7 days because machines can have down >>> time due network issues, maintenance or something else. If thats the case >>> buffering of data happens localy on the machine and once they >>> are online again all data will be sent to the broker. This can >>> result in some relly heavy burst. >>> >>> >>> - Out of order >>> >>> Events come out of order due this lateness issues >>> >>> >>> - Last write wins >>> >>> Machines are not stateful and can not guarantee exactly once sending >>> of their data. It can happen that sometimes events are sent twice. In >>> that >>> case the last event wins and should override the previous one. >>> Events are unique due a sensor_id and a timestamp >>> >>> - Computations per minute >>> >>> We can not wait until the windows ends and have to do computations >>> on a per minute basis. For example aggregating data per sensor and >>> writing >>> it to a db >>> >>> >>> My biggest concern in that case is the huge lateness. Keeping data for 7 >>> days would result in 10080 data points for just one sensor! Multiplying >>> that by 10.000 sensors would result in 100800000 datapoints which Flink >>> would have to handle in its state. The number of sensors are constantly >>> growing so will the number of data points >>> >>> So my questions are >>> >>> >>> - Is Flink the right tool for the Job ? >>> >>> - Is that lateness an issue ? >>> >>> - How can i implement the Last write wins ? >>> >>> - How to tune flink to handle that growing load of sensors and data >>> points ? >>> >>> - Hardware requirements, storage and memory size ? >>> >>> >>> >>> I don't want to maintain two code base for batch and streaming because >>> the operations are all equal. The only difference is the time range! Thats >>> the reason i wanted to do all this with Flink Streaming. >>> >>> Hope you can guide me in the right direction >>> >>> Thx >>> >>> >>> >>> >>> >>> >>> >>> >> >