Hi, just to throw in my 2 cents: if your window operations don't require that all individual elements are kept, you can greatly reduce your state size by using a ReduceFunction on your window. With this, the state size would essentially become <per-item-size> * <num keys> * <num windows>.
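For example, a minimal sketch of what that could look like (assuming a
hypothetical numeric "value" field on top of the MachineData mentioned
below, plus a per-minute tumbling window; adjust to your actual job):

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
    import org.apache.flink.streaming.api.windowing.time.Time

    // Hypothetical event type: the thread's MachineData plus a numeric payload.
    case class MachineData(sensorId: String, eventTime: Long, value: Double)

    def perSensorSums(stream: DataStream[MachineData]): DataStream[MachineData] =
      stream
        .keyBy(_.sensorId)
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        .allowedLateness(Time.days(7))
        // With a ReduceFunction, only the running aggregate is kept as window
        // state, not every element that fell into the window.
        .reduce((a, b) =>
          MachineData(a.sensorId, math.max(a.eventTime, b.eventTime), a.value + b.value))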
Best,
Aljoscha

On Sun, 26 Feb 2017 at 14:16 Timur Shenkao <t...@timshenkao.su> wrote:

> Hi,
>
> 100 million rows is a small load, especially for 1 week.
> I suspect that your load would be quite evenly distributed during the
> day, as it's a plant, not humans.
> If you are looking for reliability, set up at least 2 Kafka servers,
> where each topic has 6 partitions. And a separate Hadoop cluster for
> Flink.
>
> As for duplicate messages, it's not a problem of Flink or Cassandra.
> It's a logical problem, i.e. it's up to you how to achieve exactly-once
> semantics.
> I advise you to use some storage anyway for reliability and failover.
>
> Sincerely yours, Timur Shenkao
>
>
> On Friday, February 24, 2017, Patrick Brunmayr <j...@kpibench.com> wrote:
>
> Hi
>
> Yes it is, and it would be nice to handle this with Flink :)
>
> - *Size of data point*
>
> The size of a data point is basically just that of a simple case class
> with two fields on a 64-bit OS:
>
> case class MachineData(sensorId: String, eventTime: Long)
>
> *- Last write wins*
>
> We have Cassandra as a data warehouse, but I was hoping I could solve
> that issue at the state level rather than at the db level. The reason
> being that someone could send me the same events over and over again,
> and this would cause the state to blow up until out of memory. Secondly,
> when doing aggregations per sensor, results will be wrong due to
> multiple events with the same timestamp.
>
> thx
>
>
> 2017-02-24 17:47 GMT+01:00 Robert Metzger <rmetz...@apache.org>:
>
> Hi,
> sounds like a cool project.
>
> What's the size of one data point?
> If one data point is 2 kb, you'll have 100 800 000 * 2048 bytes = 206
> gigabytes of state. That's something one or two machines (depending on
> the disk throughput) should be able to handle.
>
> If possible, I would recommend you to do an experiment using a prototype
> to see how many machines you need for your workload.
>
> On Fri, Feb 24, 2017 at 5:41 PM, Tzu-Li (Gordon) Tai <tzuli...@apache.org>
> wrote:
>
> Hi Patrick,
>
> Thanks a lot for the feedback on your use case! At a first glance, I
> would say that Flink can definitely solve the issues you are evaluating.
>
> I'll try to explain them, and point you to some docs / articles that
> explain things in more detail:
>
> *- Lateness*
>
> The 7-day lateness shouldn't be a problem. We definitely recommend
> using RocksDB as the state backend for such a use case since, as you
> mentioned correctly, the state would be kept for a long time.
> The heavy burst when your locally buffered machine data is sent to
> Kafka once the machines come back online shouldn't be a problem either;
> since Flink is a pure data streaming engine, it handles backpressure
> naturally without any additional mechanisms (I would recommend
> taking a look at http://data-artisans.com/how-flink-handles-backpressure/
> ).
>
> *- Out of order*
>
> That's exactly what event time processing is for :-) As long as an event
> comes in before the allowed lateness for windows expires, the event will
> still fall into its corresponding event time window. So, even with the
> heavy burst of your late machine data, the events will still be
> aggregated in the correct windows.
> You can look into event time in Flink in more detail in the event time
> docs:
> https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/event_time.html
>
> *- Last write wins*
>
> Your operators that do the aggregations simply need to be able to
> reprocess results if they see an event with the same id come in. Now, if
> results are sent out of Flink and stored in an external db, and you can
> design the db writes to be idempotent, then it'll effectively be a "last
> write wins". It depends mostly on your pipeline and use case.
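> To illustrate (just a sketch, not your actual pipeline: the "value"
> field is hypothetical, and keying by (sensorId, eventTime) is only one
> way to do it), a last-write-wins dedup inside Flink could look like:
>
>     import org.apache.flink.streaming.api.scala._
>     import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
>     import org.apache.flink.streaming.api.windowing.time.Time
>
>     case class MachineData(sensorId: String, eventTime: Long, value: Double)
>
>     // Duplicates share (sensorId, eventTime), so they land on the same
>     // key; keeping only the latest copy means the last write wins. The
>     // window state is dropped once the watermark passes the end of the
>     // window plus the allowed lateness, so it does not grow forever.
>     def lastWriteWins(stream: DataStream[MachineData]): DataStream[MachineData] =
>       stream
>         .keyBy(e => (e.sensorId, e.eventTime))
>         .window(TumblingEventTimeWindows.of(Time.minutes(1)))
>         .allowedLateness(Time.days(7))
>         .reduce((_, latest) => latest)
>
> Note that with allowed lateness, a late duplicate triggers an updated
> firing of the window, so the downstream write should still be idempotent.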
>
> *- Computations per minute*
>
> I think you can simply do this by having two separate window operators:
> one that works on your longer window, and another on a per-minute basis.
>
> Hope this helps!
>
> - Gordon
>
> On February 24, 2017 at 10:49:14 PM, Patrick Brunmayr (j...@kpibench.com)
> wrote:
>
> Hello
>
> I've done my first steps with Flink and I am very impressed by its
> capabilities. Thank you for that :) I want to use it for a project we
> are currently working on. After reading some documentation, I am not
> sure if it's the right tool for the job. We have an IoT application in
> which we are monitoring machines in production plants. The machines have
> sensors attached, and they are sending their data to a broker (Kafka,
> Azure IoT Hub), currently on a per-minute basis.
>
> The following requirements must be fulfilled:
>
> - Lateness
>
> We have to allow lateness of 7 days because machines can have down time
> due to network issues, maintenance or something else. If that's the
> case, buffering of data happens locally on the machine, and once the
> machines are online again all data will be sent to the broker. This can
> result in some really heavy bursts.
>
> - Out of order
>
> Events come out of order due to these lateness issues.
>
> - Last write wins
>
> Machines are not stateful and cannot guarantee exactly-once sending of
> their data. It can happen that sometimes events are sent twice. In that
> case the last event wins and should override the previous one. Events
> are unique due to a sensor_id and a timestamp.
>
> - Computations per minute
>
> We cannot wait until the window ends and have to do computations on a
> per-minute basis. For example, aggregating data per sensor and writing
> it to a db.
>
> My biggest concern in that case is the huge lateness. Keeping data for
> 7 days would result in 10080 data points for just one sensor! Multiplying
> that by 10,000 sensors would result in 100,800,000 data points which
> Flink would have to handle in its state. The number of sensors is
> constantly growing, and so will the number of data points.
>
> So my questions are:
>
> - Is Flink the right tool for the job?
> - Is that lateness an issue?
> - How can I implement the last write wins?
> - How to tune Flink to handle that growing load of sensors and data
> points?
> - Hardware requirements, storage and memory size?
>
> I don't want to maintain two code bases for batch and streaming because
> the operations are all equal. The only difference is the time range!
> That's the reason I wanted to do all this with Flink Streaming.
>
> Hope you can guide me in the right direction
>
> Thx
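For the "two separate window operators" idea from Gordon's reply, a rough
sketch (same hypothetical "value" field as above; the window sizes are
examples only):

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
    import org.apache.flink.streaming.api.windowing.time.Time

    case class MachineData(sensorId: String, eventTime: Long, value: Double)

    def sum(a: MachineData, b: MachineData): MachineData =
      MachineData(a.sensorId, math.max(a.eventTime, b.eventTime), a.value + b.value)

    def aggregates(stream: DataStream[MachineData]): Unit = {
      val keyed = stream.keyBy(_.sensorId)

      // Fast path: per-minute results, e.g. for continuously writing to the db.
      val perMinute = keyed
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        .allowedLateness(Time.days(7))
        .reduce(sum _)

      // Slow path: the same aggregation over the longer window.
      val perWeek = keyed
        .window(TumblingEventTimeWindows.of(Time.days(7)))
        .allowedLateness(Time.days(7))
        .reduce(sum _)

      // Both results would then go to sinks, e.g. Cassandra.
      perMinute.print()
      perWeek.print()
    }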