Hi, just to throw in my 2 cents: if your window operations don't require that all individual elements are kept, you can greatly reduce your state size by using a ReduceFunction on your window. With this, the state size would essentially become <per-item-size> * <num keys> * <num windows>.
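For example, a minimal sketch of what that could look like (assuming a
hypothetical numeric "value" field on top of the MachineData mentioned
below, plus a per-minute tumbling window; adjust to your actual job):

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
    import org.apache.flink.streaming.api.windowing.time.Time

    // Hypothetical event type: the thread's MachineData plus a numeric payload.
    case class MachineData(sensorId: String, eventTime: Long, value: Double)

    def perSensorSums(stream: DataStream[MachineData]): DataStream[MachineData] =
      stream
        .keyBy(_.sensorId)
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        .allowedLateness(Time.days(7))
        // With a ReduceFunction, only the running aggregate is kept as window
        // state, not every element that fell into the window.
        .reduce((a, b) =>
          MachineData(a.sensorId, math.max(a.eventTime, b.eventTime), a.value + b.value))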
Best,
Aljoscha

On Sun, 26 Feb 2017 at 14:16 Timur Shenkao <t...@timshenkao.su> wrote:

> Hi,
>
> 100 million rows is a small load, especially for 1 week.
> I suspect that your load would be quite evenly distributed during the
> day, as it's a plant, not humans.
> If you are looking for reliability, set up at least 2 Kafka servers,
> where each topic has 6 partitions. And a separate Hadoop cluster for
> Flink.
>
> As for duplicate messages, it's not a problem of Flink or Cassandra.
> It's a logical problem, i.e. it's up to you how to achieve exactly-once
> semantics.
> I advise you to use some storage anyway for reliability and failover.
>
> Sincerely yours, Timur Shenkao
>
>
> On Friday, February 24, 2017, Patrick Brunmayr <j...@kpibench.com> wrote:
>
> Hi
>
> Yes it is, and it would be nice to handle this with Flink :)
>
> - *Size of data point*
>
> The size of a data point is basically just that of a simple case class
> with two fields on a 64-bit OS:
>
> case class MachineData(sensorId: String, eventTime: Long)
>
> *- Last write wins*
>
> We have Cassandra as a data warehouse, but I was hoping I could solve
> that issue at the state level rather than at the db level. The reason
> being that someone could send me the same events over and over again,
> and this would cause the state to blow up until out of memory. Secondly,
> when doing aggregations per sensor, results will be wrong due to
> multiple events with the same timestamp.
>
> thx
>
>
> 2017-02-24 17:47 GMT+01:00 Robert Metzger <rmetz...@apache.org>:
>
> Hi,
> sounds like a cool project.
>
> What's the size of one data point?
> If one data point is 2 kb, you'll have 100 800 000 * 2048 bytes = 206
> gigabytes of state. That's something one or two machines (depending on
> the disk throughput) should be able to handle.
>
> If possible, I would recommend you to do an experiment using a prototype
> to see how many machines you need for your workload.
>
> On Fri, Feb 24, 2017 at 5:41 PM, Tzu-Li (Gordon) Tai <tzuli...@apache.org>
> wrote:
>
> Hi Patrick,
>
> Thanks a lot for the feedback on your use case! At a first glance, I
> would say that Flink can definitely solve the issues you are evaluating.
>
> I'll try to explain them, and point you to some docs / articles that
> explain things in more detail:
>
> *- Lateness*
>
> The 7-day lateness shouldn't be a problem. We definitely recommend
> using RocksDB as the state backend for such a use case since, as you
> mentioned correctly, the state would be kept for a long time.
> The heavy burst when your locally buffered machine data is sent to
> Kafka once the machines come back online shouldn't be a problem either;
> since Flink is a pure data streaming engine, it handles backpressure
> naturally without any additional mechanisms (I would recommend
> taking a look at http://data-artisans.com/how-flink-handles-backpressure/
> ).
>
> *- Out of order*
>
> That's exactly what event time processing is for :-) As long as an event
> comes in before the allowed lateness for windows expires, the event will
> still fall into its corresponding event time window. So, even with the
> heavy burst of your late machine data, the events will still be
> aggregated in the correct windows.
> You can look into event time in Flink in more detail in the event time
> docs:
> https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/event_time.html
>
> *- Last write wins*
>
> Your operators that do the aggregations simply need to be able to
> reprocess results if they see an event with the same id come in. Now, if
> results are sent out of Flink and stored in an external db, and you can
> design the db writes to be idempotent, then it'll effectively be a "last
> write wins". It depends mostly on your pipeline and use case.
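> To illustrate (just a sketch, not your actual pipeline: the "value"
> field is hypothetical, and keying by (sensorId, eventTime) is only one
> way to do it), a last-write-wins dedup inside Flink could look like:
>
>     import org.apache.flink.streaming.api.scala._
>     import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
>     import org.apache.flink.streaming.api.windowing.time.Time
>
>     case class MachineData(sensorId: String, eventTime: Long, value: Double)
>
>     // Duplicates share (sensorId, eventTime), so they land on the same
>     // key; keeping only the latest copy means the last write wins. The
>     // window state is dropped once the watermark passes the end of the
>     // window plus the allowed lateness, so it does not grow forever.
>     def lastWriteWins(stream: DataStream[MachineData]): DataStream[MachineData] =
>       stream
>         .keyBy(e => (e.sensorId, e.eventTime))
>         .window(TumblingEventTimeWindows.of(Time.minutes(1)))
>         .allowedLateness(Time.days(7))
>         .reduce((_, latest) => latest)
>
> Note that with allowed lateness, a late duplicate triggers an updated
> firing of the window, so the downstream write should still be idempotent.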
>
> *- Computations per minute*
>
> I think you can simply do this by having two separate window operators:
> one that works on your longer window, and another on a per-minute basis.
>
> Hope this helps!
>
> - Gordon
>
> On February 24, 2017 at 10:49:14 PM, Patrick Brunmayr (j...@kpibench.com)
> wrote:
>
> Hello
>
> I've done my first steps with Flink and I am very impressed by its
> capabilities. Thank you for that :) I want to use it for a project we
> are currently working on. After reading some documentation, I am not
> sure if it's the right tool for the job. We have an IoT application in
> which we are monitoring machines in production plants. The machines have
> sensors attached, and they are sending their data to a broker (Kafka,
> Azure IoT Hub), currently on a per-minute basis.
>
> The following requirements must be fulfilled:
>
> - Lateness
>
> We have to allow lateness of 7 days because machines can have down time
> due to network issues, maintenance or something else. If that's the
> case, buffering of data happens locally on the machine, and once the
> machines are online again all data will be sent to the broker. This can
> result in some really heavy bursts.
>
> - Out of order
>
> Events come out of order due to these lateness issues.
>
> - Last write wins
>
> Machines are not stateful and cannot guarantee exactly-once sending of
> their data. It can happen that sometimes events are sent twice. In that
> case the last event wins and should override the previous one. Events
> are unique due to a sensor_id and a timestamp.
>
> - Computations per minute
>
> We cannot wait until the window ends and have to do computations on a
> per-minute basis. For example, aggregating data per sensor and writing
> it to a db.
>
> My biggest concern in that case is the huge lateness. Keeping data for
> 7 days would result in 10080 data points for just one sensor! Multiplying
> that by 10,000 sensors would result in 100,800,000 data points which
> Flink would have to handle in its state. The number of sensors is
> constantly growing, and so will the number of data points.
>
> So my questions are:
>
> - Is Flink the right tool for the job?
> - Is that lateness an issue?
> - How can I implement the last write wins?
> - How to tune Flink to handle that growing load of sensors and data
> points?
> - Hardware requirements, storage and memory size?
>
> I don't want to maintain two code bases for batch and streaming because
> the operations are all equal. The only difference is the time range!
> That's the reason I wanted to do all this with Flink Streaming.
>
> Hope you can guide me in the right direction
>
> Thx
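For the "two separate window operators" idea from Gordon's reply, a rough
sketch (same hypothetical "value" field as above; the window sizes are
examples only):

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
    import org.apache.flink.streaming.api.windowing.time.Time

    case class MachineData(sensorId: String, eventTime: Long, value: Double)

    def sum(a: MachineData, b: MachineData): MachineData =
      MachineData(a.sensorId, math.max(a.eventTime, b.eventTime), a.value + b.value)

    def aggregates(stream: DataStream[MachineData]): Unit = {
      val keyed = stream.keyBy(_.sensorId)

      // Fast path: per-minute results, e.g. for continuously writing to the db.
      val perMinute = keyed
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        .allowedLateness(Time.days(7))
        .reduce(sum _)

      // Slow path: the same aggregation over the longer window.
      val perWeek = keyed
        .window(TumblingEventTimeWindows.of(Time.days(7)))
        .allowedLateness(Time.days(7))
        .reduce(sum _)

      // Both results would then go to sinks, e.g. Cassandra.
      perMinute.print()
      perWeek.print()
    }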