Re: Flink the right tool for the job ? Huge Data window lateness

Timur Shenkao Sun, 26 Feb 2017 05:16:42 -0800

Hi,

100 million rows is small load, especially for 1 week.
I suspect that your load would be quite evenly distributed during the day
as it's plant not humans.
If you look for reliability, make 2 Kafka servers at least where each topic
has 6 partitions. And separate Hadoop cluster for Flink.


As for duplicate messages, it's not a problem of Flink or Cassandra. It's a
logical problem, i.e. it's up to you how to achieve exactly once semantics.
I advise you to use some storage anyway for reliability and failover.

Sincerely yours, Timur Shenkao

On Friday, February 24, 2017, Patrick Brunmayr <j...@kpibench.com> wrote:

> Hi
>
> Yes it is and would be nice to handle this with Flink :)
>
> -
> *Size of data point*
> The size of a data point is basically just a simple case class with two
> fields in a 64bit OS
>
> case class MachineData(sensorId: String, eventTime:Long)
>
> *- Last write wins*
>
> We have cassandra as data warehouse but i was hoping i could solve that
> issue in the state level rather than in the db level. The reason beeing is
> one could send me the same events
> over and over again and this will cause that state to blow up until out of
> memory. Secondly by doing aggregations per sensor results will be wrong due
> multiple events with the same
> timestamp.
>
> thx
>
>
>
>
>
> 2017-02-24 17:47 GMT+01:00 Robert Metzger <rmetz...@apache.org
> <javascript:_e(%7B%7D,'cvml','rmetz...@apache.org');>>:
>
>> Hi,
>> sounds like a cool project.
>>
>> What's the size of one data point?
>> If one datapoint is 2 kb, you'll have 100 800 000 * 2048 bytes = 206
>> gigabytes of state. That's something one or two machines (depending on the
>> disk throughput) should be able to handle.
>>
>> If possible, I would recommend you to do an experiment using a prototype
>> to see how many machines you need for your workload.
>>
>> On Fri, Feb 24, 2017 at 5:41 PM, Tzu-Li (Gordon) Tai <tzuli...@apache.org
>> <javascript:_e(%7B%7D,'cvml','tzuli...@apache.org');>> wrote:
>>
>>> Hi Patrick,
>>>
>>> Thanks a lot for feedback on your use case! At a first glance, I would
>>> say that Flink can definitely solve the issues you are evaluating.
>>>
>>> I’ll try to explain them, and point you to some docs / articles that can
>>> further explain in detail:
>>>
>>> *- Lateness*
>>>
>>> The 7-day lateness shouldn’t be a problem. We definitely recommend
>>> using RocksDB as the state backend for such a use case, as you
>>> mentioned correctly, the state would be kept for a long time.
>>> The heavy burst when your locally buffered data on machines are
>>> sent to Kafka once they come back online shouldn’t be a problem either;
>>> since Flink is a pure data streaming engine, it handles backpressure
>>> naturally without any additional mechanisms (I would recommend
>>> taking a look at http://data-artisans.com/how-f
>>> link-handles-backpressure/).
>>>
>>> *- Out of Order*
>>>
>>> That’s exactly what event time processing is for :-) As long as the event
>>> comes in before the allowed lateness for windows, the event will still
>>> fall
>>> into its corresponding event time window. So, even with the heavy burst
>>> of
>>> the your late machine data, they will still be aggregated in the correct
>>> windows.
>>> You can look into event time in Flink with more detail in the event time
>>> docs:
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.3/
>>> dev/event_time.html
>>>
>>> *- Last write wins*
>>>
>>> Your operators that does the aggregations simply need to be able to
>>> reprocess
>>> results if it sees an event with the same id come in. Now, if results
>>> are sent out
>>> of Flink and stored in an external db, if you can design the db writes
>>> to be idempotent,
>>> then it’ll effectively be a “last write wins”. It depends mostly on your
>>> pipeline and
>>> use case.
>>>
>>> *- Computations per minute*
>>> I think you can simply do this by having two separate window operators.
>>> One that works on your longer window, and another on a per-minute basis.
>>>
>>> Hope this helps!
>>>
>>> - Gordon
>>>
>>> On February 24, 2017 at 10:49:14 PM, Patrick Brunmayr (j...@kpibench.com
>>> <javascript:_e(%7B%7D,'cvml','j...@kpibench.com');>) wrote:
>>>
>>> Hello
>>>
>>> I've done my first steps with Flink and i am very impressed of its
>>> capabilities. Thank you for that :) I want to use it for a project we are
>>> currently working on. After reading some documentation
>>> i am not sure if it's the right tool for the job. We have an IoT
>>> application in which we are monitoring machines in production plants. The
>>> machines have sensors attached and they are sending
>>> their data to a broker ( Kafka, Azure Iot Hub ) currently on a per
>>> minute basis.
>>>
>>> Following requirements must be fulfilled
>>>
>>>
>>>    - Lateness
>>>
>>>    We have to allow lateness for 7 days because machines can have down
>>>    time due network issues, maintenance or something else. If thats the case
>>>    buffering of data happens localy on the machine and once they
>>>    are online again all data will be sent to the broker. This can
>>>    result in some relly heavy burst.
>>>
>>>
>>>    - Out of order
>>>
>>>    Events come out of order due this lateness issues
>>>
>>>
>>>    - Last write wins
>>>
>>>    Machines are not stateful and can not guarantee exactly once sending
>>>    of their data. It can happen that sometimes events are sent twice. In 
>>> that
>>>    case the last event wins and should override the previous one.
>>>    Events are unique due a sensor_id and a timestamp
>>>
>>>    - Computations per minute
>>>
>>>    We can not wait until the windows ends and have to do computations
>>>    on a per minute basis. For example aggregating data per sensor and 
>>> writing
>>>    it to a db
>>>
>>>
>>> My biggest concern in that case is the huge lateness. Keeping data for 7
>>> days would result in 10080 data points for just one sensor! Multiplying
>>> that by 10.000 sensors would result in 100800000 datapoints which Flink
>>> would have to handle in its state. The number of sensors are constantly
>>> growing so will the number of data points
>>>
>>> So my questions are
>>>
>>>
>>>    - Is Flink the right tool for the Job ?
>>>
>>>    - Is that lateness an issue ?
>>>
>>>    - How can i implement the Last write wins ?
>>>
>>>    - How to tune flink to handle that growing load of sensors and data
>>>    points ?
>>>
>>>    - Hardware requirements, storage and memory size ?
>>>
>>>
>>>
>>> I don't want to maintain two code base for batch and streaming because
>>> the operations are all equal. The only difference is the time range! Thats
>>> the reason i wanted to do all this with Flink Streaming.
>>>
>>> Hope you can guide me in the right direction
>>>
>>> Thx
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>

Re: Flink the right tool for the job ? Huge Data window lateness

Reply via email to