Hi Palle,

this does indeed sound like a good use case for Flink.

Depending on the complexity of the aggregated historical views, you can
implement a Flink DataStream program that builds the views on the fly,
i.e., you do not need to periodically trigger MR/Flink/Spark batch jobs to
compute the views. Instead, you can use the concept of windows to group the
data by time (and other attributes) and compute the aggregates on the fly
while the data is arriving (how far this goes depends on the type of
aggregates).
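
To make that more concrete, here is a rough (untested) sketch of such a
windowed aggregation with the DataStream API. The comma-separated input
format, the socket source, and the print sink are just placeholders for
your Logstash/Kafka input and the HBase/MongoDB sink:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class HistoricalViewsJob {

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

    // Placeholder source; in your setup this would be the Kafka topic that
    // Logstash feeds. Lines are assumed to look like "someKey,42".
    DataStream<Tuple2<String, Long>> events = env
        .socketTextStream("localhost", 9999)
        .map(new MapFunction<String, Tuple2<String, Long>>() {
          @Override
          public Tuple2<String, Long> map(String line) {
            String[] parts = line.split(",");
            return new Tuple2<>(parts[0], Long.parseLong(parts[1]));
          }
        });

    // Group by an attribute and by time, and aggregate while data arrives.
    DataStream<Tuple2<String, Long>> hourlyTotals = events
        .keyBy(0)                    // group by the key attribute
        .timeWindow(Time.hours(1))   // group by time
        .sum(1);                     // or a custom reduce() / apply()

    // Placeholder sink; this is where the views would be written to
    // HBase or MongoDB.
    hourlyTotals.print();

    env.execute("historical-views-sketch");
  }
}

Instead of sum(), you can plug in your own ReduceFunction or WindowFunction
(via reduce() / apply()) for more complex aggregates.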

The live model can also be computed by Flink. You can access the historic
data from an external store (HBase / MongoDB) and also cache parts of it in
the Flink job to achieve lower latency. It is also possible to store the
live model in your Flink job and query it from there (see this blog post
[1], section "Winning Twitter Hack Week: Eliminating the key-value store
bottleneck"). Flink will partition the data, so it should be able to handle
the data sizes you mentioned.
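
To illustrate the state-in-the-job approach, here is a rough (untested)
sketch based on Flink's keyed state. The LogEvent and Alert types and the
threshold are made up for the example, and the lookup into the historical
views is only indicated as a comment:

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Hypothetical event and result types, only here to keep the sketch
// self-contained.
class LogEvent { public String key; public long value; }
class Alert {
  public String key;
  public long total;
  Alert(String key, long total) { this.key = key; this.total = total; }
}

public class LiveModelFunction extends RichFlatMapFunction<LogEvent, Alert> {

  // Placeholder for the "certain value" that triggers the analysis.
  private static final long THRESHOLD = 1000L;

  // Per-key slice of the live model; Flink checkpoints and restores it.
  private transient ValueState<Long> modelState;

  @Override
  public void open(Configuration parameters) {
    modelState = getRuntimeContext().getState(
        new ValueStateDescriptor<>("live-model", Long.class));
  }

  @Override
  public void flatMap(LogEvent event, Collector<Alert> out) throws Exception {
    Long current = modelState.value();
    long updated = (current == null ? 0L : current) + event.value;
    modelState.update(updated);

    if (updated > THRESHOLD) {
      // This is where you would consult the historical views (HBase /
      // MongoDB lookup, possibly cached locally) before emitting a result.
      out.collect(new Alert(event.key, updated));
    }
  }
}

The function runs on a keyed stream, i.e., after a keyBy() on the event
key, so the state (and with it a structure like your 5-10 collections with
~5 million objects) is partitioned across the parallel instances.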

Best, Fabian

[1] http://data-artisans.com/extending-the-yahoo-streaming-benchmark/

2016-05-06 13:40 GMT+02:00 Deepak Sharma <deepakmc...@gmail.com>:

> I see the flow as below:
> Logstash -> Log Stream -> Flink -> Kafka -> Live Model
>                                                 |
>                                            Mongo/HBase
>
> The Live Model will again be Flink streaming data sets from Kafka.
> There you analyze the incoming stream for the certain value, and once you
> find this certain value, read the historical view and then do the analysis
> in Flink itself.
> For your Java objects, I guess you can use the Checkpointed interface (I
> have not used it yet, though).
>
> Thanks
> Deepak
>
>
> On Fri, May 6, 2016 at 4:22 PM, <pa...@sport.dk> wrote:
>
>> Hi there.
>>
>> We are putting together some Big Data components for handling a large
>> amount of incoming data from different log files and performing some
>> analysis on the data.
>>
>> All data being fed into the system will go into HDFS. We plan on using
>> Logstash, Kafka and Flink to bring the data from the log files into HDFS.
>> All the data located in HDFS we will designate as our historic data, and
>> we will use batch jobs (probably Flink, but it could also be Hadoop
>> MapReduce) to create some aggregate views of the historic data. These
>> views we will probably store in HBase or MongoDB.
>>
>> These views of the historic data (also called batch views in the Lambda
>> Architecture, if any of you are familiar with that) will be used by the
>> live model in the system. The live model is also fed with the same data
>> (through Kafka), and when it detects a certain value in the incoming data,
>> it will perform some analysis using the views of the historic data in
>> HBase/MongoDB.
>>
>> Now, could anyone share some knowledge regarding where it would be
>> possible to implement such a live model, given the components we plan on
>> using? Apart from the business logic that will perform the analysis, our
>> live model will at all times also contain a Java object structure of maybe
>> 5-10 Java collections (maps, lists) holding approx. 5 million objects.
>>
>> So, where is it possible to implement our live model? Can we do this in
>> Flink? Can we do this with another component within the Hadoop Big Data
>> ecosystem?
>>
>> Thanks.
>>
>> /Palle
>>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>
