Re: Creating a representative streaming workload

Robert Metzger Wed, 18 Nov 2015 02:15:07 -0800

Hey Vasia,

I think a very common workload would be an event stream from web servers of
an online shop. Usually, these shops have multiple servers, so events
arrive out of order.
I think there are plenty of different use cases that you can build around
that data:
- Users perform different actions that a streaming system could track
(click-path analysis).
- Simple windowed statistics (for example, items sold in the last 10
minutes).
- Fraud detection might be another use case.
- Often, there also needs to be a sink to HDFS or another file system for a
long-term archive.
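
To make the windowed-statistics case concrete, here is a minimal sketch in plain Python (not Flink code). It counts items sold per 10-minute window and assigns each event by its own timestamp, so out-of-order arrival does not change the result. The field names `ts`, `type` and `quantity` are assumptions for illustration, and it uses tumbling windows as a simplification of "the last 10 minutes".

```python
from collections import defaultdict

WINDOW_MS = 10 * 60 * 1000  # 10-minute tumbling windows (assumed window size)


def items_sold_per_window(events, window_ms=WINDOW_MS):
    """Sum purchased quantities per tumbling event-time window.

    Each event is a dict with hypothetical fields:
      ts       - event time in milliseconds
      type     - e.g. "purchase" or "click"
      quantity - number of items in the event
    Events are bucketed by their own timestamp, so the order in which
    they arrive does not matter.
    """
    counts = defaultdict(int)
    for event in events:
        if event["type"] != "purchase":
            continue  # only purchases contribute to "items sold"
        window_start = event["ts"] - event["ts"] % window_ms
        counts[window_start] += event["quantity"]
    return dict(counts)


# Events listed in arrival order; the first one belongs to the later window.
events = [
    {"ts": 650_000, "type": "purchase", "quantity": 1},
    {"ts": 10_000, "type": "purchase", "quantity": 2},
    {"ts": 20_000, "type": "click", "quantity": 0},
    {"ts": 599_999, "type": "purchase", "quantity": 3},
]

result = items_sold_per_window(events)
```

A real streaming job would also have to decide when a window is complete (for example with watermarks); this sketch sidesteps that by processing a finite list.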


I would love to see such an event generator in Flink's contrib module. I
think that's something the entire streaming space could use.
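
As a rough idea of what such a generator could look like, here is a small Python sketch. It is not an existing Flink component; the function name, the fields and the delay model are all assumptions. It simulates several web servers whose events reach the stream with a random delay, so the output is sorted by arrival time rather than by event time.

```python
import random


def generate_events(num_events, num_servers=3, max_delay_ms=5000, seed=42):
    """Generate hypothetical online-shop events in arrival order.

    Every event carries its event time plus a random delivery delay of
    up to max_delay_ms. Sorting by arrival time means the event times
    seen by a consumer are generally not monotonic.
    """
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    actions = ["view", "add_to_cart", "purchase"]
    events = []
    event_time = 0
    for _ in range(num_events):
        event_time += rng.randint(1, 1000)  # event time only moves forward
        delay = rng.randint(0, max_delay_ms)
        events.append({
            "event_time": event_time,
            "arrival_time": event_time + delay,
            "server": rng.randrange(num_servers),
            "user": rng.randrange(100),
            "action": rng.choice(actions),
        })
    # A consumer sees events in the order they arrive, not the order they occurred.
    events.sort(key=lambda e: e["arrival_time"])
    return events


stream = generate_events(200)
```

The same stream could feed click-path analysis, windowed counts, or an archive sink, since each event keeps both timestamps.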




On Mon, Nov 16, 2015 at 8:22 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:

> All those should apply for streaming too...
>
> On Mon, Nov 16, 2015 at 11:06 AM, Vasiliki Kalavri <
> vasilikikala...@gmail.com> wrote:
>
>> Hi,
>>
>> thanks Nick and Ovidiu for the links!
>>
>> Just to clarify, we're not looking into creating a generic streaming
>> benchmark. We have quite limited time and resources for this project. What
>> we want is to decide on a set of 3-4 _common_ streaming applications. To
>> give you an idea, for the batch workload, we will pick something like a
>> grep, one relational application, a graph algorithm, and an ML algorithm.
>>
>> Cheers,
>> -Vasia.
>>
>> On 16 November 2015 at 19:25, Ovidiu-Cristian MARCU <
>> ovidiu-cristian.ma...@inria.fr> wrote:
>>
>>> Regarding Flink vs Spark / Storm you can check here:
>>> http://www.sparkbigdata.com/102-spark-blog-slim-baltagi/14-results-of-a-benchmark-between-apache-flink-and-apache-spark
>>>
>>> Best regards,
>>> Ovidiu
>>>
>>> On 16 Nov 2015, at 15:21, Vasiliki Kalavri <vasilikikala...@gmail.com>
>>> wrote:
>>>
>>> Hello squirrels,
>>>
>>> with some colleagues and students here at KTH, we have started 2
>>> projects to evaluate (1) performance and (2) behavior in the presence of
>>> memory interference in cloud environments, for Flink and other systems. We
>>> want to provide our students with a workload of representative applications
>>> for testing.
>>>
>>> While for batch applications, it is quite clear to us what classes of
>>> applications are widely used and how to create a workload of different
>>> types of applications, we are not quite sure about the streaming workload.
>>>
>>> That's why we'd like your opinions! If you're using Flink streaming in
>>> your company or your project, we'd love your input even more :-)
>>>
>>> What kind of applications would you consider as "representative" of a
>>> streaming workload? Have you run any experiments to evaluate Flink versus
>>> Spark, Storm etc.? If yes, would you mind sharing your code with us?
>>>
>>> We will of course be happy to share our results with everyone after we
>>> have completed our study.
>>>
>>> Thanks a lot!
>>> -Vasia.
>>>
>>>
>>>
>>
>
