Hey Kevin,

Your questions regarding the future of this space are so difficult to
answer that investors are betting millions of dollars on many possible
outcomes. You should consult them. :)

When you are choosing a system, your best bet is to look at its current
capabilities, its roadmap, and its community.

As far as the TeraGen/TeraSort/TeraValidate benchmarks go, a look at [1]
might help.

[1]
http://eastcirclek.blogspot.hu/2015/06/terasort-for-spark-and-flink-with-range.html

Marton

On Fri, Jul 8, 2016 at 3:33 PM, Kevin Jacobs <kevin.jac...@cern.ch> wrote:

> Hi Marton,
>
> Thank you for your elaborate answer. I will comment in your e-mail below:
>
> On 08.07.2016 15:13, Márton Balassi wrote:
>
>> Hi Kevin,
>>
>> Thanks for being willing to contribute such an effort. I think it is a
>> completely valid discussion to ask in your organization and please feel
>> free to ask us questions during your evaluation. Putting statements on the
>> Flink website highlighting the differences would be very tricky though. I
>> would advise against that. Let me elaborate on that.
>>
>
> Thank you, I will definitely ask questions during the evaluation, next
> week we will be setting up some experiments.
>
>
>> The "How does it compare to Spark?" is definitely one of the most
>> frequently asked questions that we get and we can generally give three
>> types of answers:
>>
>> *1. General architecture decisions*
>>
>>     - Streaming (pipelined) execution engine (or long-running operator
>>     model).
>>     - Native iteration operator.
>>     - ...
>>
>> The issue with this approach is that in itself it gives almost no useful
>> information to a decision maker. For that you need benchmarks or fancy
>> features, so let us evaluate them.
>>
>
> That is definitely true, but don't you think that Flink and Spark will
> "collapse" at some point in time? The differences between the two
> frameworks are getting smaller and smaller, Spark also has support for
> streaming. Or will the difference in the architecture be key in
> differentiating the two frameworks?
>
>
>> *2. Benchmarks*
>> You can find plenty of third-party benchmarks and soft evaluations [1,2,3]
>> of the two systems out there. The problem with these is that they depend
>> heavily on the versions of the systems used, on tuning, and on how well the
>> general architecture is understood. E.g. [1] favors Storm, but if you re-do
>> the whole benchmark from a Flink point of view you get [4]. After a couple
>> of versions the benchmark results can be very different.
>>
>> *3. Fancy Features*
>>
>>     - Exactly once spillable streaming state stored locally
>>     - Savepoints
>>     - ...
>>
>> Similarly to the previous point, these might be an edge at some point in
>> time, but the whole streaming space is moving very quickly and, as it is
>> open source, projects tend to copy each other to a certain extent.
>>
>
> Why is this space moving so quickly? Is it due to the new technologies
> that arise for processing streaming data? Would that not converge to only a
> handful of stable frameworks in the future (just speculating)?
>
>
>> This of course does not mean that doing evaluations at any point in time
>> is meaningless, but you need to update them frequently (check [5] and [6])
>> and they can do more harm than good if not treated with care.
>>
>
> It would be great if there were evaluation methods that are reusable, so
> this process does not have to be repeated every time. Unfortunately, there
> always is a difference with previous frameworks, so that implies that
> custom-made evaluations have to be made for every new framework. I like the
> TeraGen/TeraSort/TeraValidate benchmark; that is at least a general
> benchmark approach to some extent.
>
>
>> I hope I was not too discouraging and could help you with your endeavor.
>> It is also very important to take your specific use cases into account.
>>
>
> It is definitely not discouraging, thank you for the answer :-)!
>
>
>
>> Best,
>>
>> Marton
>>
>> [1] https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
>> [2] https://tech.zalando.de/blog/apache-showdown-flink-vs.-spark/
>> [3] http://data-artisans.com/how-we-selected-apache-flink-at-otto-group/
>> [4] http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
>> [5] http://www.slideshare.net/GyulaFra/largescale-stream-processing-in-the-hadoop-ecosystem
>> [6] http://www.slideshare.net/GyulaFra/largescale-stream-processing-in-the-hadoop-ecosystem-hadoop-summit-2016-60887821
>>
>> On Fri, Jul 8, 2016 at 2:23 PM, Kevin Jacobs <kevin.jac...@cern.ch> wrote:
>>
>>> Hi,
>>>
>>> I am currently working for an organization which is using Apache
>>> Spark as its main data processing framework. Now the organization is
>>> wondering whether Apache Flink is better at processing their data than
>>> Apache Spark. Therefore, I am evaluating Apache Flink and comparing it
>>> to Apache Spark.
>>>
>>> When I looked at Apache Flink for the first time, I could not find any
>>> comparison to Apache Spark on Flink's website. Would it be an idea to
>>> give some information about the differences between the two frameworks
>>> on the website? I would like to contribute to that if you think that
>>> would be helpful.
>>>
>>> Regards,
>>> Kevin
>>>
>>>
>
