Hey Kevin,

Your questions regarding the future of this space are so difficult to answer that investors are betting millions of dollars on many possible outcomes. You should consult them. :)

When you are choosing a system, your best bet is to look at its current capabilities, its roadmap and its community. As far as TeraStuff goes, a look here [1] might help.

[1] http://eastcirclek.blogspot.hu/2015/06/terasort-for-spark-and-flink-with-range.html

Marton

On Fri, Jul 8, 2016 at 3:33 PM, Kevin Jacobs <kevin.jac...@cern.ch> wrote:

> Hi Marton,
>
> Thank you for your elaborate answer. I will comment in your e-mail below:
>
> On 08.07.2016 15:13, Márton Balassi wrote:
>
>> Hi Kevin,
>>
>> Thanks for being willing to contribute such an effort. I think it is a
>> completely valid discussion to have in your organization and please feel
>> free to ask us questions during your evaluation. Putting statements on the
>> Flink website highlighting the differences would be very tricky though. I
>> would advise against that. Let me elaborate on that.
>>
>
> Thank you, I will definitely ask questions during the evaluation. Next
> week we will be setting up some experiments.
>
>
>> The "How does it compare to Spark?" is definitely one of the most
>> frequently asked questions that we get and we can generally give three
>> types of answers:
>>
>> *1. General architecture decisions*
>>
>> - Streaming (pipelined) execution engine (or long running operator
>> model).
>> - Native iteration operator.
>> - ...
>>
>> The issue with this approach is that in itself it states borderline no
>> useful information for a decision maker. There you need benchmarks or
>> fancy features, so let us evaluate them.
>>
>
> That is definitely true, but don't you think that Flink and Spark will
> "collapse" at some point in time? The differences between the two
> frameworks are getting smaller and smaller, Spark also has support for
> streaming. Or will the difference in the architecture be key in
> differentiating the two frameworks?
>
>
>> *2. Benchmarks*
>> You can find plenty of third-party benchmarks and soft evaluations [1,2,3]
>> of the two systems out there. The problem with these is that they are
>> very reliant on the version of the systems used, tuning and understanding
>> the general architecture. E.g. [1] favors Storm, but if you re-do the whole
>> benchmark from a Flink point of view you get [4]. After a couple of
>> versions the benchmark results can be very different.
>>
>> *3. Fancy Features*
>>
>> - Exactly once spillable streaming state stored locally
>> - Savepoints
>> - ...
>>
>> Similarly to the previous point these might be an edge at some point in
>> time, but the whole streaming space is moving very quickly and, as it is
>> open source, projects tend to copy each other to a certain extent.
>>
>
> Why is this space moving so quickly? Is it due to the new technologies
> that arise for processing streaming data? Would that not converge to only a
> handful of stable frameworks in the future (just speculating)?
>
>
>> This of course does not mean that doing evaluations at any point in time
>> is meaningless, but you need to update them frequently (check [5] and [6])
>> and they can do more harm than good if not treated with care.
>>
>
> It would be great if there were evaluation methods that are reusable, so
> this process does not have to be repeated every time. Unfortunately, there
> always is a difference with previous frameworks, so that implies that
> custom made evaluations should be made for every new framework. I like the
> TeraGen/TeraSort/TeraValidate benchmark, that is at least a general
> benchmark approach to some extent.
>
>
>> I hope I was not too discouraging and could help you with your endeavor.
>> It is also very important to take your specific use cases into account.
>>
>
> It is definitely not discouraging, thank you for the answer :-)!
>
>
>> Best,
>>
>> Marton
>>
>> [1] https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
>> [2] https://tech.zalando.de/blog/apache-showdown-flink-vs.-spark/
>> [3] http://data-artisans.com/how-we-selected-apache-flink-at-otto-group/
>> [4] http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
>> [5] http://www.slideshare.net/GyulaFra/largescale-stream-processing-in-the-hadoop-ecosystem
>> [6] http://www.slideshare.net/GyulaFra/largescale-stream-processing-in-the-hadoop-ecosystem-hadoop-summit-2016-60887821
>>
>> On Fri, Jul 8, 2016 at 2:23 PM, Kevin Jacobs <kevin.jac...@cern.ch>
>> wrote:
>>
>>> Hi,
>>>
>>> I am currently working for an organization which is using Apache
>>> Spark as main data processing framework. Now the organization is
>>> wondering whether Apache Flink is better at processing their data than
>>> Apache Spark. Therefore, I am evaluating Apache Flink and I am comparing
>>> it to Apache Spark.
>>>
>>> When I looked at Apache Flink for the first time, I could not find any
>>> comparison to Apache Spark at Flink's website. Would it be an idea to
>>> give some information about the differences of both frameworks on the
>>> website? I would like to contribute to that if you think that would be
>>> helpful.
>>>
>>> Regards,
>>> Kevin