Re: TeraSort on Flink and Spark

2015-07-12 Thread Dongwon Kim
Hi Jiang, Please refer to http://sortbenchmark.org/. When you take a look at the specification of each node Spark team uses, you can easily realize that # of nodes is not the only thing to take into consideration. You miss important things to consider for a fair comparison. (1) # of disks in each

Re: TeraSort on Flink and Spark

2015-07-12 Thread Hawin Jiang
Hi Kim and Stephan Kim's report is sorting 3360GB per 1427 seconds by Flink 0.9.0. 3360 = 80*42 ((80GB/per node and 42 nodes) Based on Kim's report. The TPS is 2.35GB/sec for Flink 0.9.0 Kim was using 42 nodes for testing purposes. I found that the best Spark performance result was using

Re: TeraSort on Flink and Spark

2015-07-10 Thread Stephan Ewen
Hi Dongwon Kim! Thank you for trying out these changes. The OptimizedText can be sorted more efficiently, because it generates a binary key prefix. That way, the sorting needs to serialize/deserialize less and saves on CPU. In parts of the program, the CPU is then less of a bottleneck and the di

Re: TeraSort on Flink and Spark

2015-07-10 Thread Fabian Hueske
Hi Dongwon Kim, this blog post describes Flink's memory management, serialization, and sort algorithm and also includes performance numbers of some microbenchmarks. --> http://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html The difference between Text and OptimizedText, is tha

Re: TeraSort on Flink and Spark

2015-07-03 Thread Stephan Ewen
Flavio, In general, String works well in Flink, because it behaves for sorting much like this OptimizedText. If you want to access the String contents, then using String is good. Text may have slight advantages if you never access the actual contents, but just partition and sort it (as in TeraSor

Re: TeraSort on Flink and Spark

2015-07-02 Thread Flavio Pompermaier
Hi Stephan, if I understood correctly you are substituting the Text key with a more efficient version (OptimizedText). Just one question: you set max lenght of the key to 10..you know that a priori? This implementation of the key is much more efficient that just using String? Is there any compariso

Re: TeraSort on Flink and Spark

2015-07-02 Thread Stephan Ewen
Hello Dongwon Kim! Thanks you for sharing these numbers with us. I have gone through your implementation and there are two things you could try: 1) I see that you sort Hadoop's Text data type with Flink. I think this may be less efficient than if you sort String, or a Flink specific data type.