Hi Jiang,
Please refer to http://sortbenchmark.org/.
When you take a look at the specification of each node Spark team uses,
you can easily realize that # of nodes is not the only thing to take into
consideration.
You miss important things to consider for a fair comparison.
(1) # of disks in each
Hi Kim and Stephan
Kim's report is sorting 3360GB per 1427 seconds by Flink 0.9.0. 3360 =
80*42 ((80GB/per node and 42 nodes)
Based on Kim's report. The TPS is 2.35GB/sec for Flink 0.9.0
Kim was using 42 nodes for testing purposes. I found that the best Spark
performance result was using
Hi Dongwon Kim!
Thank you for trying out these changes.
The OptimizedText can be sorted more efficiently, because it generates a
binary key prefix. That way, the sorting needs to serialize/deserialize
less and saves on CPU.
In parts of the program, the CPU is then less of a bottleneck and the di
Hi Dongwon Kim,
this blog post describes Flink's memory management, serialization, and sort
algorithm and also includes performance numbers of some microbenchmarks.
-->
http://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html
The difference between Text and OptimizedText, is tha
Flavio,
In general, String works well in Flink, because it behaves for sorting much
like this OptimizedText.
If you want to access the String contents, then using String is good. Text
may have slight advantages if you never access the actual contents, but
just partition and sort it (as in TeraSor
Hi Stephan,
if I understood correctly you are substituting the Text key with a more
efficient version (OptimizedText).
Just one question: you set max lenght of the key to 10..you know that a
priori?
This implementation of the key is much more efficient that just using
String?
Is there any compariso
Hello Dongwon Kim!
Thanks you for sharing these numbers with us.
I have gone through your implementation and there are two things you could
try:
1)
I see that you sort Hadoop's Text data type with Flink. I think this may be
less efficient than if you sort String, or a Flink specific data type.