Re: spark vs flink batch performance

2016-11-18 Thread Gábor Gévay
> "For CSV reading, I deliberately did not use the CSV reader since I want to run
> the same code across Spark and Flink."
>
> If your objective deviates from writing and running the fastest Spark and
> fastest Flink programs, then your comparison is worthless.

Well, I don't really agree with this. I woul

Re: spark vs flink batch performance

2016-11-18 Thread Greg Hogan
"For CSV reading, I deliberately did not use the CSV reader since I want to run the same code across Spark and Flink."

If your objective deviates from writing and running the fastest Spark and fastest Flink programs, then your comparison is worthless.

On Fri, Nov 18, 2016 at 5:37 AM, CPC wrote:
> Hi G

Re: spark vs flink batch performance

2016-11-18 Thread CPC
Thank you, Flavio. I will generate a flame graph for Flink and compare the two.

On 18 November 2016 at 13:43, Flavio Pompermaier wrote:
> I think this could be very helpful for your study:
>
> http://db-blog.web.cern.ch/blog/luca-canali/2016-09-spark-20-performance-improvements-investigated-flame-gr

Re: spark vs flink batch performance

2016-11-18 Thread Flavio Pompermaier
I think this could be very helpful for your study:

http://db-blog.web.cern.ch/blog/luca-canali/2016-09-spark-20-performance-improvements-investigated-flame-graphs

Best,
Flavio

On Fri, Nov 18, 2016 at 11:37 AM, CPC wrote:
> Hi Gabor,
>
> Thank you for your kind response. I forgot to mention tha

Re: spark vs flink batch performance

2016-11-18 Thread CPC
Hi Gabor,

Thank you for your kind response. I forgot to mention that I actually have three workers. This is why I set the default parallelism to 6. For CSV reading, I deliberately did not use the CSV reader since I want to run the same code across Spark and Flink. Collect is returning 40k records which is not

Re: spark vs flink batch performance

2016-11-18 Thread Gábor Gévay
Hello,

Your program looks mostly fine, but there are a few minor things that might help a bit:

Parallelism: In your attached flink-conf.yaml, you have 2 task slots per task manager, and if you have 1 task manager, then your total number of task slots is also 2. However, your default parallelism i
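The slot arithmetic in this message can be sketched as follows. This is only an illustration, assuming the usual Flink setup in which a job needs at least as many task slots as its parallelism; the function names are hypothetical and not part of any Flink API. The inputs correspond to the `taskmanager.numberOfTaskSlots` and `parallelism.default` settings in flink-conf.yaml.

```python
def total_task_slots(task_managers: int, slots_per_task_manager: int) -> int:
    """Total slots in the cluster: one group of slots per task manager."""
    return task_managers * slots_per_task_manager


def parallelism_fits(default_parallelism: int,
                     task_managers: int,
                     slots_per_task_manager: int) -> bool:
    """True if the cluster offers at least `default_parallelism` slots."""
    return default_parallelism <= total_task_slots(task_managers,
                                                   slots_per_task_manager)


if __name__ == "__main__":
    # 1 task manager with 2 slots, as discussed in the message.
    print(total_task_slots(1, 2))
    # A default parallelism of 6 against 1 task manager vs. 3 task managers.
    print(parallelism_fits(6, 1, 2))
    print(parallelism_fits(6, 3, 2))
```

With three workers at 2 slots each, as reported elsewhere in the thread, the total comes to 6 slots, which matches the default parallelism of 6.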

Re: spark vs flink batch performance

2016-11-17 Thread CPC
Hi all,

In the meantime I have three workers. Any thoughts about improving Flink performance?

Thank you...

On Nov 17, 2016 00:38, "CPC" wrote:
> Hi all,
>
> I am trying to compare Spark and Flink batch performance. In my test I am
> using ratings.csv in http://files.grouplens.org/datasets/

spark vs flink batch performance

2016-11-16 Thread CPC
Hi all,

I am trying to compare Spark and Flink batch performance. In my test I am using ratings.csv from the http://files.grouplens.org/datasets/movielens/ml-latest.zip dataset. I also concatenated ratings.csv 16 times to increase the dataset size (390465536 records in total, almost 10 GB). I am reading from go
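The dataset enlargement step described here can be sketched as below. The message does not say how the concatenation was actually done or whether the CSV header row was kept, so this is only one possible approach; `concat_copies` and `count_lines` are hypothetical helper names, and the sample rows are made up for the demonstration.

```python
import os
import shutil
import tempfile


def concat_copies(src_path: str, dst_path: str, copies: int) -> None:
    """Write `copies` back-to-back copies of src_path into dst_path."""
    with open(dst_path, "wb") as dst:
        for _ in range(copies):
            with open(src_path, "rb") as src:
                shutil.copyfileobj(src, dst)


def count_lines(path: str) -> int:
    """Count lines by streaming the file, without loading it into memory."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


if __name__ == "__main__":
    # Tiny stand-in file; the real input would be ratings.csv.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "sample.csv")
        dst = os.path.join(tmp, "sample_x16.csv")
        with open(src, "w") as f:
            f.write("1,10,4.0,100\n2,20,3.5,200\n")
        concat_copies(src, dst, 16)
        print(count_lines(dst))
```

The source file needs to end with a newline, otherwise the last row of one copy runs into the first row of the next. For the figures in the message, 390465536 records over 16 copies works out to 24404096 records per copy.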