Hi Everybody, I'm a soon-to-graduate student of computer science at the University of Wrocław in Poland. I'm currently starting to write my master's thesis and I'm looking for some inspiration and ideas.
First of all, I want to write about MapReduce - as far as I know nobody at my faculty has taken such a topic for their thesis, but it is interesting, so someone should start. Lately I have been considering comparing plain Java MapReduce with Hive and Pig in terms of their performance, the optimizations they use internally, etc. Personally I found it a nice idea, as it would allow me to learn both frameworks and take a look at the way they work.
Unfortunately, I found out that Robert Stewart from Heriot-Watt University wrote his thesis on "Performance & Programming Comparison of JAQL, Hive, Pig and Java", which can be found via Google. I looked through this paper and it looks quite similar to what I wanted to do.
After this discovery I thought that a slightly different approach to performance comparison might prove to be a successful topic for my master's thesis: specifically, I'm thinking about comparing the frameworks on a real-life problem. In his paper Robert ran the experiments on a few quite simple problems such as word count, a simple join of two sets, or log processing. I'm thinking about, first, comparing the frameworks on a real-life problem and, second, looking for optimizations that can be made in Pig or Hive (e.g. choosing a join strategy) and how they affect the performance of the frameworks.
OK, after this long introduction I want to ask you: do you think this is an interesting approach, and does it make any sense? Is it worth trying? If so, maybe you can suggest which features of the frameworks I should look at more closely, and maybe some real-life problems that could be used in the experiments?
I look forward to any comments - thanks in advance.
P.S. I've posted this message on both frameworks' mailing lists - Hive and Pig.
Thanks!
Michal