If you look at the recurrent issues in datacentre-scale computing systems, two stand out:
- resilience to failure: that's both the algorithms and the layers underneath (storage, work allocation & tracking, ...)
- scheduling: maximising resource utilisation while still prioritising high-SLA work (interactive things; see the toy sketch below)
Both are pretty good topics for a master's project, I think.
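
To make the scheduling tension concrete, here's a toy sketch: batch and interactive tasks share a pool of workers, and the scheduler always dispatches the most urgent work first. The Task/priority structure and round-robin placement are just my assumptions to illustrate the idea, not any real scheduler's design.

import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Task:
    priority: int                  # 0 = interactive/high-SLA, 1 = batch
    name: str = field(compare=False)

def schedule(tasks, workers):
    """Drain a priority queue, always dispatching the most urgent task."""
    heapq.heapify(tasks)           # orders by priority field only
    while tasks:
        task = heapq.heappop(tasks)
        worker = workers[len(tasks) % len(workers)]  # trivial round-robin placement
        print(f"dispatch {task.name} (priority {task.priority}) -> {worker}")

schedule(
    [Task(1, "batch-etl"), Task(0, "interactive-query"), Task(1, "batch-ml")],
    workers=["w1", "w2"],
)

The interesting part for a project is everything this sketch leaves out: keeping utilisation high when the interactive queue is empty, preemption, and avoiding starvation of batch work.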
You could also look at improving Spark's scheduler throughput.
A couple of years ago Kay measured it, but things have changed since then. It would be great to start with measurement, then look at where the bottlenecks are, and see how they could be addressed.
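
For a first measurement, something like this PySpark sketch could work: flood the scheduler with no-op tasks (one per partition) so that scheduling overhead dominates the runtime, and report tasks/second. The task count and the single timed run are my own placeholder choices; a real benchmark would need warm-up runs and a controlled cluster.

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sched-throughput").getOrCreate()
sc = spark.sparkContext

NUM_TASKS = 100_000   # many tiny no-op tasks so scheduling dominates

start = time.monotonic()
# One partition per task; the per-task work is negligible, so the
# elapsed time mostly reflects scheduler overhead.
sc.parallelize(range(NUM_TASKS), numSlices=NUM_TASKS).map(lambda x: x).count()
elapsed = time.monotonic() - start

print(f"roughly {NUM_TASKS / elapsed:.0f} tasks/second through the scheduler")
spark.stop()

Once you have a stable number, profiling the driver during the run should point at the bottlenecks worth attacking.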