Hi, I am looking for a suitable topic for my Master's degree project (something along the lines of scalability problems and improvements for Spark Streaming). It seems that introducing a grouped RDD (for example, instead of storing "Spark", "Spark", "Spark", store ("Spark", 3)) could:
1. Reduce the memory needed for an RDD (roughly, the memory used would be proportional to the percentage of unique messages).
2. Improve performance (no need to apply a function several times to the same message).

Can I create a ticket and introduce an API for grouped RDDs? Does this make sense? I would also very much appreciate any criticism and ideas.
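
To make the idea more concrete, here is a minimal sketch in plain Python. It is not Spark code and not a proposed API; the names `group_messages` and `map_grouped` are hypothetical and only illustrate storing each distinct message once with a count and applying a function once per distinct message.

```python
from collections import Counter

def group_messages(messages):
    """Collapse duplicate messages into (message, count) pairs.

    Hypothetical helper: stands in for building a "grouped" collection.
    """
    return list(Counter(messages).items())

def map_grouped(grouped, fn):
    """Apply fn once per distinct message and carry the count along.

    Hypothetical helper: fn runs len(grouped) times, not once per
    original message.
    """
    return [(fn(msg), count) for msg, count in grouped]

messages = ["Spark", "Spark", "Spark", "Storm"]

# Four messages are stored as two (message, count) pairs.
grouped = group_messages(messages)

# str.upper is called twice here instead of four times.
upper = map_grouped(grouped, str.upper)
```

One caveat with this sketch: if `fn` maps two distinct messages to the same value, the output contains duplicate keys and the pairs would have to be merged again to stay grouped.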