Hi,

I am looking for a suitable topic for my Master's degree project (something
along the lines of scalability problems and improvements for Spark Streaming),
and it seems that introducing a grouped RDD (for example: instead of storing
"Spark", "Spark", "Spark", store ("Spark", 3)) could:

1. Reduce the memory needed for an RDD (roughly, memory use would scale with
the percentage of unique messages).
2. Improve performance (no need to apply a function several times to the same
message).
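
To make the idea concrete, here is a rough sketch in plain Python (not Spark
code). The names group_messages and map_grouped are hypothetical and only
illustrate the (message, count) representation; they are not an existing or
proposed Spark API.

```python
from collections import Counter

def group_messages(messages):
    """Collapse duplicate messages into (message, count) pairs.

    Hypothetical helper: stands in for building a "grouped" collection.
    """
    return list(Counter(messages).items())

def map_grouped(grouped, fn):
    """Apply fn once per distinct message and carry the count along."""
    return [(fn(message), count) for message, count in grouped]

messages = ["Spark", "Spark", "Spark", "Kafka"]
grouped = group_messages(messages)        # 2 entries instead of 4
upper = map_grouped(grouped, str.upper)   # str.upper is called twice, not 4 times
```

One caveat with this sketch: if the function maps two different messages to
the same result, the counts would have to be merged again afterwards.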

Can I create a ticket and introduce an API for grouped RDDs? Does this make
sense? I would also greatly appreciate any criticism and ideas.
