Technically you can already do custom serializer for each shuffle operation (it is part of the ShuffledRDD). I've seen Matei suggesting on jira issues (or github) in the past a "storage policy" in which you can specify how data should be stored. I think that would be a great API to have in the long run. Designing it won't be trivial though.
On Fri, Nov 7, 2014 at 1:05 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote: > Hey all, > > Was messing around with Spark and Google FlatBuffers for fun, and it got me > thinking about Spark and serialization. I know there's been work / talk > about in-memory columnar formats Spark SQL, so maybe there are ways to > provide this flexibility already that I've missed? Either way, my > thoughts: > > Java and Kryo serialization are really nice in that they require almost no > extra work on the part of the user. They can also represent complex object > graphs with cycles etc. > > There are situations where other serialization frameworks are more > efficient: > * A Hadoop Writable style format that delineates key-value boundaries and > allows for raw comparisons can greatly speed up some shuffle operations by > entirely avoiding deserialization until the object hits user code. > Writables also probably ser / deser faster than Kryo. > * "No-deserialization" formats like FlatBuffers and Cap'n Proto address the > tradeoff between (1) Java objects that offer fast access but take lots of > space and stress GC and (2) Kryo-serialized buffers that are more compact > but take time to deserialize. > > The drawbacks of these frameworks are that they require more work from the > user to define types. And that they're more restrictive in the reference > graphs they can represent. > > In large applications, there are probably a few points where a > "specialized" serialization format is useful. But requiring Writables > everywhere because they're needed in a particularly intense shuffle is > cumbersome. > > In light of that, would it make sense to enable varying Serializers within > an app? It could make sense to choose a serialization framework both based > on the objects being serialized and what they're being serialized for > (caching vs. shuffle). It might be possible to implement this underneath > the Serializer interface with some sort of multiplexing serializer that > chooses between subserializers. > > Nothing urgent here, but curious to hear other's opinions. > > -Sandy >