Hey all,

Was messing around with Spark and Google FlatBuffers for fun, and it got me thinking about Spark and serialization. I know there's been work on / talk about in-memory columnar formats for Spark SQL, so maybe there are already ways to get this flexibility that I've missed? Either way, my thoughts:
Java and Kryo serialization are really nice in that they require almost no extra work on the part of the user. They can also represent complex object graphs with cycles, etc. But there are situations where other serialization frameworks are more efficient:

* A Hadoop Writable-style format that delineates key-value boundaries and allows raw byte comparisons can greatly speed up some shuffle operations by avoiding deserialization entirely until the object hits user code. Writables also likely serialize / deserialize faster than Kryo.

* "No-deserialization" formats like FlatBuffers and Cap'n Proto address the tradeoff between (1) Java objects, which offer fast access but take up lots of space and stress the GC, and (2) Kryo-serialized buffers, which are more compact but take time to deserialize.

The drawbacks of these frameworks are that they require more work from the user to define types, and that they're more restrictive in the reference graphs they can represent. In a large application, there are probably only a few points where a "specialized" serialization format is useful, but requiring Writables everywhere because they're needed in one particularly intense shuffle is cumbersome.

In light of that, would it make sense to allow varying serializers within an app? It could make sense to choose a serialization framework based both on the objects being serialized and on what they're being serialized for (caching vs. shuffle). It might be possible to implement this underneath the Serializer interface with some sort of multiplexing serializer that chooses between sub-serializers; a rough sketch of that idea is appended below.

Nothing urgent here, but curious to hear others' opinions.

-Sandy
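
To make the multiplexing part a bit more concrete, here's a very rough sketch of the dispatch I have in mind. SubSerializer and MultiplexingSerializer are made-up names for illustration, not Spark's actual Serializer / SerializerInstance API (a real implementation would also need to multiplex the stream and class-loader variants the same way). The idea is just to pick a sub-serializer by class and tag the bytes so the reading side knows which one to use:

import java.nio.ByteBuffer
import scala.reflect.ClassTag

// Hypothetical stand-in for a per-format serializer; Spark's real
// Serializer / SerializerInstance API has more methods (streams,
// class loaders) that would need the same treatment.
trait SubSerializer {
  def serialize[T: ClassTag](t: T): ByteBuffer
  def deserialize[T: ClassTag](bytes: ByteBuffer): T
}

// Picks a sub-serializer by the runtime class of the object, writes a
// one-byte tag so the reading side knows which one was used, and falls
// back to a general-purpose serializer (e.g. Kryo) for everything else.
class MultiplexingSerializer(
    subSerializers: IndexedSeq[(Class[_], SubSerializer)],
    fallback: SubSerializer) {

  private def pick(cls: Class[_]): (Byte, SubSerializer) = {
    val idx = subSerializers.indexWhere { case (c, _) => c.isAssignableFrom(cls) }
    if (idx >= 0) (idx.toByte, subSerializers(idx)._2) else ((-1).toByte, fallback)
  }

  def serialize[T: ClassTag](t: T): ByteBuffer = {
    val (tag, sub) = pick(t.getClass)
    val payload = sub.serialize(t)
    val out = ByteBuffer.allocate(1 + payload.remaining())
    out.put(tag)
    out.put(payload)
    out.flip()
    out
  }

  def deserialize[T: ClassTag](bytes: ByteBuffer): T = {
    val tag = bytes.get()
    val sub = if (tag >= 0) subSerializers(tag)._2 else fallback
    sub.deserialize[T](bytes)
  }
}

The tag byte adds a small per-record overhead; an alternative would be to configure the serializer per RDD or per shuffle rather than per object, which avoids the tag but means plumbing the choice through more places.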