Ah awesome. Passing custom serializers when persisting an RDD is exactly one of the things I was thinking of.
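
Roughly the shape I had in mind from the caller's side. To be clear, StorageStrategy
and a persist() overload that takes it are hypothetical -- this is just a sketch of
the API Matei describes below, not something that exists today:

  import org.apache.spark.serializer.{KryoSerializer, Serializer}
  import org.apache.spark.storage.StorageLevel

  // Hypothetical: neither StorageStrategy nor a persist(StorageStrategy)
  // overload exists today; sc is an existing SparkContext.
  case class StorageStrategy(
      level: StorageLevel,
      serializer: Option[Serializer] = None)

  val strategy = StorageStrategy(
    StorageLevel.MEMORY_ONLY_SER,
    serializer = Some(new KryoSerializer(sc.getConf)))

  // rdd.persist(strategy)   // the overload itself is the part that's missing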
-Sandy

On Fri, Nov 7, 2014 at 1:19 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> Yup, the JIRA for this was https://issues.apache.org/jira/browse/SPARK-540
> (one of our older JIRAs). I think it would be interesting to explore this
> further. Basically the way to add it into the API would be to add a
> version of persist() that takes another class than StorageLevel, say
> StorageStrategy, which allows specifying a custom serializer or perhaps
> even a transformation to turn each partition into another representation
> before saving it. It would also be interesting if this could work directly
> on an InputStream or ByteBuffer to deal with off-heap data.
>
> One issue we've found with our current Serializer interface, by the way,
> is that a lot of type information is lost when you pass data to it, so the
> serializers spend a fair bit of time figuring out what class each object
> written is. With this model, it would be possible for a serializer to know
> that all its data is of one type, which is pretty cool, but we might also
> consider ways of expanding the current Serializer interface to take more
> info.
>
> Matei
>
> On Nov 7, 2014, at 1:09 AM, Reynold Xin <r...@databricks.com> wrote:
>
> > Technically you can already do a custom serializer for each shuffle
> > operation (it is part of the ShuffledRDD). I've seen Matei suggest on
> > JIRA issues (or GitHub) in the past a "storage policy" in which you can
> > specify how data should be stored. I think that would be a great API to
> > have in the long run. Designing it won't be trivial though.
> >
> > On Fri, Nov 7, 2014 at 1:05 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> >
> >> Hey all,
> >>
> >> Was messing around with Spark and Google FlatBuffers for fun, and it
> >> got me thinking about Spark and serialization. I know there's been
> >> work / talk about in-memory columnar formats for Spark SQL, so maybe
> >> there are ways to provide this flexibility already that I've missed?
> >> Either way, my thoughts:
> >>
> >> Java and Kryo serialization are really nice in that they require
> >> almost no extra work on the part of the user. They can also represent
> >> complex object graphs with cycles etc.
> >>
> >> There are situations where other serialization frameworks are more
> >> efficient:
> >> * A Hadoop Writable-style format that delineates key-value boundaries
> >> and allows for raw comparisons can greatly speed up some shuffle
> >> operations by entirely avoiding deserialization until the object hits
> >> user code. Writables also probably ser / deser faster than Kryo.
> >> * "No-deserialization" formats like FlatBuffers and Cap'n Proto
> >> address the tradeoff between (1) Java objects that offer fast access
> >> but take lots of space and stress GC and (2) Kryo-serialized buffers
> >> that are more compact but take time to deserialize.
> >>
> >> The drawbacks of these frameworks are that they require more work from
> >> the user to define types, and that they're more restrictive in the
> >> reference graphs they can represent.
> >>
> >> In large applications, there are probably a few points where a
> >> "specialized" serialization format is useful. But requiring Writables
> >> everywhere because they're needed in a particularly intense shuffle is
> >> cumbersome.
> >>
> >> In light of that, would it make sense to enable varying Serializers
> >> within an app?
> >> It could make sense to choose a serialization framework both based on
> >> the objects being serialized and what they're being serialized for
> >> (caching vs. shuffle). It might be possible to implement this
> >> underneath the Serializer interface with some sort of multiplexing
> >> serializer that chooses between subserializers.
> >>
> >> Nothing urgent here, but curious to hear others' opinions.
> >>
> >> -Sandy
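
For reference, the per-shuffle hook Reynold mentions above is already reachable
through combineByKey (it sets the same serializer ShuffledRDD.setSerializer does),
so one heavy shuffle can use its own format without touching the rest of the app.
Rough sketch against the 1.x API -- assumes pairs is an existing RDD[(String, Int)]
and sc an existing SparkContext:

  import org.apache.spark.HashPartitioner
  import org.apache.spark.SparkContext._   // pair-RDD functions in 1.x
  import org.apache.spark.serializer.KryoSerializer

  // Kryo for just this one shuffle; everything else keeps the default serializer.
  val counts = pairs.combineByKey(
    (v: Int) => v,                          // createCombiner
    (c: Int, v: Int) => c + v,              // mergeValue
    (c1: Int, c2: Int) => c1 + c2,          // mergeCombiners
    new HashPartitioner(pairs.partitions.length),
    true,                                   // mapSideCombine
    new KryoSerializer(sc.getConf))         // serializer for this shuffle only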
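
And a very rough sketch of the "multiplexing serializer" idea from the end of my
original mail. This version only picks a delegate from a made-up config property,
which is about as much as the current interface lets you key off; dispatching on
the class being written, or on cache vs. shuffle, would need the kind of extra
info Matei mentions:

  import org.apache.spark.SparkConf
  import org.apache.spark.serializer.{JavaSerializer, KryoSerializer, Serializer, SerializerInstance}

  // Sketch only: the delegate choice comes from a made-up property. It would be
  // registered app-wide via spark.serializer, so the choice still can't vary per
  // operation or per class without deeper API changes.
  class MultiplexingSerializer(conf: SparkConf) extends Serializer with Serializable {
    private val delegate: Serializer =
      conf.get("spark.myapp.serializer.choice", "kryo") match {
        case "java" => new JavaSerializer(conf)
        case _      => new KryoSerializer(conf)
      }

    override def newInstance(): SerializerInstance = delegate.newInstance()
  }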